Every once in a while I read something that is so insightful, so clearly written and so well documented that it enters my own personal pantheon of “Best Ever” documents. I recently added a new, simply divine article titled Best Practices for Scientific Computing and hope that everyone reading this post also takes the time to read that article. I’m including the outline here only to encourage you to read the article in its entirety. It is extremely well written.
When k-means clustering fails
Letting the computer automatically find groupings in data is incredibly powerful and is at the heart of “data mining” and “machine learning”. One of the most widely used methods for clustering data is k-means clustering. Unfortunately, k-means clustering can fail spectacularly as in the example below.
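To make one failure mode concrete, here is a minimal sketch in plain Python. It is not the example from the post: the data (two concentric rings), the `ring` and `kmeans` helpers, and the starting centroids are all assumptions chosen for illustration. Because k-means with two centroids always splits the plane along a straight line, it cannot separate an inner ring from an outer one.

```python
import math

def ring(radius, n):
    """Return n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * (i + 0.5) / n),
             radius * math.sin(2 * math.pi * (i + 0.5) / n)) for i in range(n)]

def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(len(centroids)),
                      key=lambda k: (p[0] - centroids[k][0]) ** 2
                                    + (p[1] - centroids[k][1]) ** 2)
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        centroids = new_centroids
    return labels

# Hypothetical data: the "true" groups are the two rings.
inner = ring(1.0, 20)
outer = ring(5.0, 40)
points = inner + outer
truth = ["inner"] * len(inner) + ["outer"] * len(outer)

labels = kmeans(points, centroids=[(-1.0, 0.0), (1.0, 0.0)])

# Which rings ended up in each k-means cluster?
clusters = {k: {t for t, lab in zip(truth, labels) if lab == k} for k in (0, 1)}
for k, rings in clusters.items():
    print(k, sorted(rings))
```

Each cluster ends up holding points from both rings: the algorithm cuts the data into a left half and a right half rather than an inside and an outside.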
Optimizing Data Access – Know your Hardware
The Library of Congress has a lot of information: hundreds of millions of pages of books and manuscripts. But no one has ever suggested that we store all of that information in a single, billion-page book. Instead, individual books are stored on shelves in stacks in rooms according to an organized system. Managing large datasets is just the same: data should exist in manageably sized files stored in hierarchically organized directories. Unfortunately, many people working with large datasets try to do just the opposite. This post describes how converting thirty 200 GB files into three million 200 KB files reduced data access times from several hours to under a second.
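As a rough illustration of "manageably sized files in hierarchically organized directories", the sketch below maps one unit of data to a predictable path. The layout (year, then month, then one file per station), the `chunk_path` function and the station ID are assumptions made up for this example, not the scheme used in the post.

```python
from pathlib import PurePosixPath

def chunk_path(root, station_id, year, month):
    """Return the path of the small file holding one station-month of data.

    The year/month/station layout is a hypothetical choice for illustration.
    """
    return PurePosixPath(root) / f"{year:04d}" / f"{month:02d}" / f"{station_id}.csv"

path = chunk_path("data", "station_0042", 2012, 7)
print(path)  # data/2012/07/station_0042.csv
```

With a layout like this, a request for one station and one month opens a single small file whose location can be computed directly, instead of scanning a huge file for the matching records.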
Data Management Questionnaire
Sometimes merely filling out a questionnaire can cause you to think about problems in a new way. When asked to answer a question that has never occurred to you before, you may find yourself reevaluating some of your core assumptions, assumptions you may not have known you had. That is the power of asking questions. Our data management questionnaire poses questions in 12 categories that will help you figure out what you need, what you want, and perhaps how to get there.
Standard Country Names
What’s in a name? That which we call a rose
By any other name would smell as sweet.
Ahhh love. Juliet speaks lovely poetry but we learn, as the story unfolds, that names and the identification they impart are in fact extremely important. This is no less true in data management where country names are anything but standardized.
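One common way to tame this problem is to map every spelling you encounter onto a single standard code. The sketch below uses ISO 3166-1 alpha-2 codes; the `VARIANTS` table and the `country_code` function are hypothetical and cover only a handful of names for illustration.

```python
# Hypothetical lookup table: a few name variants mapped to ISO 3166-1 alpha-2 codes.
VARIANTS = {
    "united states": "US",
    "united states of america": "US",
    "usa": "US",
    "south korea": "KR",
    "korea, republic of": "KR",
    "republic of korea": "KR",
    "united kingdom": "GB",
    "uk": "GB",
}

def country_code(name):
    """Return the code for a known variant, or None if it is not in the table."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(name.lower().split())
    return VARIANTS.get(key)

print(country_code("USA"))                   # US
print(country_code("  Korea,  Republic of")) # KR
print(country_code("Atlantis"))              # None
```

Returning `None` for unknown names, rather than guessing, makes the gaps in the table visible so they can be filled in deliberately.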
Qualitative Display of Air Quality Data
Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
Edward Tufte, The Visual Display of Quantitative Information
This post briefly summarizes our thoughts on best practices for designing public-facing data graphics for air quality data. The focus is on the types of charts we feel are appropriate to use with data (e.g. from low-cost sensors) that may not be as accurate as data collected by monitors using Federal Reference or Federal Equivalent Methods (see FRMs/FEMs and Sensors). Visualization types discussed will include:
- time-series charts
- status and forecast tables
Cross-origin requests with beakr
beakr is a lightweight and flexible web framework that allows you to incorporate R code as the Middleware responsible for handling web requests. At Mazama Science, we developed beakr to simplify the process of creating R-based web services that we use to deliver a variety of products: data files, images, rendered Rmarkdown documents, etc.
Web Frameworks for R – A Brief Overview
Having recently announced the beakr web framework for R, we have received several questions about context and why we choose beakr over other options for some of our web services. This post will attempt to answer some of those questions by providing a few opinions on beakr and other web frameworks for R.
The comparison will by no means be exhaustive but will attempt to briefly summarize some of the key features each web framework has to offer. While there are some differences in the approach each package takes to developing web services, they all share similar basic functionality. In the end, the choice of a particular framework will come down largely to personal preference.
One of the big jokes among people who manage scientific datasets goes like this:
The great thing about standards is … there are so many to choose from!
While this one-liner may never make it to late-night TV, there is much truth to it. Many “standards” exist, and many more are invented each month to accommodate the special needs of new types of data or new software for processing data.
One standard, however, stands far above other options and should always be adopted: ISO 8601, the international standard for representing dates and times.
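For readers who have not worked with ISO 8601 before, here is a small sketch using only Python's standard `datetime` module. The timestamps are made up for illustration. It shows the basic shape of an ISO 8601 string, a round trip back to a `datetime`, and one practical benefit: strings that share the same UTC offset sort chronologically as plain text.

```python
from datetime import datetime, timezone

# A fixed, made-up timestamp used only for illustration.
moment = datetime(2013, 2, 1, 14, 30, 0, tzinfo=timezone.utc)

# isoformat() emits an ISO 8601 string: date, 'T', time, UTC offset.
stamp = moment.isoformat()
print(stamp)  # 2013-02-01T14:30:00+00:00

# fromisoformat() parses that string back into an equal datetime.
parsed = datetime.fromisoformat(stamp)
print(parsed == moment)  # True

# With a shared offset, text order is the same as time order.
stamps = [
    "2013-02-01T14:30:00+00:00",
    "2012-12-31T23:59:59+00:00",
    "2013-01-15T08:00:00+00:00",
]
chronological = sorted(stamps)
print(chronological[0])  # 2012-12-31T23:59:59+00:00
```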
The world of scientific data management, analysis, visualization and public access is changing so rapidly it can be difficult to keep up with developments even in one’s own field. Staying abreast of progress in all areas of science, let alone business, is an impossibility.
Then there are big picture questions about how the whole scientific endeavor is changing:
- Does the long tradition of intellectual property rights with respect to science data apply in today’s cut-and-paste world?
- How can scientists and policymakers use on-line tools to collaborate across the vast divide that separates them?
- What role does the interested, intelligent layman play in the dissemination and analysis of scientific data?
- How can better delivery of data and analysis products improve the utility of publicly sponsored, publicly owned data?
These are the types of questions that occupy us every day at Mazama Science. We spend a tremendous amount of time thinking about them ourselves and seeking answers from others in our broad community of contacts.
In this blog we aim to distill some of that group knowledge in the hope that it may be useful or inspiring to those attempting to do similar work: supporting a data-focused, scientific approach to society’s pressing issues.