Uncategorized – Page 2 – Working With Data

Data Structures – Tabular vs. Relational

With enough effort it is possible to fit a square peg into a round hole. But we have all learned — sometimes more than once — that it is much easier if peg and hole have the same shape.

Logging and error handling in operational systems

Operational systems, by definition, need to work without human input. Systems are considered “operational” after they have been thoroughly tested and shown to work properly with a variety of inputs.

However, no software is perfect and no real-world system operates with 100% availability or 100% consistent input. Things occasionally go wrong – perhaps intermittently. In a situation with occasional failures it is vitally important to have good logging and error handling. The MazamaCoreUtils R package helps with these tasks.
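As a rough illustration of the pattern, here is a minimal sketch in Python rather than R. It does not use or imitate the MazamaCoreUtils API; the function names (`make_logger`, `parse_reading`, `process_batch`) and the sample inputs are hypothetical, chosen only to show bad input being logged and skipped instead of stopping the run.

```python
import io
import logging

def make_logger(stream):
    """Return a logger that writes 'LEVEL message' records to `stream`."""
    logger = logging.getLogger("operational_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()          # avoid duplicate handlers on re-runs
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False         # keep records out of the root logger
    return logger

def parse_reading(raw, logger):
    """Convert one raw input string to a float, or None if it is malformed."""
    try:
        value = float(raw)
    except ValueError:
        # Record the problem and carry on instead of crashing the whole job.
        logger.warning("skipping malformed reading: %r", raw)
        return None
    logger.debug("parsed reading: %s", value)
    return value

def process_batch(raw_values, logger):
    """Parse every reading, keep the good ones, and log a summary."""
    parsed = [parse_reading(raw, logger) for raw in raw_values]
    good = [value for value in parsed if value is not None]
    logger.info("processed %d of %d readings", len(good), len(raw_values))
    return good

# An in-memory stream stands in for a log file in this sketch.
log_stream = io.StringIO()
logger = make_logger(log_stream)
readings = process_batch(["12.5", "n/a", "7"], logger)
print(log_stream.getvalue())
```

In a real operational system the handler would typically write to a file, so that intermittent failures can be investigated after the fact.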

Data volumes

Despite what they say, size does matter.

Successful data management is all about finding the proper tools and formats for dealing with your data. There is no one-size-fits-all solution. The very first question you should be asking yourself is: “How much data are we talking about?”

Best Best Practices Ever!

Every once in a while I read something that is so insightful, so clearly written and so well documented that it enters my own personal pantheon of “Best Ever” documents. I recently added a new, simply divine article titled Best Practices for Scientific Computing and hope that everyone reading this post also takes the time to read that article. I’m including the outline here only to encourage you to read the article in its entirety. It is extremely well written.

When k-means clustering fails

Letting the computer automatically find groupings in data is incredibly powerful and is at the heart of “data mining” and “machine learning”. One of the most widely used methods for clustering data is k-means clustering. Unfortunately, k-means clustering can fail spectacularly, as in the example below.
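One well-known failure mode can be sketched in a few lines of plain Python. This is not the example from the full post; it assumes a made-up dataset of two concentric rings and a bare-bones implementation of Lloyd's algorithm with hypothetical helper names. With two centroids, k-means can only split the plane along a straight line, so it cannot separate an inner ring from the ring that surrounds it.

```python
import math

def make_ring(radius, n_points):
    """Evenly spaced points on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n_points),
             radius * math.sin(2 * math.pi * i / n_points))
            for i in range(n_points)]

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2
                 for cx, cy in centroids]
    return distances.index(min(distances))

def kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break                      # assignments stopped changing
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:                # keep the old centroid if a cluster empties
                centroids[k] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return assignment

# Two obvious groups to the human eye: a small ring inside a large ring.
inner = make_ring(1.0, 40)
outer = make_ring(5.0, 40)
points = inner + outer
true_labels = [0] * len(inner) + [1] * len(outer)

# Start with one centroid on each ring, which is about as helpful as it gets.
assignment = kmeans(points, [inner[0], outer[0]])

# Fraction of points whose cluster matches its ring, under the better labeling.
matches = sum(a == t for a, t in zip(assignment, true_labels))
agreement = max(matches, len(points) - matches) / len(points)
print(f"agreement with the true rings: {agreement:.2f}")
```

Even with a favorable starting point, the agreement stays below 1.0 because at least one cluster always ends up containing points from both rings.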

Optimizing Data Access – Know your Hardware

The Library of Congress has a lot of information: hundreds of millions of pages of books and manuscripts. But no one has ever suggested that we store all of that information in a single, billion-page book. Instead, individual books are stored on shelves in stacks in rooms according to an organized system. Managing large datasets is just the same: data should exist in manageably sized files stored in hierarchically organized directories. Unfortunately, many people working with large datasets try to do just the opposite. This post describes how converting thirty 200 GB files into three million 200 KB files reduced data access times from several hours to under a second.

Data Management Questionnaire

Sometimes merely filling out a questionnaire can cause you to think about problems in a new way. When asked to answer a question that has never occurred to you before, you may find yourself reevaluating some of your core assumptions — assumptions you may not have known you had. That is the power of asking questions. Our data management questionnaire poses questions in 12 categories that will help you figure out what you need, what you want, and perhaps give you a hint of how to get there.

Standard Country Names

What’s in a name? That which we call a rose
By any other name would smell as sweet.

Ahhh love. Juliet speaks lovely poetry but we learn, as the story unfolds, that names and the identification they impart are in fact extremely important. This is no less true in data management where country names are anything but standardized.
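To make the problem concrete, here is a minimal Python sketch of one common approach: mapping the many spellings of a country name onto its ISO 3166-1 alpha-2 code. The alias table is a tiny hand-picked sample, and the helper names (`normalize`, `to_iso2`) are hypothetical; this is not the method or the data used in the full post.

```python
import unicodedata

# A small, illustrative alias table keyed by ISO 3166-1 alpha-2 code.
ALIASES = {
    "US": ["United States", "United States of America", "USA", "U.S.A."],
    "GB": ["United Kingdom", "UK", "Great Britain"],
    "RU": ["Russia", "Russian Federation"],
    "KR": ["South Korea", "Republic of Korea", "Korea, Republic of"],
    "CI": ["Côte d'Ivoire", "Ivory Coast"],
}

def normalize(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())

# Build the lookup once so every alias is stored in its normalized form.
LOOKUP = {normalize(alias): code
          for code, aliases in ALIASES.items()
          for alias in aliases}

def to_iso2(name):
    """Return the ISO alpha-2 code for a country name, or None if unknown."""
    return LOOKUP.get(normalize(name))

for raw in ["  united states of america ", "Cote d'Ivoire",
            "Korea, Republic of", "Atlantis"]:
    print(repr(raw), "->", to_iso2(raw))
```

Returning `None` for unrecognized names, rather than guessing, makes it easy to collect the unmatched spellings and extend the alias table deliberately.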

Methow Valley Air Quality

Mazama Science has released a new set of tutorials demonstrating the use of air quality R packages to investigate data from regulatory monitors and low-cost sensors. This post is just a short summary of what the tutorials cover. We invite anyone interested in wildfire smoke and air quality to run through the tutorials and provide feedback.

Qualitative Display of Air Quality Data

Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
Edward Tufte, The Visual Display of Quantitative Information

This post briefly summarizes our thoughts on best practices for designing public-facing data graphics for air quality data. Focus will be on the types of charts we feel are appropriate to use with data (e.g. from low-cost sensors) that may not be as accurate as data collected by monitors using Federal Regulatory or Federal Equivalent Methods (see FRMs/FEMs and Sensors). Visualization types discussed will include:

maps
time-series charts
calendars
status and forecast tables