Working with data from varied sources can be frustrating — some data will be in CSV format; some in XML; some available as HTML pages; other data as relational databases or MS Excel spreadsheets.
This post will cover the UNIX tools that every data manager needs to be familiar with in order to work with varied data sources.
With enough effort it is possible to fit a square peg into a round hole. But we have all learned — sometimes more than once — that it is much easier if peg and hole have the same shape.
Despite what they say, size does matter.
Successful data management is all about finding the proper tools and formats for dealing with your data. There is no one-size-fits-all solution. The very first question you should be asking yourself is: “How much data are we talking about?”
The Library of Congress has a lot of information: hundreds of millions of pages of books and manuscripts. But no one has ever suggested that we store all of that information in a single, billion-page book. Instead, individual books are stored on shelves in stacks in rooms according to an organized system. Managing large datasets is just the same: data should exist in manageably sized files stored in hierarchically organized directories. Unfortunately, many people working with large datasets try to do just the opposite. This post describes how converting thirty 200 GB files into three million 200 KB files reduced data access times from several hours to under a second.
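As a rough sketch of what "hierarchically organized" can look like in practice, the Python snippet below builds a file path directly from a record's keys, so finding one small file becomes a lookup rather than a scan through one huge file. The year/month/station layout, the `chunk_path` helper, and the station ID are illustrative assumptions, not the layout used in the post.

```python
from pathlib import Path

def chunk_path(root, station_id, year, month):
    """Return the path of the small file holding one station-month of data.

    The root/YYYY/MM/<station>.csv layout is a hypothetical example.
    """
    return Path(root) / f"{year:04d}" / f"{month:02d}" / f"{station_id}.csv"

# The path is computed from the keys alone; nothing has to be searched.
path = chunk_path("archive", "station_0042", 2013, 7)
print(path.as_posix())
```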
Sometimes merely filling out a questionnaire can cause you to think about problems in a new way. When asked to answer a question that has never occurred to you before, you may find yourself reevaluating some of your core assumptions, ones you may not have known you had. That is the power of asking questions. Our data management questionnaire poses questions in 12 categories that will help you figure out what you need and what you want, and perhaps give you a hint of how to get there.
What’s in a name? That which we call a rose
By any other name would smell as sweet.
Ahhh love. Juliet speaks lovely poetry, but we learn as the story unfolds that names and the identification they impart are in fact extremely important. This is no less true in data management, where country names are anything but standardized.
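To make the problem concrete, here is a minimal Python sketch that maps a few spelling variants onto one identifier per country. Using ISO 3166-1 two-letter codes as the target is an assumption made for illustration, and the `ALIASES` table and `country_code` helper are hypothetical; a real table would need far more entries.

```python
import unicodedata

# Hypothetical alias table: lowercase, accent-free names -> ISO 3166-1 alpha-2 code.
ALIASES = {
    "united states": "US",
    "united states of america": "US",
    "usa": "US",
    "ivory coast": "CI",
    "cote d'ivoire": "CI",
    "south korea": "KR",
    "republic of korea": "KR",
    "korea, republic of": "KR",
}

def country_code(name):
    """Return the code for a country name, or None if the name is unknown."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(plain.lower().split())
    return ALIASES.get(key)

for raw in ["  USA ", "Côte d'Ivoire", "Korea,  Republic of", "Atlantis"]:
    print(repr(raw), "->", country_code(raw))
```

Returning `None` for an unknown name, rather than guessing, keeps unmatched records visible so they can be fixed by hand.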
On the left we have zero, our integer measure of nothingness. On the right we have missing value, aka N/A, aka NA, our signal that the value of a datapoint is unknown. Everyone who deals with data has to deal with this important distinction. And far too often people get it wrong.
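A small Python sketch shows why the distinction matters. The rainfall readings are made up for illustration, and `None` stands in for a missing value; the two helper functions are hypothetical names, not part of any library.

```python
import math

# Hypothetical daily rainfall readings in mm; None marks a day with no reading.
readings = [0.0, 2.5, None, 0.0, 1.5]

def mean_ignoring_missing(values):
    """Average only the values that were actually measured."""
    present = [v for v in values if v is not None]
    if not present:
        return math.nan  # nothing was measured, so the mean is unknown
    return sum(present) / len(present)

def mean_missing_as_zero(values):
    """The common mistake: treat 'unknown' as if it were a measured zero."""
    return sum(0.0 if v is None else v for v in values) / len(values)

correct = mean_ignoring_missing(readings)  # 4.0 mm over 4 measured days
biased = mean_missing_as_zero(readings)    # 4.0 mm over 5 days, one of them invented

print(correct, biased)
```

The two true zeros belong in the average; the missing day does not, and counting it as zero drags the mean down.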
In the marketplace, the needs of producers and consumers are often at odds: producers want higher prices, consumers lower ones; producers want easy assembly, consumers easy disassembly; producers want flexibility and rapid prototyping, consumers reliability and long-term support.
The same competing needs exist in the world of scientific data management, where producers of data and consumers of data often operate in very different worlds with very different sets of tools.
What? Where? When?
These are key questions that every scientist or other collector of environmental data must answer.
- What is the value of the thing we are measuring?
- Where are we taking the measurement?
- When are we taking the measurement?
In a previous post we discussed how to standardize “when”. But what about “where”?
One of the big jokes among people who manage scientific datasets goes like this:
The great thing about standards is … there are so many to choose from!
While this one-liner may never make it to late-night TV, there is much truth to it. Many “standards” exist, and many more are invented each month to accommodate the special needs of new types of data or new software for processing data.
One standard, however, stands far above other options and should always be adopted: ISO 8601, the international standard for representing dates and times.
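For a quick taste, here is a minimal Python sketch that writes and reads an ISO 8601 timestamp using only the standard library `datetime` module. The particular date is arbitrary, chosen for illustration.

```python
from datetime import datetime, timezone

# An arbitrary example moment, explicitly in UTC.
stamp = datetime(2014, 3, 9, 14, 30, 0, tzinfo=timezone.utc)

# isoformat() writes year-month-day, a 'T', the time, then the UTC offset.
text = stamp.isoformat()
print(text)

# fromisoformat() reads the same string back into an identical datetime.
parsed = datetime.fromisoformat(text)
print(parsed == stamp)
```

Because the fields run from largest (year) to smallest (second), timestamps written this way with the same offset also sort chronologically when sorted as plain text.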