Optimizing Data Access – Know your Hardware

The Library of Congress has a lot of information: hundreds of millions of pages of books and manuscripts. But no one has ever suggested that we store all of that information in a single, billion-page book. Instead, individual books are stored on shelves in stacks in rooms according to an organized system. Managing large datasets is just the same: data should exist in manageably sized files stored in hierarchically organized directories. Unfortunately, many people working with large datasets try to do just the opposite. This post describes how converting thirty 200 GB files into three million 200 KB files reduced data access times from several hours to under a second.

Data Management Questionnaire

Sometimes merely filling out a questionnaire can cause you to think about problems in a new way. When asked a question that has never occurred to you before, you may find yourself reevaluating some of your core assumptions, including assumptions you may not have known you had. That is the power of asking questions. Our data management questionnaire poses questions in 12 categories that will help you figure out what you need and what you want, and perhaps give you a hint of how to get there.

Standard Country Names

What’s in a name?  That which we call a rose
By any other name would smell as sweet.

Ahhh, love. Juliet speaks lovely poetry, but we learn as the story unfolds that names, and the identification they impart, are in fact extremely important. This is no less true in data management, where country names are anything but standardized.
