Despite what they say, size does matter.
Successful data management is all about finding the proper tools and formats for dealing with your data. There is no one-size-fits-all solution. The very first question you should be asking yourself is: “How much data are we talking about?”
The Library of Congress has a lot of information: hundreds of millions of pages of books and manuscripts. But no one has ever suggested that we store all of that information in a single, billion-page book. Instead, individual books are stored on shelves in stacks in rooms according to an organized system. Managing large datasets is just the same: data should live in manageably sized files stored in hierarchically organized directories. Unfortunately, many people working with large datasets try to do just the opposite. This post describes how converting thirty 200 GB files into three million 200 KB files reduced data access times from several hours to under a second.
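To make the "shelves in stacks in rooms" idea concrete, here is a minimal Python sketch of one way to split a big file into many small ones. It is only an illustration, not the conversion described in this post: the function names `shard_path` and `split_by_id`, the two-level directory layout, and the assumption that every line begins with a record ID followed by a comma are all hypothetical.

```python
from pathlib import Path


def shard_path(root: Path, record_id: str, levels: int = 2, width: int = 2) -> Path:
    """Map a record ID to a nested path such as root/ab/cd/abcdef.csv.

    The directory names come from the leading characters of the ID, so
    no single directory ends up holding millions of files.
    """
    # Pad short IDs so every level has a full-width directory name.
    padded = record_id.ljust(levels * width, "_")
    parts = [padded[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, f"{record_id}.csv")


def split_by_id(big_file: Path, root: Path) -> None:
    """Append each line of big_file to the small file for its record ID.

    Assumption: each line starts with the record ID followed by a comma.
    """
    with big_file.open() as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record_id = line.split(",", 1)[0]
            target = shard_path(root, record_id)
            target.parent.mkdir(parents=True, exist_ok=True)
            # Reopening per line is simple but slow; a real conversion
            # would likely batch writes per record ID.
            with target.open("a") as dst:
                dst.write(line)
```

With a layout like this, reading one record means opening one small file at a path you can compute directly from the ID, rather than scanning a huge file to find it.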