The Library of Congress has a lot of information — hundreds of millions of pages of books and manuscripts. But no one has ever suggested that we store all of that information in a single, billion-page book. Instead, individual books are stored on shelves in stacks in rooms according to an organized system. Managing large datasets is just the same: data should exist as files of manageable size stored in hierarchically organized directories. Unfortunately, many people working with large datasets try to do just the opposite. This post describes how converting thirty 200 GB files into three million 200 KB files reduced data access times from several hours to under a second.
But before we tell you that story, we have to tell you this one:
Physical Constraints
One of the problems with our billion-page book is that we would need special equipment to handle it. A relatively affordable forklift might handle a million-page book but our B-book needs expensive, custom cranes just to flip to a particular chapter. In the physical world of paper and forklifts it is obvious to everyone that a reasonable ‘chunk’ of information is the amount a person can carry in one hand — a book.
When writing software, most people forget that they are also limited by physical constraints because they cannot see or hear the physical objects handling their data. Hardware manufacturers have worked very hard to reduce the size and noise and heat output of computer components to the point that most people have no clue what is going on inside their computer. It’s just ‘magic’. Back in the days of punch cards and magnetic tape and mercury delay line memory people had a lot more respect for the real, tangible aspect of data input and output (I/O) and memory (today’s RAM). Those days are long gone but it’s still important to be aware of the physical limitations of our current hardware.
Disk I/O
The importance of disk I/O is best summed up in a 2009 article titled Performance killer: disk I/O:
Many people think of “performance tuning” as optimizing loops, algorithms, and memory use. In truth, however, you don’t get the huge performance gains from optimizing CPU and memory use (which is good), but from eliminating I/O calls.
Disk I/O is responsible for almost all slow websites and desktop applications. It’s true. Watch your CPU use next time you open a program, or your server is under load. CPUs aren’t the bottleneck anymore – your hard drive is. At the hardware level, the hard drive is the slowest component by an incredibly large factor. Today’s memory ranges between 3200 and 10400 MB/s. In contrast, today’s desktop hard drive speeds average about 50 MB/s (Seagate 500GB), with high-end drives getting 85MB/s (WD 640, Seagate 1TB). If you’re looking at bandwidth, hard drives are 200-300 times slower. Bandwidth, though, isn’t the killer – it’s the latency. Few modern hard drives have latencies under 13 milliseconds – while memory latency is usually about 5 nanoseconds – 2,000 times faster.
Accessing data on your hard drive is painfully slow compared to working with it once it is in memory. It is useful to keep an image of a hard drive in your head when thinking about this. Wikipedia has a detailed page on disk drive performance characteristics which describes the moving parts that cause ‘rotational latency’ and ‘seek time’. While the 2009 numbers in the quote above will differ a little from those of newly purchased hard drives, the overall message remains the same today: Avoid unnecessary disk I/O like the plague.
If disk reads made loud noises or heated up your room you would be very aware of how much disk I/O was happening. Instead, we have to rely on system tools to give us a heads up. On Linux systems you may wish to install the sysstat package, which contains several utilities for monitoring system performance. For real-time analysis of disk operations you can use either of the following commands:
iostat with the ‘-d’ flag
$ iostat -d 5
Linux 2.6.32-31-generic (ubuntu)    07/07/2011    _i686_    (1 CPU)

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               4.51       321.91        73.80    2584461     592544

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               4.60       566.40         0.00       2832          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               8.00      1860.80         0.00       9304          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              16.20      3696.00         9.60      18480         48
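In this output, tps is the number of I/O transfers issued to the device per second, while Blk_read/s and Blk_wrtn/s show blocks read from and written to disk per second. A sustained jump in these columns while your program runs is a sign of heavy disk I/O.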
sar with the ‘-b’ flag
$ sar -b 5
Linux 2.6.32-31-generic (ubuntu)    07/07/2011    _i686_    (1 CPU)

12:41:20 PM       tps      rtps      wtps   bread/s   bwrtn/s
12:41:25 PM     15.03     14.83      0.20   3459.72      1.60
12:41:30 PM     16.17     15.57      0.60   3877.05      9.58
12:41:35 PM      9.62      9.22      0.40   2355.11      3.21
12:41:40 PM     16.97     16.37      0.60   3945.71      9.58
If you are working with very large datasets and your software feels bogged down, you should have a look at the disk I/O statistics and compare them to when your system is idle. Chances are you are experiencing a disk I/O bottleneck.
RAM
If every gigabyte of RAM were the size of a coffee table, people would learn to use RAM more efficiently. As it is, many people don’t even know how much RAM they have available. Operating systems have hidden this from us by providing virtual memory, which allows our software to think it has access to more memory than physically exists in RAM. The only problem is that virtual memory works through a paging system which reads and writes ‘memory’ to disk whenever RAM gets full. This takes us right back to the disk I/O bottleneck above.
To minimize paging, there is only one piece of advice to offer about RAM: Always install as much physical RAM as your system can hold.
For real-time analysis of virtual memory, use vmstat.
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0   1448  18384  26016 468020    0    0   132    23   82  145  6  1 93  0
 1  0   1448  28532  26024 457220    0    0     0     3  289  129 87 13  0  0
 1  0   1448  28532  26024 457220    0    0     0     0  298  140 88 12  0  0
 1  0   1448  28532  26032 457220    0    0     0     3  288  124 87 13  0  0
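The columns to watch are si and so, the amounts of memory swapped in from and out to disk. If they are consistently non-zero, your system is paging and the disk I/O bottleneck is back in play.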
Big Files vs. Little Files
From the data management perspective, our goal is to make access to data as quick and painless as possible. When data consumers (often ourselves) wish to work with very large datasets (> 100 GB), they should be able to do so without taxing either their computer’s resources or their own patience. In most cases, users of large datasets only want to work with a tiny fraction of the full dataset at a time — one data point out of every hundred or hundred thousand.
If a data graphic will be the final product of an analysis, it is worth calculating the maximum amount of data that can possibly be conveyed in that graphic. Once again this is determined by a physical limitation: the number of pixels. A 500 × 500 data graphic has 250,000 pixels, each of which has 4 channels: red, green, blue and alpha transparency. Each channel is specified by two hex values (00-FF), equivalent to 1 byte. So a 500 × 500 data graphic can display at most 1 MB worth of data. Any additional data will result in overplotting.
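In other words: 500 × 500 pixels × 4 channels × 1 byte per channel = 1,000,000 bytes, or roughly 1 MB of renderable information.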
Of course many analyses process more data than is ultimately plotted in the final data visualization. On the other hand, most visualizations don’t specify the color and transparency of every single XY location. Even scaling our 1 MB data maximum up by a factor of 10 to account for analysis and normal overplotting gives us an amount of data that can be read quickly from disk and easily accommodated in memory without paging. Thus, if we wish to minimize disk I/O, our optimal file size should be < 10 MB regardless of the size of the full dataset. Ideally, data access would look like this:
- determine which file to open
- open data file
- read entire file in a single pass (no disk seeks)
- close file
- work with data in memory without paging
The trick is to organize the data in such a way that a data consumer’s query can be answered by opening as few files as possible.
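As a rough illustration, here is a minimal Python sketch of the access pattern listed above. The directory layout, file name and float32 format are assumptions made for the example; the point is simply that one query maps to one small file that is read in a single sequential pass:

import numpy as np

def read_chunk(site_id, julian_day, hour):
    # Hypothetical path scheme: one small, self-contained file per site / day / hour.
    path = "Site_%04d/JDay_%03d_%02d.bin" % (site_id, julian_day, hour)
    with open(path, "rb") as f:
        raw = f.read()                       # read the entire file in one pass, no seeks
    # Interpret the bytes as 32-bit floats and work with them in memory.
    return np.frombuffer(raw, dtype=np.float32)

data = read_chunk(134, 112, 6)               # e.g. Site_0134, calendar day 112, hour 06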
Three Million 200 KB Files
We were recently tasked with providing interactive access to 30 years’ worth of atmospheric science model trajectories. This 6.3 TB dataset had the following structure: 2000 locations × 30 years × 365 days × 4 times daily × 7 heights × 240 hourly timesteps per trajectory × 4 variables at each timestep (lat, lon, height, pressure) × 4 bytes per floating point value. The data were organized into thirty ~200 GB files in a custom format, each containing all the data for a single year. The goal was to provide interactive access to trajectories associated with a particular location for a particular climatological date, e.g. June 15 from all years (or some subset of years). Unfortunately, reading in data for a single site from all thirty files took over 6 hours on a newer Linux box that had no other major processes running. (That disk drive was screaming even though it couldn’t be heard.) Needless to say, it’s hard to build an interactive data visualization system when data access alone takes 6 hours.
The solution to the problem lay in one of the core principles of data management: Organize the data in the manner in which it will be queried.
In this case, that meant extracting single-site data from the multi-site annual files and then reformatting the data in the following manner:
- data from each site are stored in a directory named by site, e.g. ‘Site_0134’
- data are organized into netCDF files named by calendar day (+ hour), e.g. ‘JDay_112_06.nc’
- each netCDF file contains three gridded axes: timestep × year × height
- only timesteps 1-72 are saved, per the client’s specification, further reducing file size
Each file is only 200 KB in size but there are now almost 3 million of them. But who cares! Every computer has an operating system whose job is the management of millions of files. No extra overhead is needed. Data ingest now involves reading in the entire contents of a single file (no disk seeks) and takes well under a second, over 10,000 times faster than before. There is no concern about paging as 200 KB easily fits into RAM. As the client got used to this improved speed they of course wanted to see multiple days’ worth of data and even multiple sites at a time. But even when opening a hundred files, data access times have remained below 2 seconds.
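For the curious, reading one of these per-site files with the netcdf4-python package looks something like the sketch below. The variable name ‘lat’ and the exact dimension ordering are assumptions; they depend on how the files were actually written:

from netCDF4 import Dataset

# Open the single small file that answers the query: one site, one calendar day + hour.
with Dataset("Site_0134/JDay_112_06.nc") as nc:
    # Hypothetical variable name; each variable is gridded as timestep x year x height.
    lat = nc.variables["lat"][:]
    # Slice out the first 24 timesteps for every year at the lowest height, all in memory.
    subset = lat[0:24, :, 0]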
Take Home Message
- disk I/O is the biggest performance killer
- data can be reformatted to match the structure of the questions being asked
- keeping file sizes to < 10 MB minimizes disk seeks
- keeping active data volumes to < 10 MB minimizes RAM paging
- data access times can and should be in the 1-3 second range
Best of luck restructuring your data for optimum data access.
A previous version of this article originally appeared in 2010 at WorkingwithData.