In the marketplace, the needs of producers and consumers are often at odds: producers want higher prices, consumers lower ones; producers want easy assembly, consumers easy disassembly; producers want flexibility and rapid prototyping, consumers reliability and long-term support.
The same competing needs exist in the world of scientific data management where producers of data and consumers of data often operate in very different worlds with very different sets of tools.
Although examples could be drawn from any field of science, climate and environmental data provide some superb examples that highlight the different world views of data producers and data consumers.
Synoptic vs. Time-series
Weather data are collected every hour and fed into data-ingest models that produce synoptic fields: descriptions of the state of the atmosphere at a specific time but on a broad spatial scale. These fields serve as input to forecasting models, which compare the most recent field with earlier fields and calculate a forecast of the state of the atmosphere at specific times in the future. The organizing principle for data input and output files is xy-region by time point.
Climate models work the same way, calculating the state of the global climate one time point at a time. The output of these models, even when stored as multi-dimensional NetCDF files, is typically organized as a series of snapshots at specific times. This series of snapshots is the world view of these data producers.
Data consumers come in many varieties, of course. Some will be interested in generating maps of weather or climate at some future time or date: the "map users". For these users, the snapshot world view is an excellent fit.
But what about someone who is interested in a time-series representation of the daily weather or monthly climate at a particular location? To assemble the data for this representation, our "time-series user" must open and read each (potentially multi-gigabyte) snapshot in order to extract a single value at their location of interest. A time series of 1,000 points may require processing a terabyte of data. Clearly the "time-series user" would prefer the data be organized as "time-series by location": after opening the file for their location of interest, they would simply read all of the data.
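The contrast between the two layouts can be sketched in a few lines of Python. This is a toy illustration with invented values, not any particular file format: the same numbers stored snapshot-by-time versus time-series-by-location.

```python
# Toy illustration (invented data, not a real file format): the same
# values stored "snapshot by time" vs. "time-series by location".

# Snapshot layout: one 2-D grid per time step, as a map user sees it.
snapshots = [
    [[10, 11], [12, 13]],   # t = 0
    [[20, 21], [22, 23]],   # t = 1
    [[30, 31], [32, 33]],   # t = 2
]

# A time-series user wants all values at grid point (row 1, col 0).
# With the snapshot layout, every time step must be opened and read.
series_from_snapshots = [grid[1][0] for grid in snapshots]

# Time-series layout: the same data keyed by location up front.
by_location = {
    (r, c): [grid[r][c] for grid in snapshots]
    for r in range(2) for c in range(2)
}

# Now a single lookup yields the whole series.
series_from_layout = by_location[(1, 0)]

print(series_from_snapshots)  # [12, 22, 32]
```

With real multi-gigabyte snapshot files, the first loop means opening and scanning every file for one value apiece; the second layout reads one record.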
Time-series vs. Synoptic
In the world of environmental science, the reverse scenario is often true. Data are typically collected and organized by location through the use of a unique "Station ID". Samples taken at one location in different years have different timestamps but the same "Station ID". Any data consumer wishing to generate a synoptic view of the data (all stations for a particular year) must reorganize the data in order to generate the maps or other broad-scale representations they desire.
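A minimal sketch of the reorganization such a consumer must perform, using hypothetical station records rather than any agency's actual schema:

```python
# Hypothetical environmental records, organized producer-style by Station ID.
records = [
    {"station": "STN-01", "year": 2019, "value": 4.2},
    {"station": "STN-01", "year": 2020, "value": 4.5},
    {"station": "STN-02", "year": 2019, "value": 3.8},
    {"station": "STN-02", "year": 2020, "value": 3.9},
]

# Pivot to a synoptic view: one mapping per year, all stations at once.
synoptic = {}
for rec in records:
    synoptic.setdefault(rec["year"], {})[rec["station"]] = rec["value"]

print(synoptic[2019])  # {'STN-01': 4.2, 'STN-02': 3.8}
```

The pivot is trivial here; with millions of records spread across per-station files, the same reorganization becomes the dominant cost of making a map.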
Serving Data Consumers
To our way of thinking, scientific data management should be about meeting the needs of the data consumers: scientists, policy makers and engaged members of the public. In order for science to inform public policy, the process of working with scientific data must be made easier. A tremendous amount of time and effort is spent reformatting data for use with specific analysis tools, and an equally tremendous amount of subtlety and detail is lost with each reformatting. It is up to the data managers to make sure that data are made available in structures and formats that help the ultimate users. Sometimes this means doing things in a less-than-cutting-edge manner.
Making data available via the latest XML-WSDL-web-service frameworks may fit with software engineering best practices but these data are unlikely to be useful to biologists, environmental consultants, geologists, hospital administrators, petroleum engineers, physicists or anyone else without a computer science degree.
Expecting these people to have access to computer staff who can help them is often a very poor assumption. The chances that they will write Java, C or Python code to work with the data are slim. The chances that they will reach for their favorite trusted analysis and visualization package (sadly, sometimes only Microsoft Excel) are high. If their favorite package does not support a particular data format, those data are essentially unavailable to them. (We are not recommending abandoning modern, information-age approaches to data delivery, only supplementing them with formats that are accessible to the huge number of intelligent individuals still working with bronze-age tools.)
Data Consumer Checklist
We provide the following checklist in the hope that it will inspire data managers to step out of their data producer and software engineering world views and think about what would be most useful to those at the other end of the data pipeline.
- Identify one or more groups of data consumers — people who want to do analysis and visualization with the data.
- Identify which software tools they use — statistical packages like R, S+, Statistica, etc.; multi-dimensional engines like Matlab, IDL, Octave, etc.; spreadsheets like Microsoft Excel or OpenOffice; specialized software for a particular community of practice.
- Identify any standards (formats, metadata conventions, variable names) that exist within a particular community of practice.
- Determine how they want to work with the data — e.g. synoptic vs. time-series.
- Seek out a representative from the identified users who will work as a guinea pig to test the data formats you create.
- Be prepared to offer the same data in multiple formats to satisfy the needs of different groups of consumers.
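The last point can be as simple as exporting the same values in more than one layout. A sketch using Python's standard csv module (the station names, years and column headings are invented for illustration):

```python
import csv
import io

# The same hypothetical measurements, to be offered in two layouts.
data = {"STN-01": {2019: 4.2, 2020: 4.5},
        "STN-02": {2019: 3.8, 2020: 3.9}}

def timeseries_csv(data):
    """One row per (station, year): convenient for per-station analysis."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["station", "year", "value"])
    for station, series in sorted(data.items()):
        for year, value in sorted(series.items()):
            writer.writerow([station, year, value])
    return buf.getvalue()

def synoptic_csv(data, year):
    """All stations for a single year: convenient for mapping."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["station", "value"])
    for station, series in sorted(data.items()):
        writer.writerow([station, series[year]])
    return buf.getvalue()

print(timeseries_csv(data).splitlines()[0])  # station,year,value
print(synoptic_csv(data, 2019).splitlines()[1])  # STN-01,4.2
```

Plain CSV is deliberately unglamorous, but it opens directly in every spreadsheet and statistical package on the checklist above.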
In the end, good scientific data management is about increasing the efficiency of the data consumers by anticipating and then meeting their needs. If all goes well, our efforts at data management will scale terrifically as every hour we spend making data more useful will be multiplied by the number of data consumers who no longer have to do this work.
It is our hope that federal and state science agencies will pursue careful data management with the same energy they have devoted to data access. Finding data at aggregation sites like data.gov is a wonderful thing. But being able to actually use data is equally important.
A previous version of this article appeared at WorkingwithData.