Data Management Questionnaire

Sometimes merely filling out a questionnaire can cause you to think about problems in a new way.  When asked to answer a question that has never occurred to you before, you may find yourself reevaluating some of your core assumptions, including assumptions you did not know you had.  That is the power of asking questions.  Our data management questionnaire poses questions in 12 categories that will help you figure out what you need and what you want, and perhaps give you a hint of how to get there.

For those who always watch the credits at the end of movies, we have a little more to say about our philosophical underpinnings after the questionnaire.  For the impatient, here you go:

Please answer each question to the best of your ability and be honest when the answer is “I don’t know” or “I’m not sure.”

Questions about the community involved in creating, using and managing the data.

1) Data Providers

  • Why were the data collected?
  • Who are the data providers?
  • How do you get feedback from them?
  • What questions do they have about the data?
  • What software tools do they currently use to work with data?
  • What data formats do they prefer?
  • What leverage exists to get their cooperation?
  • What resources do they have?
  • What internal procedures do they have that impact when and how data are delivered?
  • What political needs must be met?
  • What would make them happy customers?

2) Data Consumers

  • Who are the data consumers?
  • How do you get feedback from them?
  • What questions do they have about the data?
  • What software tools do they currently use to work with data?
  • What data formats do they prefer?
  • How do they want to interact with the data?
  • Are there requirements to interact with other software systems?  (e.g. “web services”)
  • What would make them happy customers?

3) Data Management

  • Is data management a dedicated activity?
  • Who is responsible for data management?
  • Do they have a background in the field of study?
  • Is this data management project driven by the needs of software engineering or of science?
  • What financial resources does the individual/team have?
  • What personnel resources do they have?
  • What hardware resources do they have?
  • How will they communicate with data providers and data consumers?
  • Who will provide long term support for data management?

Questions about the nature of the data.

4) Volume

  • How many actual numeric measurements (not including textual meta-data) are made in a year? — thousand/million/billion/trillion
  • How many are made in a day?

5) ‘Speed’

  • What temporal precision is needed for incoming data? — second/minute/hour/day/month/year
  • How up-to-date should the ‘released data’ be? — Everything up to the last minute/hour/day/month/year?

6) ‘Shape’

  • How are data currently stored? — text file/CSV/XML/GIS/spreadsheet/RDBMS/binary file/other
  • Can the data be expressed as a single row-by-column table?
  • Are any data geo-spatially located?
  • Do any data have regular spacing along an axis with physical units? — latitude/longitude/depth/height/time
  • What additional metadata must be stored?

Questions about functionality to be achieved.

7) Validation

  • How are data currently being validated?
  • How are successfully validated data points being identified?

8) Versioning

  • How are raw data being versioned?  (e.g. How are changes to the data store being tracked?)
  • Can earlier versions be retrieved?
  • How are released data being versioned?

9) Provenance

  • How are the history and origin of each data point being tracked as data move from individual submissions to larger aggregations?

10) Authorization

  • Who is allowed to enter data?
  • Who is allowed to extract data?
  • What should be open to the general public?
  • What kind of secure technology is mandated/desired?

11) Analysis

  • Are any specific kinds of analysis associated with the data?
  • Is it desirable to build a system that helps users perform appropriate analysis?
  • What software requirements does this impose upon this data management project?

12) Interactive Access

  • What sort of interactive access should be provided to data consumers? — subsetting/querying/reformatting/analysis/visualization

We hope that these questions are both self-explanatory and thought-provoking.  By asking them we do not intend to overwhelm small projects with dreams that are too big for their budgets.  Rather, we hope to save both providers and consumers of data time and money by getting them to think ahead about how their data will be used productively rather than simply ignored in our current data deluge.

Our philosophy of the practice of data management is that it should always be in the service of both the data providers and data consumers (1 & 2).  One must have a thorough understanding of the original questions that motivated the data collection effort and the additional questions that the data can be used to answer before attempting to come up with data management solutions.  The tools and formats that are currently in use as well as the political landscape in the provider and consumer communities set the fundamental frame for organizing data.

The questions about data management resources and responsibilities (3) are key.  We believe that data management should always be a dedicated component of any data gathering or data analysis activity.  Data management occupies a fundamental position between the worldview of the data providers and that of the data consumers.  Any idea that these two groups are one and the same must be left behind.  In this modern age of open data dissemination, the expectation is that any data that are collected will be combined with other data and used by a wide variety of individuals, each with a different set of needs.  It is important, therefore, for data managers to identify target audiences and provide added benefit to these ‘customers’.  The simplest way to gauge the success of a data management project is to ask:  “Did we make the job of person X easier?”  We also feel very strongly that those involved in any data management project should have some familiarity with the discipline involved.  At a minimum, they should be in regular contact with the providers and consumers of data.

Questions 4–6 are designed to get people to think carefully about what kind of structure might be best for their data.  We have plenty of war stories about SQL databases being set up with dozens of interrelated tables and complex schemas for data sets that were no longer being updated.  As it turned out, the data could be written out as a single CSV file of a thousand rows by a hundred columns, and this simpler format was actually more useful to the downstream consumers of data.  Others build overly complex systems based on outdated ideas of how much data can be stored on disk or in memory.  The ‘speed’ issue (5) has to do with transactional vs. archival databases and may be the topic of a future post, but it boils down to how up-to-date data access needs to be.  Are you designing an airline booking system that needs to be aware of transactions completed a few seconds ago, or are your data only updated and released once a month?
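To make the single-table point concrete, here is a minimal sketch in Python using only the standard library.  It assumes the data really can be expressed as one row-by-column table.  The field names, station codes and values are made up for illustration and do not come from any real data set.

```python
import csv
import io


def write_table(rows, fieldnames):
    """Serialize a list of dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_table(text):
    """Parse CSV text back into a list of dict records (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical measurements: one row per observation, one column per variable.
FIELDS = ["station", "date", "temperature_c"]
records = [
    {"station": "A1", "date": "2010-06-01", "temperature_c": "14.2"},
    {"station": "A1", "date": "2010-06-02", "temperature_c": "15.0"},
    {"station": "B7", "date": "2010-06-01", "temperature_c": "11.8"},
]

csv_text = write_table(records, FIELDS)
round_trip = read_table(csv_text)
```

Nothing here requires a database server or a schema migration, and the resulting text can be opened in a spreadsheet or nearly any analysis tool a data consumer already uses.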

The last section on functionality has several leading questions to get people thinking about what a data management system could offer.  We believe that data should be validated, updated, trackable, etc., and these questions need to be answered before you design your data handling system.  We also believe that it is in the interest of everyone to make vetted data, analyses and visualizations available to as broad an audience as possible.  Publicly available data and analysis allow us to harness the interest and skills of people we may not even know about.  It is important to plan on internet access even if the budget only allows for a web page pointing to publicly available CSV files.  Even that would be a far sight better than no data access at all or keeping the data essentially locked up in a complicated software system.
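Returning to the validation questions (7), one simple way to identify successfully validated data points is to carry an explicit flag alongside each record.  The sketch below is only an illustration under stated assumptions: the function name `flag_valid`, the `valid` column, and the plausible-range check are all hypothetical choices, and a real project would substitute whatever checks its own discipline requires.

```python
def flag_valid(records, field, low, high):
    """Return copies of the records with an added boolean 'valid' flag.

    A record is flagged valid when `field` parses as a number that lies
    within the inclusive range [low, high].  The range check is an assumed,
    illustrative rule, not a general definition of validity.
    """
    flagged = []
    for record in records:
        try:
            value = float(record[field])
            ok = low <= value <= high
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric entries are kept but marked invalid.
            ok = False
        flagged.append({**record, "valid": ok})
    return flagged


# Hypothetical readings; values are made up for illustration.
readings = [
    {"station": "A1", "temperature_c": "14.2"},
    {"station": "A1", "temperature_c": "999"},
    {"station": "B7", "temperature_c": "n/a"},
    {"station": "B7"},
]

checked = flag_valid(readings, "temperature_c", -60.0, 60.0)
valid_only = [r for r in checked if r["valid"]]
```

Keeping the flagged records rather than deleting them means the raw submissions stay available, while consumers can filter on the flag to get the vetted subset.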

Best of luck answering your own data management questions!

A previous version of this article originally appeared in 2010 at WorkingwithData.
