One of the big jokes among people who manage scientific datasets goes like this:
The great thing about standards is … there are so many to choose from!
While this one liner may never make it to late-night TV, there is much truth to it. Many “standards” exist, and many more are invented each month to accommodate the special needs of new types of data or new software for processing data.
There is, however, one exception that proves the rule: ISO 8601– the international standard for representing dates and times.
What is so hard about understanding dates?
Every American child knows how to write a date like “Feb 26, 2020”. For the purposes of internationalization we must of course substitute the month name with a number leading to the familiar “2/26/20” older Americans use when writing a paper check. After the whole Y2K debacle a few years back we also learned that we really need to spell out the full year which leads us to “2/26/2020”. So what is wrong with this format?
Well, the French would write it as “26/2/2020” , the Germans as “26.2.2020”, and the Chinese as “2020-02-26”
It is clear that there is more we need to worry about if we wish to have machine readable, fully internationalized dates:
- Are they little-, big- or middle-endian (day-month-year, year-month-day or month-day-year)
- What separator is used?
- Are single digit months and days zero-padded?
- What is the relevant time zone? (If you didn’t think time zones were important, just remember that the attack on Pearl Harbor took place on December 8th according to the Japanese calendar, not FDR’s “December 7, 1941 — a date which will live in infamy …”)
In fact, humans have come up with a bewildering variety of ways to express calendar dates. Which is precisely what leads us to ISO 8601.
The International Organization for Standards Technical Corrigendum 8601-1:2019 describes an unambiguous standard for representing dates, times and time zone information. For only 158 Swiss Franks you can buy a copy for your library. (In the interest of brevity we will spare you the diatribe against charging for descriptions of international standards.)
As usual, the Wikipedia page has an excellent description of the indented of use of ISO 8601:
In general, ISO 8601 applies to representations and formats of dates in the Gregorian (and potentially proleptic Gregorian) calendar, of times based on the 24-hour timekeeping system (with optional UTC offset), of time intervals, and combinations thereof. The standard does not assign any specific meaning to elements of the date/time to be represented; the meaning will depend on the context of its use. In addition, dates and times to be represented cannot include words with no specified numerical meaning in the standard (e.g., names of years in the Chinese calendar) or that do not use characters (e.g., images, sounds).https://en.wikipedia.org/wiki/ISO_8601
Any date-time combination between the adoption of the Gregorian calender (1582-1929 on a per country basis) and “December 31, 9999” can be unambiguously represented with this format.
As careful data managers, our job often involves cleaning up datasets that were generated by humans and human-oriented tools and making them consistent and machine readable. We are often surprised by the lengths to which data providers will go to make this task difficult:
- spreadsheets with 2-digit years, month names and non-zero padded days
- databases with fields for years and Julian days (i.e. day number between 1 and 365)
- XML output with dates, locations and local times rather than UTC times
- ambiguous all-numeric, non-zero padded formats like “2122009”
Each of these choices may have made sense to those collecting the data and may in fact be required in order to work with the hardware and software needed for their particular project. But that doesn’t mean that these choices must be inflicted on the ultimate end users of the data. Luckily, the ISO 8601 standard is supported by every important computer language. So parsing non-standard date information to create ISO 8601 date representations is relatively straightforward.
Whatever we are using for data storage, when working with dates it is important to include a field, column, element, etc. that contains the ASCII ISO 8601 representation of the date.
What this means in practical terms is that for every dataset we work with we must:
- Understand how to use ISO 8601 in our chosen programming language or analysis software.
- Identify whether the source data has an ISO 8601 date field.
- If it does, validate each entry in this field.
- If it does not, process the available date information found in other fields and create the ISO 8601 date string which should be stored in an appropriately named field e.g. “ISO_datetime” (though we prefer just “datetime”).
We do not worry about the data duplication involved in having “day”, ”month”, ”year” and “datetime” fields. Data storage is cheap. Human misunderstanding is expensive. Once we have the “datetime” field in our dataset we will have greatly enhanced the utility of the dataset for all downstream users, be they human or machine.
Dealing with the inconsistencies of date and time formats has resulted in a larger waste of human effort than probably any other issue in data management. Lets put the issue entirely to rest by insisting that every dataset with dates include a “datetime” field.