What’s in a name? That which we call a rose
By any other name would smell as sweet.
Ahhh love. Juliet speaks lovely poetry but we learn, as the story unfolds, that names and the identification they impart are in fact extremely important. This is no less true in data management where country names are anything but standardized.
Anyone care to explain the difference between “England”, “Great Britain” and the “United Kingdom”? Sadly, these politically distinct entities are often confounded. If we start with England and then add Scotland and Wales we get “Great Britain”. If we then add Northern Ireland we end up with the “United Kingdom” or UK. (But remember, even the UK does not officially include the Isle of Man or the Channel Islands, which are Crown dependencies.) Country names are inextricably bound to the vagaries of history and politics.
Our first example only looked at the name of one country in the language of that country. Other countries of course have their own word for the UK: Royaume-Uni (French), Reino Unido (Spanish), Vereinigtes Königreich (German), Verenigd Koninkrijk (Dutch). Linguists will recognize that these are all local versions of the English phrase “United Kingdom”.
Translating country names is not always this simple. Germany’s neighbors each have their own name for Germany, in several cases one that stems from the Germanic tribe they had to deal with back in the Roman era: Allemagne (French), Germania (Italian) (but tedesco for a German person), Duitsland (Dutch), Niemcy (Polish). Whatever agreement these nations might reach in their assessment of Germany’s past, there is no agreement on what to call the Bundesrepublik Deutschland.
Country name inconsistencies
As this is an English-language site, we will investigate country name inconsistencies a little further by looking at some problematic names found in a few English-language datasets. (Names are from 2010 versions.) These are all high-value, well-curated datasets that make every attempt to be standardized:
|CIA World Fact Book||Ireland||Russia||Libya||Iran||Burma||Hong Kong||United States|
|US Census Bureau IDB||Ireland||Russia||Libya||Iran||Burma||Hong Kong||United States|
|BP Statistical Review||Republic of Ireland||Russian Federation||Libya||Iran||Myanmar||China Hong Kong SAR||US|
|UNICEF||Ireland||Russian Federation||Libyan Arab Jamahiriya||Iran, Islamic Republic of||Myanmar||Hong Kong, China||United States of America|
|Joint Oil Data Initiative||Ireland||Russian Federation||Libyan Arab Jamahiriya||Iran, Islamic Republic of||Myanmar||Hong Kong China||United States of America|
It’s quite an interesting mix. The only “British” dataset is careful to use the full “Republic of Ireland” and apparently wants to appease China with its use of “China Hong Kong SAR”. Meanwhile, the Riyadh-based Joint Oil Data Initiative is careful about “Libyan Arab Jamahiriya” and the “Islamic Republic of” Iran. By contrast, the American datasets seem almost colloquial, leaving off all the formal titles and even using “Burma” instead of “Myanmar”. These names may be in common use in the US but they don’t always match other groups’ standards.
As data managers we always look for standard identifiers to represent well-defined concepts. Political units known as nations are an excellent example of such a well-defined concept. No one disagrees about what is meant whether one types “US”, “USA”, “United States” or “United States of America”. But for anyone working with data, especially those attempting to merge data from different sources, it makes a huge difference whether the national identifier is standardized or not. Writing code to catch all possible variations of English-language country names is just the sort of tedious exercise that can be avoided by the proper use of standard identifiers.
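To make that tedium concrete, here is a minimal sketch of the kind of alias table one ends up maintaining by hand when names are not standardized. The `ALIASES` table and the `to_code` helper are hypothetical names introduced for illustration, the list covers only a few of the spellings seen in the table above, and the two-letter codes it maps to are the ones introduced in the next section.

```python
# Hypothetical alias table: a few of the spellings seen above, each mapped to one code.
ALIASES = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "united states of america": "US",
    "burma": "MM",
    "myanmar": "MM",
    "russia": "RU",
    "russian federation": "RU",
    "ireland": "IE",
    "republic of ireland": "IE",
}


def to_code(name):
    """Return the two-letter code for a known spelling, or None if unrecognized."""
    key = " ".join(name.split()).casefold()  # collapse whitespace, ignore case
    return ALIASES.get(key)


print(to_code("United States of America"))
print(to_code("  Burma "))
print(to_code("Atlantis"))  # not in the table, so None
```

Every new data source tends to add a few more rows to a table like this, which is exactly the maintenance burden a standard identifier removes.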
ISO 3166-1 alpha-2
Luckily, ISO 3166-1 alpha-2 is a widely adopted standard that will meet our needs. This standard specifies a unique, 2-character code for each nation. Once we have the codes, we can create tables of country names in different languages. Here is another table with the ISO 3166-1 alpha-2 codes and the recommended English and French names available from the ISO site.
|ISO 3166-1 alpha-2||IE||RU||LY||IR||MM||HK||US|
|English names||IRELAND||RUSSIAN FEDERATION||LIBYAN ARAB JAMAHIRIYA||IRAN, ISLAMIC REPUBLIC OF||MYANMAR||HONG KONG||UNITED STATES|
|French names||IRLANDE||RUSSIE, FÉDÉRATION DE||LIBYENNE, JAMAHIRIYA ARABE||IRAN, RÉPUBLIQUE ISLAMIQUE D’||MYANMAR||HONG-KONG||ÉTATS-UNIS|
One of the truisms of scientific data management is that it is always more robust in the long run to store identifiers in a standardized, computer-readable fashion. When human-readable output is needed, code can be written to generate whatever local version of the standard identifier is needed. This is much preferable to storing human-friendly identifiers that prevent data from different sources from being combined or, worse, that work in some cases but not others.
It is quite possible to have your cake and eat it, too. We are not recommending that you create Excel spreadsheets that are devoid of human readable names, requiring your users to know that DZ stands for Algeria. Instead, just add another column labeled country_code or ISO_3166_1_alpha_2 or similar that contains the appropriate code. That way you and the users of your data will be able to write software to automatically process the data without worrying about whether a cell has “USA” or “United States”.
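As a rough illustration of adding such a column, the sketch below reads a small CSV export and writes it back out with a `country_code` column next to the human readable name. The `NAME_TO_CODE` lookup and the sample figures are made up for the example; only the column name and the DZ code for Algeria come from the text above.

```python
import csv
import io

# Hypothetical lookup from the names used in one spreadsheet to alpha-2 codes.
NAME_TO_CODE = {"Algeria": "DZ", "United States": "US", "Ireland": "IE"}

# Stand-in for a spreadsheet exported as CSV; the values are invented.
source = io.StringIO("country,value\nAlgeria,10\nUnited States,20\nIreland,30\n")

output = io.StringIO()
writer = csv.DictWriter(
    output,
    fieldnames=["country_code", "country", "value"],
    lineterminator="\n",
)
writer.writeheader()
for row in csv.DictReader(source):
    # Keep the readable name and add the standardized code alongside it.
    row["country_code"] = NAME_TO_CODE[row["country"]]
    writer.writerow(row)

result = output.getvalue()
print(result)
```

A name missing from the lookup raises a `KeyError` here, which is deliberate: it is better to stop than to write a row with a guessed code.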
If, on the other hand, your identifiers live in a database and will be processed before a human ever sees them, then you should store only the identifiers and have the presentation layer of the software convert the coded identifiers into human-readable strings, preferably in the reader’s language of choice.
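A presentation layer of this kind can be very small. The sketch below is one possible shape, not a prescribed design: `NAMES` and `display_name` are hypothetical, the names are copied from the ISO English and French lists shown earlier, and the stored values are invented.

```python
# Names copied from the ISO English and French lists shown earlier in this article.
NAMES = {
    "en": {"IE": "IRELAND", "MM": "MYANMAR", "US": "UNITED STATES"},
    "fr": {"IE": "IRLANDE", "MM": "MYANMAR", "US": "ÉTATS-UNIS"},
}


def display_name(code, language="en"):
    """Return the name for a code in the requested language.

    Falls back to English, then to the bare code, so output never breaks.
    """
    for lang in (language, "en"):
        name = NAMES.get(lang, {}).get(code)
        if name is not None:
            return name
    return code


# The stored data holds only codes; names are attached at display time.
stored_rows = [("IE", 4.5), ("US", 309.0)]  # invented values
for code, value in stored_rows:
    print(display_name(code, "fr"), value)
```

Because the stored rows never contain a name, switching the reader's language is a change to one argument rather than a change to the data.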
Over the years, of course, other standards have come and gone. Perhaps the most important competing 2-character code was the FIPS 10-4 standard developed by the US National Institute of Standards and Technology. This standard was withdrawn by NIST in 2008 in favor of the ISO 3166 standard.
Unfortunately, as of 2010, not every US agency has abandoned its old-fashioned ways. Databases for the Energy Information Administration Country Energy Profiles and the CIA World Factbook still use the old FIPS 10-4 codes, so you need to be careful if you are accessing those databases programmatically. (If you aren’t, you may be surprised at Switzerland’s huge population and coal production: in ISO 3166 ‘CH’ means Switzerland but in FIPS 10-4 ‘CH’ means China.)
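One defensive approach is to translate FIPS 10-4 codes explicitly before merging with ISO-coded data. The sketch below is deliberately partial: `FIPS_TO_ISO` and `fips_to_iso` are hypothetical names, and the table holds only a few illustrative entries. The SZ entry for Switzerland comes from the FIPS 10-4 list rather than from this article.

```python
# Partial, illustrative FIPS 10-4 to ISO 3166-1 alpha-2 table (not a complete list).
FIPS_TO_ISO = {
    "CH": "CN",  # China: 'CH' in FIPS 10-4, 'CN' in ISO 3166
    "SZ": "CH",  # Switzerland: 'SZ' in FIPS 10-4, 'CH' in ISO 3166
    "US": "US",  # United States: the same in both
}


def fips_to_iso(fips_code):
    """Translate a FIPS 10-4 code, raising KeyError rather than guessing."""
    return FIPS_TO_ISO[fips_code.upper()]


print(fips_to_iso("CH"))
print(fips_to_iso("SZ"))
```

Raising an error for unknown codes is a design choice: passing an untranslated code through silently is how China's figures end up filed under Switzerland.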
Python Babel module
As an added incentive to encourage the use of these country codes, we would like to introduce you to the Python Babel module. The Babel module helps with internationalization (aka i18n) and we will use it to create a table of country names in different languages from our standardized codes. Do not be put off by the geeky nature of the Babel home page; you don’t need to learn much to make use of this excellent module.
Installation is a snap on Mac OS X or Ubuntu Linux:
    pip install Babel
Translations for our country codes are found in Locale objects in the module, one for each language. As you might expect, the supported languages are themselves encoded, this time with two-letter ISO 639-1 language codes. We’ll reserve detailed description of the Babel module for another post. For now, let’s cut to the chase and show the code that will generate HTML rows containing the list of country names in the languages of those countries:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import sys
    from babel import Locale

    # NOTE: 'my' dropped as the Burmese characters won't display on my Mac
    for language in ('ga','ru','ar','fa','zh','en'):
        locale = Locale(language)
        print('<tr>')
        print('<td>' + language + '</td>')
        for country in ('IE','RU','LY','IR','MM','HK','US'):
            print('<td>' + locale.territories[country] + '</td>')
        print('</tr>')
Eleven actual lines of code and here is the result:
|ga||Éire||Cónaidhm na Rúise||An Libia||An Iaráin||Maenmar||R.R.S. na Síne Hong Cong||Stáit Aontaithe Mheiriceá|
|ru||Ирландия||Россия||Ливия||Иран||Мьянма||Гонконг, Особый Административный Район Китая||США|
|ar||أيرلاندا||روسيا||ليبيا||ايران||ميانمار||هونج كونج الصينية||الولايات المتحدة الأمريكية|
|fa||ایرلند||روسیه||لیبی||ایران||میانمار||هنگکنگ، ناحیهٔ ویژهٔ حکومتی چین||ایالات متحدهٔ امریکا|
|en||Ireland||Russia||Libya||Iran||Myanmar||Hong Kong SAR China||United States|
Pretty amazing, eh?
So now you know why we always use standardized codes for country names and languages. And you should too!
A previous version of this article originally appeared in 2010 at WorkingwithData.