Standard Country Names

What’s in a name? That which we call a rose
By any other name would smell as sweet.

Ahhh love. Juliet speaks lovely poetry but we learn, as the story unfolds, that names and the identification they impart are in fact extremely important. This is no less true in data management where country names are anything but standardized.

Anyone care to explain the difference between “England”, “Great Britain” and the “United Kingdom”? Sadly, these politically distinct entities are often conflated. If we start with England and then add Scotland and Wales we get “Great Britain”. If we then add Northern Ireland we end up with the “United Kingdom” or UK. (But remember, even the UK does not officially include the Isle of Man or the Channel Islands, which are dependencies of the Crown.) Country names are inextricably bound to the vagaries of history and politics.

Our first example only looked at the name of one country in the language of that country. Other countries of course have their own word for the UK: Royaume-Uni (French), Reino Unido (Spanish), Vereinigtes Königreich (German), Verenigd Koninkrijk (Dutch). Linguists will recognize that these are all local versions of the English phrase “United Kingdom”.

Translating country names is not always this simple. Each of Germany’s neighbors has its own name for Germany, often one that stems from the Germanic tribe that they had to deal with back in the Roman era: Allemagne (French), Germania (Italian) (but tedesco for a German person), Duitsland (Dutch), Niemcy (Polish). Whatever agreement these nations might reach in their assessment of Germany’s past, there is no agreement on what to call the Bundesrepublik Deutschland.

Country name inconsistencies

As this is an English-language site, we will investigate country name inconsistencies a little further by looking at some problematic names found in a few English-language datasets. (Names are from 2010 versions.) These are all high-value, well-curated datasets that make every attempt to be standardized:

CIA World Fact Book	Ireland	Russia	Libya	Iran	Burma	Hong Kong	United States
US Census Bureau IDB	Ireland	Russia	Libya	Iran	Burma	Hong Kong	United States
BP Statistical Review	Republic of Ireland	Russian Federation	Libya	Iran	Myanmar	China Hong Kong SAR	US
UNICEF	Ireland	Russian Federation	Libyan Arab Jamahiriya	Iran, Islamic Republic of	Myanmar	Hong Kong, China	United States of America
Joint Oil Data Initiative	Ireland	Russian Federation	Libyan Arab Jamahiriya	Iran, Islamic Republic of	Myanmar	Hong Kong China	United States of America

It’s quite an interesting mix. The only “British” dataset is careful to use the full “Republic of Ireland” and apparently wants to appease China with its use of “China Hong Kong SAR”. Meanwhile, the Riyadh-based Joint Oil Data Initiative is careful about “Libyan Arab Jamahiriya” and the “Islamic Republic of” Iran. By contrast, the American datasets seem almost colloquial, leaving off all the formal titles and even using “Burma” instead of “Myanmar”. These names may be in common use in the US but they don’t always match other groups’ standards.

As data managers we always look for standard identifiers to represent well-defined concepts. Political units known as nations are an excellent example of such a well-defined concept. No one disagrees about what is meant whether one types “US”, “USA”, “United States” or “United States of America”. But for anyone working with data, especially those attempting to merge data from different sources, it makes a huge difference whether the national identifier is standardized or not. Writing code to catch all possible variations of English-language country names is just the sort of tedious exercise that can be avoided by the proper use of standard identifiers.

ISO 3166-1 alpha-2

Luckily, ISO 3166-1 alpha-2 is a widely adopted standard that will meet our needs. This standard specifies a unique, 2-character code for each nation. Once we have the codes, we can create tables of country names in different languages. Here is another table with the ISO 3166-1 alpha-2 codes and the recommended English and French names available from the ISO site.

code element	IE	RU	LY	IR	MM	HK	US
English names	IRELAND	RUSSIAN FEDERATION	LIBYAN ARAB JAMAHIRIYA	IRAN, ISLAMIC REPUBLIC OF	MYANMAR	HONG KONG	UNITED STATES
French names	IRLANDE	RUSSIE, FÉDÉRATION DE	LIBYENNE, JAMAHIRIYA ARABE	IRAN, RÉPUBLIQUE ISLAMIQUE D’	MYANMAR	HONG-KONG	ÉTATS-UNIS
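
To get a feel for the hand-maintained lookup that these codes let you retire, here is a minimal Python sketch. The `NAME_TO_CODE` table and the `to_iso_code` helper are hypothetical names of our own, and the table covers only the spellings that appear in the five datasets above, so treat it as an illustration rather than a complete mapping.

```python
# Hypothetical alias table: every spelling seen in the five datasets above,
# mapped to its ISO 3166-1 alpha-2 code. This is not an exhaustive list.
NAME_TO_CODE = {
    "ireland": "IE",
    "republic of ireland": "IE",
    "russia": "RU",
    "russian federation": "RU",
    "libya": "LY",
    "libyan arab jamahiriya": "LY",
    "iran": "IR",
    "iran, islamic republic of": "IR",
    "burma": "MM",
    "myanmar": "MM",
    "hong kong": "HK",
    "china hong kong sar": "HK",
    "hong kong, china": "HK",
    "hong kong china": "HK",
    "united states": "US",
    "us": "US",
    "united states of america": "US",
}


def to_iso_code(name):
    """Return the alpha-2 code for a known name variant, or None if unknown."""
    key = " ".join(name.lower().split())  # fold case, collapse whitespace
    return NAME_TO_CODE.get(key)


if __name__ == "__main__":
    for name in ("Burma", "Myanmar", "China Hong Kong SAR", "Atlantis"):
        print(name, "->", to_iso_code(name))
```

Seventeen entries for seven countries, and that is before anyone misspells anything. Once the data carries the code itself, the table is no longer needed.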

One of the truisms of scientific data management is that it is always more robust in the long run to store identifiers in a standardized, computer readable fashion. When human readable output is needed, code can be written to generate whatever local version of the standard identifier is needed. This is much preferable to storing human friendly identifiers that prevent data from different sources from being combined. Or worse, that work in some cases but not others.

It is quite possible to have your cake and eat it, too. We are not recommending that you create Excel spreadsheets that are devoid of human readable names, requiring your users to know that DZ stands for Algeria. Instead, just add another column labeled country_code or ISO_3166_1_alpha_2 or similar that contains the appropriate code. That way you and the users of your data will be able to write software to automatically process the data without worrying about whether a cell has “USA” or “United States”.
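
As a rough illustration of why the extra column pays off, the sketch below joins rows from two sources first by code and then by name. The country names are taken from the BP and UNICEF rows in the table above; the row layout and the `join_on_code` and `join_on_name` helpers are assumptions made for this example.

```python
# Rows as they might look after adding a country_code column. The names come
# from the BP and UNICEF rows above; the dict layout is an assumption.
bp_rows = [
    {"country_code": "IE", "country": "Republic of Ireland"},
    {"country_code": "RU", "country": "Russian Federation"},
    {"country_code": "US", "country": "US"},
]
unicef_rows = [
    {"country_code": "IE", "country": "Ireland"},
    {"country_code": "RU", "country": "Russian Federation"},
    {"country_code": "US", "country": "United States of America"},
]


def join_on_code(left, right):
    """Pair up rows that share a country_code, ignoring the display names."""
    right_by_code = {row["country_code"]: row for row in right}
    return [
        (row["country_code"], row["country"],
         right_by_code[row["country_code"]]["country"])
        for row in left
        if row["country_code"] in right_by_code
    ]


def join_on_name(left, right):
    """The fragile alternative: keep rows only when the names match exactly."""
    right_names = {row["country"] for row in right}
    return [row["country"] for row in left if row["country"] in right_names]


if __name__ == "__main__":
    print(join_on_code(bp_rows, unicef_rows))  # all three countries pair up
    print(join_on_name(bp_rows, unicef_rows))  # only the identical spelling survives
```

The human readable column is still there for your readers; the code column is what the software joins on.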

If, on the other hand, your identifiers live in a database and will be processed before a human ever sees them, then you should only store the identifiers and have the presentation layer of software convert the coded identifiers into human readable strings, preferably in the reader’s language of choice.
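
A presentation layer of this kind can be very small. The sketch below uses a few of the English and French names from the ISO table above; the `DISPLAY_NAMES` table and `display_name` function are hypothetical, and the fallback behavior (English first, then the bare code) is simply one reasonable choice.

```python
# Display names keyed by language, then by ISO 3166-1 alpha-2 code.
# Only three countries from the ISO table above are included here.
DISPLAY_NAMES = {
    "en": {"IE": "IRELAND", "RU": "RUSSIAN FEDERATION", "US": "UNITED STATES"},
    "fr": {"IE": "IRLANDE", "RU": "RUSSIE, FÉDÉRATION DE", "US": "ÉTATS-UNIS"},
}


def display_name(code, language="en"):
    """Turn a stored code into a label for the reader.

    Falls back to the English name when the language or the translation is
    missing, and to the bare code when the country is unknown.
    """
    english = DISPLAY_NAMES["en"]
    names = DISPLAY_NAMES.get(language, english)
    return names.get(code, english.get(code, code))


if __name__ == "__main__":
    for language in ("en", "fr"):
        print(language, [display_name(code, language) for code in ("IE", "RU", "US")])
```

The database only ever sees "IE", "RU" and "US"; which string the reader sees is decided at the last moment.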

FIPS 10-4

Over the years, of course, other standards have come and gone. Perhaps the most important competing 2-character code was the FIPS 10-4 standard developed by the US National Institute of Standards and Technology. This standard was withdrawn by NIST in 2008 in favor of the ISO 3166 standard.

Unfortunately, as of 2010, not every US agency has abandoned its old-fashioned ways. Databases for the Energy Information Administration Country Energy Profiles and the CIA World Factbook still use the old FIPS 10-4 codes, so you need to be careful if you are accessing those databases programmatically. (If you aren’t, you may be surprised at Switzerland’s huge population and coal production: in ISO 3166 ‘CH’ means Switzerland but in FIPS 10-4 ‘CH’ means China.)
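
One defensive approach is to translate FIPS codes explicitly at the point where the data enters your system. The sketch below covers only the two countries involved in the ‘CH’ collision. The ‘CN’ and ‘SZ’ codes are not mentioned in the text above (they are the ISO code for China and the FIPS 10-4 code for Switzerland), and the `FIPS_TO_ISO` table and `fips_to_iso` function are hypothetical names for this example.

```python
# Partial FIPS 10-4 -> ISO 3166-1 alpha-2 translation table. Only the two
# countries involved in the 'CH' collision are listed; a real table would
# need an entry for every country in the source database.
FIPS_TO_ISO = {
    "CH": "CN",  # FIPS 'CH' is China, which ISO calls 'CN'
    "SZ": "CH",  # FIPS 'SZ' is Switzerland, which ISO calls 'CH'
}


def fips_to_iso(fips_code):
    """Translate a FIPS 10-4 code, refusing to guess at unmapped codes."""
    try:
        return FIPS_TO_ISO[fips_code]
    except KeyError:
        raise ValueError("no ISO mapping recorded for FIPS code %r" % fips_code)


if __name__ == "__main__":
    for code in ("CH", "SZ"):
        print("FIPS", code, "-> ISO", fips_to_iso(code))
```

Raising an error for unmapped codes is deliberate: passing a FIPS code through unchanged is exactly how China’s coal ends up in Switzerland.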

Python Babel module

As an added incentive to encourage the use of these country codes we would like to introduce you to the Python Babel module. The Babel module helps with internationalization (aka i18n) and we will use it to create a table of country names in different languages from our standardized codes. Do not be put off by the geeky nature of the Babel home page. You don’t need to learn much to make use of this excellent module.

Installation is a snap on Mac OS X or Ubuntu Linux:

pip install Babel

Translations for our country codes are found in Locale objects in the module, one for each language. As you might expect, the supported languages are themselves encoded, this time with two-letter ISO 639-1 language codes. We’ll reserve detailed description of the Babel module for another post. For now, let’s cut to the chase and show the code that will generate HTML rows containing the list of country names in the languages of those countries:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from babel import Locale

# NOTE:  'my' dropped as the Burmese characters won't display on my Mac
for language in ('ga','ru','ar','fa','zh','en'):
    # Each Locale carries the country names translated into that language
    locale = Locale(language)
    print('<tr>')
    print('<td>' + language + '</td>')

    # One table cell per ISO 3166-1 alpha-2 country code
    for country in ('IE','RU','LY','IR','MM','HK','US'):
        print('<td>' + locale.territories[country] + '</td>')

    print('</tr>')

A handful of lines of code and here is the result:

ga	Éire	Cónaidhm na Rúise	An Libia	An Iaráin	Maenmar	R.R.S. na Síne Hong Cong	Stáit Aontaithe Mheiriceá
ru	Ирландия	Россия	Ливия	Иран	Мьянма	Гонконг, Особый Административный Район Китая	США
ar	أيرلاندا	روسيا	ليبيا	ايران	ميانمار	هونج كونج الصينية	الولايات المتحدة الأمريكية
fa	ایرلند	روسیه	لیبی	ایران	میانمار	هنگ‌کنگ، ناحیهٔ ویژهٔ حکومتی چین	ایالات متحدهٔ امریکا
zh	爱尔兰	俄罗斯	利比亚	伊朗	缅甸	中国香港特别行政区	美国
en	Ireland	Russia	Libya	Iran	Myanmar	Hong Kong SAR China	United States

Pretty amazing, eh?

So now you know why we always use standardized codes for country names and languages. And you should too!

A previous version of this article originally appeared in 2010 at WorkingwithData.
