Ten UNIX commands every data manager should know

Working with data from varied sources can be frustrating — some data will be in CSV format; some in XML; some available as HTML pages; other data as relational databases or MS Excel spreadsheets.

This post will cover the UNIX tools that every data manager needs to be familiar with in order to work with varied data sources.

When data formats are inconsistent, the best thing to do is to find some lowest common denominator that is shared by all.  Ultimately, that common denominator is ASCII text and good data managers need to have a suite of tools to help them convert data files into ASCII and then to work with that ASCII text.

Unix shell basics

Readers should not be concerned if they are not Unix gurus.  We assume only a passing familiarity with Unix commands and access to a machine with a Unix shell terminal (e.g. Mac OS X).  However, you should already be familiar with the following basic commands:

  • pwd — print the working directory
  • mkdir — make a new directory
  • cd — change directory
  • ls — list contents of a directory
  • rm — remove a file
  • man — display a manual page for a command
  • echo — print something to stdout
  • ; — separator between Unix commands (lets you run several commands, one after another, on a single line)
  • \ — line continuation (spread a long command over multiple input lines)
  • | — pipe the output of one command as the input of another command
  • > — send the output of a command to a file
  • >> — append the output of a command to an existing file
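
As a quick illustration of several of these operators at once (notes.txt here is just a throwaway example file):

echo "first line" > notes.txt ; \
echo "second line" >> notes.txt
cat notes.txt
first line
second line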

Find out more about any Unix command by using the man command (e.g. man rm).  When a man page is being displayed you can use CTRL-f and CTRL-b to page forward and back, and q to quit and return to the shell prompt.

Here are the 10 commands you should try to master:

1) cat

The cat command concatenates one or more files and sends the result to standard output.  You can use cat at the beginning of a Unix pipeline.
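
For example, to glue two (hypothetical) monthly CSV files into one and save the result:

cat january.csv february.csv > both_months.csv

Given a single file, cat simply streams its contents into the next command in a pipeline.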

2) head & tail

The head and tail commands allow you to trim a file to include only the first (or last) N lines.
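
The two can also be chained to clip a block of lines out of the middle of a file, a trick we will rely on later in this post:

head -20 datafile.txt | tail -10    # lines 11 through 20 of (a hypothetical) datafile.txt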

3) wc

Use wc to count the number of lines, words or bytes in a file.

4) cut

The cut command cuts a column out of an ASCII file based on character position or some delimiter.

5) paste

The paste command can take two files with the same number of lines and join them together into a single file.

Interlude

To exercise these first few commands we will need two sample files, one with rownames and another with rowdata.  For mysterious reasons, these files begin life separated.  (Detailed explanations of the use of each command are not given.  Read the man pages to learn more about how command line flags change the default behavior of each command.)

rownames.txt

"name"
a
b
c
d
e
f
g
h
i
j

rowdata.txt

"integer value","float value","2-digit ID"
1,1.0,"01"
2,2.0,"02"
3,3.0,"03"
4,4.0,"04"
5,5.0,"05"
6,6.0,"06"
7,7.0,"07"
8,8.0,"08"
9,9.0,"09"
10,10.0,"10"
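
If you would like to follow along, one way to create these files is with a shell ‘here document’ (any text editor works just as well):

cat > rownames.txt << 'EOF'
"name"
a
b
c
d
e
f
g
h
i
j
EOF

rowdata.txt can be created the same way by pasting in its eleven lines.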

Now we can use our Unix commands to do some useful work.

Assuming we had never seen these files before, we could take a quick peek at the beginning of the rowdata.txt file to see what it looks like:

head -3 rowdata.txt
"integer value","float value","2-digit ID"
1,1.0,"01"
2,2.0,"02"

After similarly checking on rownames.txt we might inquire whether the files have the same number of lines:

wc -l row*.txt
 11 rowdata.txt
 11 rownames.txt
 22 total

They each have eleven lines so we could use paste to join them into a single file.  But perhaps we only want the “2-digit ID” column from rowdata.txt.  Let’s first practice cutting out that column:

cut -d',' -f3 rowdata.txt
"2-digit ID"
"01"
"02"
"03"
"04"
"05"
"06"
"07"
"08"
"09"
"10"

Excellent.  Now let’s build a combined dataset by cutting out the third column of rowdata.txt and piping it as input to the paste command, which we will tell to use the TAB delimiter:

cut -d',' -f3 rowdata.txt | paste -d'\t' rownames.txt - 
"name" "2-digit ID"
a    "01"
b    "02"
c    "03"
d    "04"
e    "05"
f    "06"
g    "07"
h    "08"
i    "09"
j    "10"

As we see, the real power of the Unix shell environment is the ability to string together commands that have tightly focused functionality.

Now on to the remaining, more complex commands.

6) grep

The grep command is a very powerful pattern matching command that uses regular expressions to match individual lines in a file.  Learning how to use regular expressions will take a while but they are extremely powerful once you do.
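
For a small taste, using the rowdata.txt file from the Interlude, this keeps only the lines whose first field is a single digit:

grep '^[0-9],' rowdata.txt

That prints nine rows; the header line and the ‘10’ row do not match the pattern.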

7) sort

The sort command takes a bunch of lines of text as input, sorts them and writes them to stdout.
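
Note that the default ordering is lexical, so ‘10’ sorts ahead of ‘2’; the -n flag sorts numerically and the -u flag removes duplicate lines.  A quick check using the first column of rowdata.txt from the Interlude (the grep -v step just drops the header line):

cut -d',' -f1 rowdata.txt | grep -v value | sort     # lexical order: 1, 10, 2, 3, ...
cut -d',' -f1 rowdata.txt | grep -v value | sort -n  # numeric order: 1, 2, ..., 10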

8) sed

The Unix stream editor processes input one line at a time and can do all kinds of manipulations on the text based on a special set of commands passed to sed.  The sed command uses basic regular expressions to find text to manipulate, the same flavor that grep uses by default.  One common source of confusion: do not use the newline character, ‘\n’, in sed regular expressions; because sed works on one line at a time with the newline already stripped, it will never match.  Thankfully, there are many examples to learn from.  You should spend some time looking at the sed oneliners page to see what is possible.
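
As a quick illustration, the substitution command s/old/new/ is the workhorse; here we strip all of the double quotes out of rowdata.txt from the Interlude:

sed 's/"//g' rowdata.txt
integer value,float value,2-digit ID
1,1.0,01
...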

9) curl

Use the curl command to get the contents at a particular URL.  This command has a scary number of arguments but is super helpful for grabbing data from a web site.
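
Two flags worth knowing right away: -s silences the progress meter (handy in pipelines) and -o saves the result to a named file, while -I fetches only the response headers.  A minimal sketch (the URL below is a placeholder, not a real dataset):

curl -s http://example.com/data.csv -o data.csv    # fetch quietly, save to data.csv
curl -sI http://example.com/data.csv | head -1     # peek at just the HTTP status line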

10) xmllint

The xmllint command is very useful for ‘pretty printing’ data in XML format so that tools like grep and sed can work on the output one line at a time.

Putting it all Together

In the following three examples we will use our set of tools to perform the following tasks:

  1. process the latest data from the CDC to review the progress of the H1N1 flu virus
  2. convert a human-friendly but machine-unreadable set of unemployment data into a simple CSV file
  3. examine the consistency of data formats used within a Federal XML dataset

Getting regular updates from the CDC

Our first example will involve the weekly flu report from the CDC.  The data are available at http://www.cdc.gov/flu/weekly/flureport.xml.

This is an XML dataset that can best be processed with XML tools.  But for those who don’t live and breathe XML this can be an onerous task.  Instead we will gather some information of interest using our Unix tools.  Because this file is updated weekly we will want to create a script that works on the remote data.  Looking at the data we determine that the last 273 lines correspond to the most recent week:

curl http://www.cdc.gov/flu/weekly/flureport.xml | tail -273
...

To figure out what reporting levels are possible we examine the contents of the <label> element for the entire XML document.  We use curl in ‘silent mode’, grep for only those lines containing ‘label’ and then use cut to snip out that portion of each line between ‘>’ and ‘<’.  Finally, we sort with the ‘-u’ flag to remove duplicate results:

curl -s http://www.cdc.gov/flu/weekly/flureport.xml | grep label | \
cut -d'>' -f2 | cut -d'<' -f1 | sort -u
Local Activity
No Activity
No Report
Regional
Sporadic
Widespread

For a quick update on how many states and territories are reporting ‘Widespread’ flu we can use grep and wc:

curl -s http://www.cdc.gov/flu/weekly/flureport.xml | tail -273 | \
grep 'Widespread' | wc -l
24

This presentation of information isn’t as nice as the CDC’s page but serves to demonstrate how one can go about harvesting XML data from the web without writing any code.


2016-09-09 NOTE:  Examples below here reference sources of data that no longer exist. They are left in for the command line examples they provide.


Converting human-friendly data into machine readable data

The dataset we will be working with this time is titled “Employment status of the civilian non-institutional population 16 years and over by sex, 1973 to date” from the Bureau of Labor Statistics.  You may wish to look at the file in a new window.

First we download the data to a local file and look at the file to figure out which lines we want:

curl ftp://ftp.bls.gov/pub/special.requests/lf/aat2.txt -o Employment_Stats.txt
cat -n Employment_Stats.txt
...

To extract the data just for the women we will clip out the relevant section and grep for the data lines, which contain runs of literal dots:

head -103 Employment_Stats.txt | tail -39 | grep '\.\.\.'
...

Because this file has constant column widths we can cut out the columns for “Year”, “Percent of Total Population” and “Percent of Employed”, making sure to include one space character before the latter data columns:

head -103 Employment_Stats.txt | tail -39 | grep '\.\.\.' | cut -c 6-9,61-65,83-87
...

Now just convert the spaces to commas and include a header line:

echo '"year","percent of population","percent of employed"'; head -103 Employment_Stats.txt | \
tail -39 | grep '...' | cut -c 6-9,61-65,83-87 | \
sed 's/ /,/g' "year","percent of population","percent of employed"
1973,44.7,42.0
1974,45.7,42.6
1975,46.3,42.0
1976,47.3,43.2
...

Now we have a file that can be understood by modern software tools.  (Wasn’t that easier than writing a dedicated program to process this data?)

Investigating data consistency

For our last example we will work with the XML data contained in the Recreation Information Database (RIDB).  The RIDB server seems a little overtaxed so we download the file for further processing:

curl 'http://www.recdata.gov/RIDBWebService/RIDBService.jws/getAllRecElementsForOrgID?anOrgID=-1' -o RIDB.xml

This database contains information compiled from various agencies and is riddled with inconsistencies.  Let’s use our tools to examine just one.  As grep and sed work on a single line at a time we will need to ‘pretty print’ our RIDB.xml file to put each XML element on its own line.  This is where xmllint comes in:

xmllint --format RIDB.xml | head -20
<?xml version="1.0" encoding="utf-8"?>

...

Let’s take a few seconds to process this 25 MByte file to see which organizations have contributed data:

xmllint --format RIDB.xml | grep 'OrgName' | cut -d'>' -f2 | cut -d'<' -f1 | sort -u
American Battle Monuments Commission
Bureau of Engraving and Printing
Bureau of Land Management
Bureau of Reclamation
Department of Agriculture
Department of Commerce
Department of Defense
Department of the Interior
Department of the Treasury
Department of Transportation
Fairfax County
Fairfax County Park Authority
FEDERAL
Fish and Wildlife Service
National Archives and Records Administration
National Ocean Service
National Park Service
National Register of Historic Places
PRIVATE
ReserveAmerica
Smithsonian Institution
Smithsonian Institution Affiliations Program
STATE
Tennessee Valley Authority
US Air Force
US Army Corps of Engineers
USDA Forest Service
Wilderness Institute
Wilderness.net

Now let’s take a closer look at the phone numbers to see if they are consistently formatted.  We’ll start by exporting them to their own file:

xmllint --format RIDB.xml | grep 'RecAreaPhone' | cut -d'>' -f2 | \
cut -d'<' -f1 | sort -u -n > sorted_phone.txt

Reviewing the sorted_phone.txt file we see the various formats we might expect:

  • (###) ###-####
  • (###) ### ####
  • ###-###-####
  • ###.###.####

as well as some serious outliers.  Most of the offenders can be found by checking for letters:

grep '[a-zA-Z]' sorted_phone.txt
207-827-6138, ext. 17
304-636-1800 x280
410-228-2692 x118
413-548-8002 ext. 111
505-761-8794 ext. 29
509-575-5848 X202
510-562-PARK
(706) 856-0300 or 888-893-0678
800-939-PARK
828-271-4779 ext 210
870-473-2869 Or 870-364-3167
907-246-4250 or&nbsp; 907-246-3339
907-456-0250 and 800-362-4546
970-882-2213 or 970-946-0461
Headquarters 202.485.9880 <br> Visitor Informaiton 202.426.6841 <br>
Headquarters 307-739-3300 <br>
Headquarters 503-861-2471 <br> Visitor Information 503-861-2471ext.214 <br> Visitor Information (TDD) 503-861-4408 <br> Education Program Information 503-861-4422 <br>
Headquarters 815/588-6040 <br>
Headquarters 907-442-8300 <br> Summer Visitor Information 907-442-3760 <br>
Headquarters (TDD) (615) 893-9501 <br> Visitor Information (TDD) (615) 893-9501 <br>
Visitor Information (301) 432-5124 <br> Headquarters (301) 432-7672 <br>
Visitor Information (308) 586-2581 <br>
Visitor Information 757-898-2410 <br>
Visitor Information (804) 732-3531 <br>

The last few entries are problematic.  Storing text descriptions and escaped HTML markup in a field reserved for telephone numbers is not really considered best practice.  The folks in charge of RIDB should do a little more data validation before merging the various sources that contribute to their database.

So there you have it.  You can do data reformatting, remote data access and XML investigation with only a handful of Unix tools and essentially no programming.  Hone these skills and you will become wonderfully efficient at working with the datasets that exist out there ‘in the wild’.

Happy Exploring!


A previous version of this article originally appeared in 2009 at WorkingwithData.
