Strata Gems: The timeless utility of sed and awk

A little command line knowledge goes a long way

We’re publishing a new Strata Gem each day all the way through to December 24. Yesterday’s Gem: Where to find data.

Edison famously said that genius is 1% inspiration and 99% perspiration. Much the same can be said for data analysis. The business of obtaining, cleaning and loading the data often takes the lion’s share of the effort.

Now over 30 years old, the UNIX command line utilities sed and awk are useful tools for cleaning up and manipulating data. In their Taxonomy of Data Science, Hilary Mason and Chris Wiggins note that when cleaning data, “Sed, awk, grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest.” A little aptitude with command line tools can go a long way.

sed is a stream editor: it operates on data in a serial fashion as it reads it. You can think of sed as a way to batch up a bunch of search and replace operations that you might perform in a text editor. For instance, this command replaces every instance of “foo” with “bar” in the contents of a file and writes the result to standard output, leaving the file itself unchanged:

sed -e 's/foo/bar/g' myfile.txt
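
To illustrate the idea of batching several search and replace operations, here is a minimal sketch. The sample file contents and the second substitution (“old” to “new”) are assumptions made up for this example; each -e option adds one more editing command to the same pass.

```shell
# Create a throwaway sample file (contents are hypothetical, for illustration).
sample=$(mktemp)
printf 'foo one\nold foo two\n' > "$sample"

# Apply two substitutions in a single pass over the file.
# sed prints the edited text to standard output; the file is not modified.
result=$(sed -e 's/foo/bar/g' -e 's/old/new/g' "$sample")
printf '%s\n' "$result"
```

Redirecting the output to a new file is the portable way to keep the edited version.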

Anybody who has used regular expressions within a text editor or programming language will find sed easy to grasp. Awk takes a little more getting used to. A record-oriented tool, awk is the right tool to use when your data contains delimited fields that you want to manipulate.

Consider this list of names, which we’ll imagine lives in the file presidents.txt.

George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe

To extract just the first names, we can use the following command:

$ awk '{ print $1 }' presidents.txt
George
John
Thomas
James
James

Or, to just find those records with “James” as the first name:

$ awk '$1 ~ /James/ { print }' presidents.txt
James Madison
James Monroe
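
One detail worth sketching: the ~ operator tests whether the field contains a match for the pattern, so /James/ would also match a longer first name. The record “Jameson Smith” and the temporary file below are made up purely to show the difference; anchoring the pattern, or comparing with ==, restricts the match to exactly “James”.

```shell
# Two records, one of them hypothetical, to show how the match behaves.
names=$(mktemp)
printf '%s\n' 'James Madison' 'Jameson Smith' > "$names"

# Unanchored regular expression: matches any first field containing "James".
regex_match=$(awk '$1 ~ /James/ { print }' "$names")

# Anchored regular expression: the first field must be exactly "James".
anchored_match=$(awk '$1 ~ /^James$/ { print }' "$names")

# String comparison: also requires the first field to equal "James".
exact_match=$(awk '$1 == "James" { print }' "$names")

printf 'regex:\n%s\nanchored:\n%s\nexact:\n%s\n' \
  "$regex_match" "$anchored_match" "$exact_match"
```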

Awk can do a lot more, and features programming concepts such as variables, conditionals and loops. But just a basic grasp of how to match and extract fields will get you far.
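
As a rough sketch of those programming features, the following counts how often each first name appears and reports only the repeated ones. It recreates the presidents list in a temporary file so that it runs on its own; the array name seen is an arbitrary choice for this example.

```shell
# Recreate the list of names from the article in a temporary file.
presidents=$(mktemp)
printf '%s\n' 'George Washington' 'John Adams' 'Thomas Jefferson' \
  'James Madison' 'James Monroe' > "$presidents"

counts=$(awk '
  { seen[$1]++ }            # variable: an array keyed by first name
  END {
    for (name in seen)      # loop: visit every distinct first name
      if (seen[name] > 1)   # conditional: keep only names seen more than once
        print name, seen[name]
  }
' "$presidents")
printf '%s\n' "$counts"
```

With the five names above, only “James” occurs more than once, so the script prints a single line.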

For more information, attend the Strata Data Bootcamp, where Hilary Mason is an instructor, or read sed & awk.

  • Piers Harding

    Hi -
    Little typo in awk ‘$5 ~ /James/ { print }’ presidents.txt -> should be $1

    Cheers.

  • http://radar.oreilly.com/edd Edd Dumbill

    Piers, thanks for the catch. I updated the post to incorporate the fix.

  • http://www.jfc.org.uk/ James Carter

    I wouldn’t be troubling awk with your two examples. sed is more than capable.

    Extract the first names:
    sed 's/ .*$//g'

    Only the Jameses:
    sed '/^James /!d'

    If sed isn’t enough (not as often as you’d think) I tend to skip awk and go straight for ruby these days.

  • Paul McCullough

    Don’t forget regex – it cuts across sed, grep, ruby. Does anyone advocate avoiding regex altogether?

  • http://viewstate.wordpress.com Nitin Ahuja

    Awk is awesome but another really interesting tool to analyze data is Microsoft’s Logparser. It can be used to parse data using a SQL like syntax. Really powerful utility.

  • http://www.corventis.com Andrew Brown

    I put perl and awk in the same bag. They are powerful and terse but even the author can have trouble reading the program in 6 months.

    Python is the current answer. Its syntax is easy to read, and the language is alive and growing.

  • Patrick Carroll

    Back in the day, when I was working for BNR, I remember using sed and awk in a collection of utilities used to cut over to a Nortel AccessNode.

    The basic idea was to read the provisioning of the existing piece of equipment, rewrite to TL/1 commands to provision the AccessNode, and cut over.

    Worked beautifully.

    I really like the UNIX approach of pipes and filters.