If you live in the United States (and maybe even if you don’t), you probably spent a lot of time this week thinking about elections, politicians, and the frequent shortcomings of each. Here are some lessons from the data world that our pundits would be wise to heed.
Lesson 1: Sharing is caring
If you’ve ever used Yelp, you’re probably familiar with the “People Who Viewed this Also Viewed …” feature (tucked in the lower-right corner of the window, below the street map). That feature, unsurprisingly, relies on the ability to process a few months’ worth of access logs to analyze user behavior.
According to Dave Marin, a search and data-mining engineer at Yelp, the company generates something like 100GB of daily log data. And of course, in order to process all that data, they use distributed computing. That used to mean running MapReduce on an in-house Hadoop cluster, via a Python-package framework they call MRJob. But they found that very seldom did they make use of all the nodes … and when they did, a large batch job would delay lots of small jobs. Le sigh.
Then — insert sparkles here — the team discovered Amazon’s Elastic MapReduce (EMR) service. That’s when they decided to migrate their entire code base over to Amazon so they could dispose of their own Hadoop cluster and rent such services on an as-needed basis. Oh, and they also decided to share MRJob with all of us, so we can help make it better (and Amazon can sell EMR services to those of us without our own Hadoop clusters?).
- If you’re new to MapReduce and want to learn about it, MRJob is for you.
- If you want to run a huge machine learning algorithm, or do some serious log processing, but don’t want to set up a Hadoop cluster, MRJob is for you.
- If you have a Hadoop cluster already and want to run Python scripts on it, MRJob is for you.
- If you want to migrate your Python code base off your Hadoop cluster to EMR, MRJob is for you.
- (If you don’t want to write Python, MRJob is not so much for you. But we can fix that.)
To learn more or download the code, check out the GitHub page.
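For readers new to MapReduce, here is a minimal sketch of how an "also viewed" count could be built from access logs in two map/reduce steps. It is plain Python that simulates the steps in memory; it is not Yelp's code and does not use the MRJob API. The log records, the business names, and every function name are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical access-log records: (user_id, business_id).
LOG = [
    ("u1", "cafe"), ("u1", "bakery"),
    ("u2", "cafe"), ("u2", "bakery"), ("u2", "diner"),
    ("u3", "cafe"), ("u3", "diner"),
    ("u4", "bakery"),
]


def run_step(records, mapper, reducer):
    """Simulate one map/shuffle/reduce pass in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # the "shuffle": group values by key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def map_views(record):
    # Step 1 map: key each page view by the user who made it.
    user, business = record
    yield user, business


def reduce_user(user, businesses):
    # Step 1 reduce: emit every ordered pair of distinct businesses
    # that this one user viewed.
    for a, b in permutations(sorted(set(businesses)), 2):
        yield (a, b), 1


def map_identity(record):
    # Step 2 map: pass each ((a, b), 1) pair through unchanged.
    yield record


def reduce_sum(pair, counts):
    # Step 2 reduce: total the co-views for each pair.
    yield pair, sum(counts)


def also_viewed(log):
    pairs = run_step(log, map_views, reduce_user)
    totals = run_step(pairs, map_identity, reduce_sum)
    ranked = defaultdict(list)
    for (a, b), n in totals:
        ranked[a].append((b, n))
    for a in ranked:
        # Most co-viewed first; ties broken alphabetically.
        ranked[a].sort(key=lambda item: (-item[1], item[0]))
    return dict(ranked)


if __name__ == "__main__":
    for business, others in sorted(also_viewed(LOG).items()):
        print(business, others)
```

On a real cluster, the grouping that `run_step` does in memory is what Hadoop or EMR distributes across machines; the mapper and reducer functions are the only parts a framework such as MRJob would ask you to write.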
Lesson 2: History matters
The Committee on Data for Science and Technology (CODATA) launched an initiative on Oct. 29 to make a global inventory of “threatened data.” That includes myriad kinds of analog data, as well as digitized data stored on older, degradable media, such as floppy disks or magnetic tape.
One purpose of the inventory is to preserve old records and sources of data. But another purpose is to help researchers and preservationists prioritize what to save. It may not be possible to keep everything, but having an inventory will help us know where to focus our energies. As Nature News reports:
“Climate-change studies, for example, require data series on temperature and rainfall reaching back further than digital records. Some scientists are having to leaf through old ships’ logs for clues to past weather patterns.”
Politicians, I realize, sometimes prefer old documents to stay buried. But cataloging and saving information seems like a pretty good plan to me.
Lesson 3: Forget expensive suits
As elegant as this graphic is, only one word comes to mind: ugh.
Lesson 4: Confusion can cost you
Nowhere is the power of plain speech more quantifiably evident this week than in Expedia’s discovery that a confusing data field had been costing them $12 million a year.
Expedia analysts were trying to figure out why a number of people who had entered information like dates and credit card numbers, and then clicked the “Buy Now” button, never completed their transactions. They began correlating information about these events to discover patterns in the transaction failures.
Lo and behold, it turned out that the transactions were being rejected during credit card verification because customers were entering incorrect addresses. And why were they entering incorrect addresses, you ask?
“We had an optional field on the site under ‘Name’, which was ‘Company’,” Joe Megibow, Expedia’s VP of global analytics, told Silicon.com. “It confused some customers who filled out the ‘Company’ field with their bank name.”
These customers then went on to enter the address of their bank in the address fields.
The fix, of course, was simply to delete the confusing field. According to Megibow, this caused a major change “overnight,” leading to an additional $12 million in annual profit. And he says they have identified 50-60 more such changes by using analytics.
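The kind of correlation described here can be sketched in a few lines: split checkout attempts by whether an optional field was filled in, then compare how often card verification failed in each group. This is only an illustration, not Expedia's analysis; the records below are made up for the example, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical checkout attempts: (company_field_filled, verification_passed).
# These values are invented for illustration only.
ATTEMPTS = [
    (False, True), (False, True), (False, True), (False, False),
    (True, False), (True, False), (True, True), (False, True),
]


def failure_rate_by_flag(attempts):
    """Return the share of failed verifications for each flag value."""
    totals = Counter()
    failures = Counter()
    for flag, passed in attempts:
        totals[flag] += 1
        if not passed:
            failures[flag] += 1
    return {flag: failures[flag] / totals[flag] for flag in totals}


if __name__ == "__main__":
    rates = failure_rate_by_flag(ATTEMPTS)
    for flag in sorted(rates):
        print(f"company field filled={flag}: {rates[flag]:.0%} failed")
```

A large gap between the two groups does not prove the field is the cause, but it tells an analyst where to look next.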
Score one more for simplicity. For as every politician surely knows, the devil is in the details.