Here are a few of the data stories that caught my attention this week:
Twitter’s coming Storm
In a blog post late last week, Twitter announced that it plans to open source Storm, its Hadoop-like data processing tool. Storm was developed by BackType, the social media analytics company that Twitter acquired last month. Several of BackType’s other technologies, including ElephantDB, have already been open sourced, and Storm will join them this fall, according to Nathan Marz, formerly of BackType and now of Twitter.
Marz’s post digs into how Storm works as well as how it can be applied. He notes that a Storm cluster is only “superficially similar” to a Hadoop cluster. Instead of running MapReduce “jobs,” Storm runs “topologies.” One of the key differences is that a MapReduce job eventually finishes, whereas a topology processes messages “forever (or until you kill it).” This makes Storm useful for, among other things, processing real-time streams of data, continuous computation, and distributed RPC.
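The job-versus-topology distinction is easiest to see in code. Storm’s actual API is Java-based and not yet public, so the sketch below is only a conceptual illustration in Python — the names “spout” (a stream source) and “bolt” (a stream transformer) are borrowed from Storm’s terminology, not its API:

```python
import itertools

def spout():
    # Emits an unbounded stream of messages (here, just an infinite counter).
    yield from itertools.count()

def bolt(stream):
    # Transforms each message as it arrives; like the stream feeding it,
    # it never finishes on its own.
    for x in stream:
        yield x * x

# A MapReduce job runs over a finite input and terminates. A topology
# keeps consuming messages "forever (or until you kill it)" -- here we
# "kill" it by taking only the first five results.
topology = bolt(spout())
first_five = list(itertools.islice(topology, 5))
print(first_five)  # [0, 1, 4, 9, 16]
```

The point of the sketch: termination is a property of the consumer, not the pipeline — the topology itself would happily run on.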
Touting the technology’s ease-of-use, Marz lists the following complexities handled “under the hood”: guaranteed message processing, robust process management, fault detection and automatic reassignment, efficient message passing, and local mode and distributed mode. More details — and more documentation — will follow on September 19, when Storm is officially open sourced.
Mapping the London riots
Using real-time social streams and mapping tools in a crisis situation is hardly new. We’ve seen citizens, developers, journalists, and governments alike undertake these efforts following a multitude of natural disasters. But the violence that erupted in London over the weekend has proven yet again that these data tools are important both for safety and for analysis and understanding. Indeed, as journalist Kevin Anderson argued, “data journalists and social scientists should join forces” to understand the causes and motivations for the riots, rather than the more traditional “hours of speculation on television and acres of newsprint positing theories.”
NPR’s Matt Stiles was just one of the data journalists who picked up the mantle. Using data from The Guardian, he created a map that highlighted riot locations, overlaid on a colored representation of “indices of deprivation.” This makes for a pretty compelling visualization, demonstrating that the areas with the most incidents of violence are also the least well-off areas of London.
In a reflective piece in PaidContent, James Cridland examined his experiences trying to use social media to map the riots. He created a Google Map on which he marked “verified incident areas.” As he describes it, however, that verification became quite challenging. His “lessons learned” included realizations about what constitutes a reliable source.
“Twitter is not a reliable source: I lost count of the amount of times I was told that riots were occurring in Derby or Manchester. They weren’t, yet on Twitter they were being reported as fact, despite the Derbyshire Constabulary and Greater Manchester Police issuing denials on Twitter. I realised that, in order for this map to be useful, every entry needed to be verified, and verifiable for others too. For every report, I searched Google News, Twitter, and major news sites to try and establish some sort of verification. My criteria was that something had to be reported by an established news organisation (BBC, Sky, local newspapers) or by multiple people on Twitter in different ways.”
Cridland points out that the traditional news media wasn’t reliable either; the BBC, for example, reported disturbances that never occurred or misreported their locations.
“Many people don’t know what a reliable source is,” he concludes. “I discovered it was surprisingly easy to check the veracity of claims being made on Twitter by using the Internet to check and cross-reference, rather than blindly retweet.”
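Cridland’s verification rule is precise enough to write down. The sketch below is a hypothetical encoding, not anything from his post — the function and field names are invented for illustration:

```python
# Cridland's criteria: trust a report if an established news organisation
# carried it, OR if multiple people on Twitter described it "in different
# ways" (i.e., independently, not as a retweet chain).
ESTABLISHED_OUTLETS = {"BBC", "Sky"}  # plus local newspapers, per the post

def is_verified(reports):
    """Each report is a dict with 'source', 'user', and 'text' keys."""
    if any(r["source"] in ESTABLISHED_OUTLETS for r in reports):
        return True
    tweets = [r for r in reports if r["source"] == "Twitter"]
    users = {r["user"] for r in tweets}
    wordings = {r["text"] for r in tweets}
    # Identical wording from many accounts suggests blind retweeting,
    # not independent confirmation -- so require distinct wordings too.
    return len(users) >= 2 and len(wordings) >= 2

# Three accounts repeating the same claim verbatim do not verify it:
retweet_chain = [
    {"source": "Twitter", "user": u, "text": "RT riots in Derby!!"}
    for u in ("a", "b", "c")
]
print(is_verified(retweet_chain))  # False
```

The interesting design choice is the last line of the function: counting distinct wordings, not just distinct accounts, is what separates independent reports from a retweet cascade.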
When data disappears
Following the riots in the U.K., there is now a trove of data — from Blackberry Messenger, from Twitter, from CCTV — that the authorities can utilize to investigate “what happened.” There are also probably plenty of people who wish that data would just disappear.
What happens when that actually happens? How can we ensure that important digital information is preserved? Those were the questions asked in an op-ed in Sunday’s New York Times. Kari Kraus, an assistant professor in the College of Information Studies and the English department at the University of Maryland, makes a strong case for why “digitization” isn’t really the end of the road when it comes to preservation.
“For all its many promises, digital storage is perishable, perhaps even more so than paper. Disks corrode, bits “rot” and hardware becomes obsolete.
“But that doesn’t mean digital preservation is pointless: if we’re going to save even a fraction of the trillions of bits of data churned out every year, we can’t think of digital preservation in the same way we do paper preservation. We have to stop thinking about how to save data only after it’s no longer needed, as when an author donates her papers to an archive. Instead, we must look for ways to continuously maintain and improve it. In other words, we must stop preserving digital material and start curating it.”
She points to the efforts made to curate and preserve video games, work that highlights the challenge of saving not just the content — the games — but also the technology: NES cartridges, for example, as well as the gaming systems themselves. “It might seem silly to look to video-game fans for lessons on how to save our informational heritage, but in fact complex interactive games represent the outer limit of what we can do with digital preservation.” By figuring out the complexities of preserving this sort of material — a game and its console, for example — we can get a better sense of how to develop systems to preserve other things, whether it’s our Twitter archives, digital maps of London, or genetic data.
Got data news?
Send me an email.