Here are a few of the data stories that caught my attention this week.
Infochimps makes its big data expertise available in a platform
The big data marketplace Infochimps announced this week that it will begin offering the platform that it’s built for itself to other companies, as both a platform-as-a-service and an on-premise solution. “The technical needs for Infochimps are pretty substantial,” says CEO Joe Kelly, and the company now plans to help others get up to speed with implementing a big data infrastructure.
Infochimps has offered datasets for download or via API for a number of years (see my May 2011 interview with the company here), but the startup is now making the transition to offer its infrastructure to others. Likening its big data marketplace to an “iTunes for data,” Infochimps says it’s clear that we still need a lot more “iPods” in production before most companies are able to handle the big data deluge.
Infochimps will now offer its in-house expertise to others. That includes a number of tools that one might expect: AWS, Hadoop, and Pig. But it also includes Ironfan, Infochimps’ management tool built on top of Chef.
Infochimps isn’t abandoning the big data marketplace piece of its business. However, its move to support companies with their big data efforts is an indication that there’s still quite a bit of work to do before everyone’s ready to “do stuff” with the big data we’re accumulating.
How do you anonymize online publications?
A fascinating piece of research is set to appear at IEEE S&P on the subject of Internet-scale authorship identification based on “stylometry,” which is the analysis of writing style. The paper was co-authored by Arvind Narayanan, Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song. The researchers were able to correctly identify writers 20% of the time based on what those writers had previously published online. It’s a finding with serious implications for online anonymity and free speech, the team notes.
“The good news for authors who would like to protect themselves against de-anonymization is it appears that manually changing one’s style is enough to throw off these attacks,” says Narayanan.
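For readers curious what “analysis of writing style” can look like in practice, here is a minimal sketch. It is not the method from the paper: it assumes a tiny, hypothetical list of function words as the only style feature and uses cosine similarity to pick the closest known author. The names `FUNCTION_WORDS`, `style_vector` and `most_similar_author` are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature set (an assumption for illustration);
# real stylometric systems use far richer features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "it", "is", "i", "but", "however"]


def style_vector(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_author(anonymous_text, known_texts):
    """Return the author in known_texts whose writing is closest in style."""
    target = style_vector(anonymous_text)
    return max(
        known_texts,
        key=lambda author: cosine(target, style_vector(known_texts[author])),
    )
```

Calling `most_similar_author(anonymous_post, {"alice": alice_posts, "bob": bob_posts})` returns the name whose known writing has the most similar function-word profile. A toy like this also hints at why deliberately changing one’s style can help: shift the word frequencies and the nearest match shifts with them.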
Open data for the public good
O’Reilly Media has just published a report on “Data for the Public Good.” In the report, Alex Howard makes the argument for a systemic approach to thinking about open data and the public sector, examining the case for a “public good” around public data as well as around governmental, journalistic, healthcare, and crisis situations (to name but a few scenarios and applications).
Howard notes that the success of recent open data initiatives “won’t depend on any single chief information officer, chief executive or brilliant developer. Data for the public good will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, in whatever form it is delivered.” Although many municipalities have made the case for open data initiatives, there’s more to the puzzle, Howard argues, including recognizing the importance of personal data and making the case for a “hybridized public-private data.”
The “Data for the Public Good” report is available for free as a PDF, ePUB, or MOBI download.
Got data news?
Feel free to email me.