ENTRIES TAGGED "data exhaust"

Strata Week: The Open Data Institute aims to mine the gold in open government data

The ODI's official launch, MIT's Kinect Kinetics project, and legal ways authorities are tracking us.

Here are a few stories from the data space that caught my attention this week.

Open government data gets a startup incubator

The Open Data Institute (ODI), founded by Tim Berners-Lee and artificial intelligence pioneer Nigel Shadbolt, officially launched this week in the U.K. As Berners-Lee and Shadbolt noted in “There’s gold to be mined from all our data” (PDF), the institute was initially funded and commissioned by the U.K. government to “help the public sector to use its own data more effectively,” and by “[w]orking with private companies and universities, it will also develop the capability of U.K. businesses to exploit open data, fostering a generation of open data entrepreneurs.” The institute’s mission is outlined on its website:

“The Open Data Institute will catalyse the evolution of an open data culture to create economic, environmental, and social value. It will unlock supply, generate demand, create and disseminate knowledge to address local and global issues. We will convene world-class experts to collaborate, incubate, nurture and mentor new ideas, and promote innovation. We will enable anyone to learn and engage with open data, and empower our teams to help others through professional coaching and mentoring.”

Jamillah Knowles reports at The Next Web that the institute is already hosting its first startups, including agile big data specialists Mastodon C; corporate information aggregator OpenCorporates; location-based data startup Placr; and Locatable, a startup aiming to help people find their perfect place to live.

Coinciding with the launch, the institute received an investment boost. As Ingrid Lunden reports at TechCrunch, the U.K. government has committed £10 million over the next five years (about $16 million); this week, investment firm Omidyar Network, co-founded by eBay founder Pierre Omidyar and his wife Pam, invested an additional $750,000 in the ODI. Lunden notes that though the ODI is focused on the U.K., having an international investment company on board “gives the effort a potential profile beyond these borders.”

In related news, O’Reilly Radar’s Alex Howard talked with open government developer Eric Mill, who together with GovTrack.us founder Josh Tauberer and New York Times developer Derek Willis released data on congressional legislation from THOMAS.gov, along with the scrapers used to collect it, into the public domain at github.com/unitedstates. Mill told Howard he hopes this work will serve as an example that encourages government to publish the information itself in the future:

“It would be fantastic if the relevant bodies published this data themselves and made these datasets and scrapers unnecessary. It would increase the information’s accuracy and timeliness, and probably its breadth. It would certainly save us a lot of work! Until that time, I hope that our approach to this data, based on the joint experience of developers who have each worked with it for years, can model to government what developers who aim to serve the public are actually looking for online.”

Howard’s full interview with Mill covers building the scraper and the accompanying dataset, using GitHub as a platform, and how the data is being used.

Strata Week: Real-time Hadoop

Cloudera ventures into real-time queries with Impala, data centers are the new landfill, and Jesper Andersen looks at the relationship between art and data.

Here are a few stories from the data space that caught my attention this week.

Cloudera’s Impala takes Hadoop queries into real-time

Cloudera ventured into real-time Hadoop querying this week, opening up its Impala software platform. As Derrick Harris reports at GigaOm, Impala, an SQL query engine, doesn’t rely on MapReduce, making it faster than tools such as Hive. Cloudera estimates its queries run 10 times faster than Hive, and Charles Zedlewski, Cloudera’s VP of products, told Harris that “small queries can run in less than a second.”

According to Harris, Zedlewski pointed out that Impala wasn’t designed to replace business intelligence (BI) tools, and that “Cloudera isn’t interested in selling BI or other analytic applications.” Rather, Impala serves as the execution engine, still relying on software from Cloudera partners. As Zedlewski told Harris, “We’re sticking to our knitting as a platform vendor.”

Joab Jackson at PC World reports that “[e]ventually, Impala will be the basis of a Cloudera commercial offering, called the Cloudera Enterprise RTQ (Real-Time Query), though the company has not specified a release date.”

Impala has plenty of competition on this playing field, which Harris also covers, and he notes the significance of all the recent Hadoop innovation:

“I can’t underscore enough how critical all of this innovation is for Hadoop, which in order to add substance to its unparalleled hype needed to become far more useful to far more users. But the sudden shift from Hadoop as a batch-processing engine built on MapReduce into an ad hoc SQL querying engine might leave industry analysts and even Hadoop users scratching their heads.”

Harris’ and Jackson’s pieces cover Impala in more detail. Wired also has an interesting piece on Impala, covering the Google F1 database upon which it is based and the Googler Cloudera hired away to help build it.

(Cloudera CEO Mike Olson discussed Impala, Hadoop and the importance of real-time at this week’s Strata Conference + Hadoop World.)

Visualization of the Week: The living city

Visualizing the urban flow of Geneva.

This week's visualization illustrates the paths people take through Geneva, based on the digital traces left by their mobile phones.
