Here are a few stories from the data space that caught my attention this week.
Cloudera’s Impala takes Hadoop queries into real-time
Cloudera ventured into real-time Hadoop querying this week, releasing its Impala software platform as open source. As Derrick Harris reports at GigaOm, Impala — a SQL query engine — doesn’t rely on MapReduce, making it faster than tools such as Hive. Cloudera estimates its queries run 10 times faster than Hive’s, and Charles Zedlewski, Cloudera’s VP of products, told Harris that “small queries can run in less than a second.”
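To make the contrast concrete: because Impala executes SQL directly against the cluster rather than compiling it into MapReduce jobs, an ad hoc query can be fired off and answered interactively. The sketch below uses the community `impyla` client; the host, port, table name, and query are illustrative assumptions, not details from the article.

```python
# Sketch: issuing an ad hoc SQL query to an Impala daemon from Python.
# The table name "weblogs" and the host are hypothetical; 21050 is the
# default Impala HiveServer2-protocol port.

ADHOC_QUERY = (
    "SELECT page, COUNT(*) AS hits "
    "FROM weblogs GROUP BY page ORDER BY hits DESC LIMIT 10"
)

def run_adhoc_query(host="impala-host", port=21050):
    # Import deferred so the sketch loads even without impyla installed.
    from impala.dbapi import connect
    conn = connect(host=host, port=port)
    try:
        cur = conn.cursor()
        # Executes interactively on the Impala daemons -- no MapReduce
        # job is launched, which is where the speedup over Hive comes from.
        cur.execute(ADHOC_QUERY)
        return cur.fetchall()
    finally:
        conn.close()
```

The same query submitted through Hive would be planned as one or more MapReduce jobs, paying their startup latency each time — the overhead Impala avoids.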
Harris notes that Zedlewski pointed out that Impala wasn’t designed to replace business intelligence (BI) tools, and that “Cloudera isn’t interested in selling BI or other analytic applications.” Rather, Impala serves as the execution engine, still relying on software from Cloudera partners — Zedlewski told Harris, “We’re sticking to our knitting as a platform vendor.”
Joab Jackson at PC World reports that “[e]ventually, Impala will be the basis of a Cloudera commercial offering, called the Cloudera Enterprise RTQ (Real-Time Query), though the company has not specified a release date.”
Impala has plenty of competition on this playing field, which Harris also covers, and he notes the significance of all the recent Hadoop innovation:
“I can’t underscore enough how critical all of this innovation is for Hadoop, which in order to add substance to its unparalleled hype needed to become far more useful to far more users. But the sudden shift from Hadoop as a batch-processing engine built on MapReduce into an ad hoc SQL querying engine might leave industry analysts and even Hadoop users scratching their heads.”
You can read more from Harris’ piece here and Jackson’s piece here. Wired also has an interesting piece on Impala, covering the Google F1 database upon which it is based and the Googler Cloudera hired away to help build it.
(Cloudera CEO Mike Olson discussed Impala, Hadoop and the importance of real-time at this week’s Strata Conference + Hadoop World.)
Big data landfills?
The growing mountains of big data aren’t raising concerns only over the ever-increasing amounts of data exhaust and how it’s being used and manipulated, but also over the physical data centers being constructed to house all this data. Avere CEO Ron Bianchini has a guest post at Wired this week taking a look at the problem. He compares data centers to landfills in terms of the environmental concerns they raise, noting that the 509,147 data centers in the world occupy enough space to fill 5,955 football fields.
Bianchini presents several stats showing that the growing data problem, and hence the growing data center problem, isn’t slowing down. He notes that the number of Internet users around the world has more than doubled in five years — “from 1.043 billion (16% of the world’s population, June 2006) to 2.11 billion (30%, June 2011) (source: Internet World Stats)” — and that smartphone users are projected to increase from “500 million in 2011 to 2 billion by 2015 (International Telecommunications Union).”
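A quick check of the figures Bianchini cites bears out the “more than doubled” claim, and shows the projected smartphone growth is steeper still:

```python
# Figures as cited in Bianchini's post (Internet World Stats / ITU).
internet_users_2006 = 1.043e9   # June 2006, 16% of world population
internet_users_2011 = 2.11e9    # June 2011, 30% of world population
smartphone_users_2011 = 0.5e9   # 500 million
smartphone_users_2015 = 2.0e9   # projected 2 billion

internet_growth = internet_users_2011 / internet_users_2006
smartphone_growth = smartphone_users_2015 / smartphone_users_2011

print(round(internet_growth, 2))    # ~2.02x in five years
print(round(smartphone_growth, 1))  # 4.0x projected in four years
```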
He argues the root problem behind the ever-growing number of data centers is one of crippled innovation:
“Warehouses were ripped out, as technology innovators found more practical solutions for consumers and enterprises. So, why haven’t we found a similar solution for data? … Much of this is part and parcel to the big guys in the industry stunting the needed evolution so they can continue to sell legacy equipment that doesn’t address today’s problems. Ultimately, these storage godfathers cripple innovation and continue to grow and amplify the data center conundrum. The real answer is we need to look at this from a business process perspective and apply new ways of thinking architecturally …”
You can read more from Bianchini’s piece here.
Jesper Andersen on the intersection of art and data
O’Reilly Media’s online managing editor Mac Slocum sat down with Bloom Studios founder Jesper Andersen this week at Strata Conference + Hadoop World in New York City. Andersen addressed the use of data from several angles, including how big data can be used to create experiences and how visualizations are becoming interfaces. Andersen also addressed a question about the intersection of art and data, noting there are three main ways he sees this use of data evolving (at the 2:20 mark):
“The first one is the Aaron Koblin flight patterns take on things, where there’s a really lovely visual interaction or display. The data really supplies texture or complexity. The meaning may be secondary or come out over viewing the data through the visual transformation, but it’s not primary.
“I then also see another set of people like Ben Rubin and Listening Post and a lot of the work Jonathan Harris does, where the art becomes a presentation of the existence of the data. So, Ben Rubin would show IRC logs in chat rooms and such and show people who aren’t experienced with that data source that there’s this thing — the artistic interpretation was this giant LED screen. And Jonathan Harris is showing that people are writing their emotions down on the web, and that’s basically the hook of that piece and it was phenomenally successful, just showing this data exists.
“And then I see what may be the future — artists engaging with the data to find a connection. Something like what Stamen did with CNN to [visualize the military] deaths in Afghanistan, where they link the location of where a soldier was killed in Afghanistan with their home town in the United States in this dual map experience; you really feel that connection going back and forth between the two locations and how abstract it is in our normal lives.”
You can view Andersen’s entire interview in the following video:
Additional keynotes and interviews from the conference can be found in the Strata Conference + Hadoop World NYC 2012 playlist on the O’Reilly YouTube channel.
Tip us off
News tips and suggestions are always welcome, so please send them along.