ENTRIES TAGGED "real-time query"
As data sizes continue to grow, interactive query systems may start adopting the sampling approach central to BlinkDB
Interactive query analysis for (Hadoop scale data) has recently attracted the attention of many companies and open source developers – some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing, and a combination of other techniques including data co-partitioning, caching (into main memory), runtime code generation, and columnar storage.
One approach that hasn’t been exploited as much is sampling. By this I mean employing samples to generate approximate answers, and speed up execution. Database researchers have written papers on approximate answers, but few working (downloadable) systems are actually built on this approach.
Approximate query engine from U.C. Berkeley’s Amplab
An interesting, open source database released yesterday0 uses sampling to scale to big data. BlinkDB is a massively-parallel, approximate query system from UC Berkeley’s Amplab. It uses a series of data samples to generate approximate answers. Users compose queries by specifying either error bounds or time constraints, BlinkDB uses sufficiently large random samples to produce answers. Because random samples are stored in memory1, BlinkDB is able to provide interactive response times:
Data stores are rolling out easy-to-use analysis tools
Originated by the NSA, Apache Accumulo is a BigTable inspired data store known for being highly scalable and for its interesting security model. Federal agencies and Defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security to control access to values stored in individual cells1.
What Accumulo was lacking were easy-to-use, standard analytic engines that allow users to interact with data. The release of Sqrrl Enterprise this past week fills that gap. Sqrrl Enterprise provides an initial set of analytic engines for the Accumulo ecosystem2. It includes support for interactive SQL, fulltext search, and queries over graph data. Each of these engines takes into account security labels placed on data: since every data object ingested into Sqrrl has a security label, (query & analytic) results incorporate those access levels. Analysts interact with data as they normally would. For example Sqrrl’s indexing technology accounts for security labels, and search queries are written in standard Lucene syntax. Reminiscent of the Phoenix project for HBase3, SQL queries4 in Sqrrl are converted into optimized Accumulo iterators.
Shark is 100X faster than Hive for SQL, and 100X faster than Hadoop for machine-learning
Hadoop’s strength is in batch processing, MapReduce isn’t particularly suited for interactive/adhoc queries. Real-time1 SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. Over the last few months a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and fast MPP database clusters are not going away, but there is growing interest in moving many interactive SQL tasks into systems that coexist on the same cluster with Hadoop.
Having a Hadoop cluster support fast/interactive SQL queries dates back a few years to HadoopDB, an open source project out of Yale. The creators of HadoopDB have since started a commercial software company (Hadapt) to build a system that unites Hadoop/MapReduce and SQL. In Hadapt, a (Postgres) database is placed in nodes of a Hadoop cluster, resulting in a system2 that can use MapReduce, SQL, and search (Solr). Now on version 2.0, Hadapt is a fault-tolerant system that comes with analytic functions (HDK) that one can use via SQL. Read more…
Cloudera ventures into real-time queries with Impala, data centers are the new landfill, and Jesper Andersen looks at the relationship between art and data.
Here are a few stories from the data space that caught my attention this week.
Cloudera’s Impala takes Hadoop queries into real-time
Cloudera ventured into real-time Hadoop querying this week, opening up its Impala software platform. As Derrick Harris reports at GigaOm, Impala — an SQL query engine — doesn’t rely on MapReduce, making it faster than tools such as Hive. Cloudera estimates its queries run 10 times faster than Hive, and Charles Zedlewski, Cloudera’s cloud VP of products, told Harris that “small queries can run in less than a second.”
Harris notes that Zedlewski pointed out that Impala wasn’t designed to replace business intelligence (BI) tools, and that “Cloudera isn’t interested in selling BI or other analytic applications.” Rather, Impala serves as the execution engine, still relying on software from Cloudera partners — Zedlewski told Harris, “We’re sticking to our knitting as a platform vendor.”
Joab Jackson at PC World reports that “[e]ventually, Impala will be the basis of a Cloudera commercial offering, called the Cloudera Enterprise RTQ (Real-Time Query), though the company has not specified a release date.”
Impala has plenty of competition on this playing field, which Harris also covers, and he notes the significance of all the recent Hadoop innovation:
“I can’t underscore enough how critical all of this innovation is for Hadoop, which in order to add substance to its unparalleled hype needed to become far more useful to far more users. But the sudden shift from Hadoop as a batch-processing engine built on MapReduce into an ad hoc SQL querying engine might leave industry analysts and even Hadoop users scratching their heads.”
You can read more from Harris’ piece here and Jackson’s piece here. Wired also has an interesting piece on Impala, covering the Google F1 database upon which it is based and the Googler Cloudera hired away to help build it.
(Cloudera CEO Mike Olson discussed Impala, Hadoop and the importance of real-time at this week’s Strata Conference + Hadoop World.)