Interactive Big Data analysis using approximate answers

As data sizes continue to grow, interactive query systems may start adopting the sampling approach central to BlinkDB

Interactive query analysis for Hadoop-scale data has recently attracted the attention of many companies and open source developers. Examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions combine distributed computing with other techniques, including data co-partitioning, caching (in main memory), runtime code generation, and columnar storage.

One approach that hasn’t been exploited as much is sampling. By this I mean using samples to generate approximate answers and speed up execution. Database researchers have written papers on approximate answers, but few working (downloadable) systems are actually built on this approach.

Approximate query engine from U.C. Berkeley’s Amplab
An interesting open source database released yesterday uses sampling to scale to big data. BlinkDB is a massively parallel, approximate query system from UC Berkeley’s Amplab. It uses a series of data samples to generate approximate answers. Users compose queries by specifying either error bounds or time constraints, and BlinkDB uses sufficiently large random samples to produce answers that satisfy them. Because the random samples are stored in memory, BlinkDB is able to provide interactive response times.
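The trade-off between sample size and error bound can be illustrated with a minimal sketch. This is not BlinkDB's algorithm or API; it is a plain uniform-sampling estimate of an average using only the Python standard library. The function names (`approximate_mean`, `sample_size_for_error`), the pilot-sample step, and the normal-approximation confidence interval are all assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin); estimate +/- margin is an approximate
    95% confidence interval when z=1.96 (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


def sample_size_for_error(values, max_margin, pilot_size=1000, z=1.96, seed=0):
    """Pick a sample size whose margin of error is roughly `max_margin`.

    A small pilot sample (hypothetical step, for illustration only) is used
    to estimate the standard deviation of the data.
    """
    rng = random.Random(seed)
    pilot = rng.sample(values, min(pilot_size, len(values)))
    sigma = statistics.stdev(pilot)
    needed = math.ceil((z * sigma / max_margin) ** 2)
    # Never ask for more rows than exist, and keep at least 2 for stdev().
    return min(len(values), max(needed, 2))


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(100.0, 15.0) for _ in range(200_000)]
    n = sample_size_for_error(data, max_margin=0.5)
    estimate, margin = approximate_mean(data, n)
    print(f"sampled {n} of {len(data)} rows: {estimate:.2f} +/- {margin:.2f}")
```

In this sketch the user states the error bound and the sample size follows from it. A time constraint could be handled the other way around, by fixing the sample size that fits the time budget and reporting whatever margin results.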



Four data themes to watch from Strata + Hadoop World 2012

In-memory data storage, SQL, data preparation and asking the right questions all emerged as key trends at Strata + Hadoop World.

At our successful Strata + Hadoop World conference (including successfully avoiding Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases. Others will have their own key takeaways from the conference.

1. In-memory data storage for faster queries and visualization

Interactive or real-time query for large datasets is seen as a key to analyst productivity (real-time as in query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). The existing large-scale data management schemes aren’t fast enough, and they reduce analytical effectiveness when users can’t explore the data by quickly iterating through various query schemes. We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, alongside new tools like Cloudera’s Impala (announced at the conference), UC Berkeley AMPLab’s Spark, SAP Hana, and Platfora.

We saw this coming a few years ago when analysts we pay attention to started building their own in-memory data store sandboxes, often in key/value data management tools like Redis, when trying to make sense of new, large-scale data stores. I know from my own work that there’s no better way to explore a new or unstructured data set than to be able to quickly run off a series of iterative queries, each informed by the last.

Big data, but with a familiar face

Martin Hall explains how Karmasphere is integrating Hadoop into enterprises.

You don't have to throw away existing investments in skills and tools to use Hadoop for big data, as Karmasphere's Martin Hall explains.

Joe Stump on data, APIs, and why location is up for grabs

The SimpleGeo CTO and former Digg architect discusses NoSQL and location's future.

I recently had a long conversation with Joe Stump, CTO of SimpleGeo, about location, geodata, and the NoSQL movement. Stump, who was formerly lead architect at Digg, had a lot to say. Here are the highlights; you can find the full interview elsewhere on Radar.

Counting Unique Users in Real-time with Streaming Databases

As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily) and in many cases using a random sample from relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can mount A/B tests or referral analysis to dynamically adjust their campaigns.
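One way to keep a running unique-user count over a stream without storing every user ID is a small probabilistic sketch. The excerpt does not say which technique a streaming database would use; the example below swaps in a k-minimum-values (KMV) sketch purely as an illustration. The class name `KMVSketch`, the choice of SHA-1, and the default `k` are assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """Approximate distinct counter that keeps only the k smallest hash values."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hash values currently kept, to skip repeats

    @staticmethod
    def _hash(item):
        # Map a string to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # this user is already reflected in the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct items: count is exact
        kth_smallest = -self._heap[0]
        return int((self.k - 1) / kth_smallest)


if __name__ == "__main__":
    sketch = KMVSketch()
    for visit in range(150_000):
        sketch.add(f"user-{visit % 50_000}")  # 50,000 users, 3 visits each
    print("estimated unique users:", sketch.estimate())
```

The estimate can be read at any point in the stream, and memory stays bounded by `k` regardless of traffic. The expected relative error shrinks roughly as one over the square root of `k`, so a larger `k` buys accuracy at the cost of memory.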

Big Data: Technologies and Techniques for Large-Scale Data

Our belief that proficiency in managing and analyzing large amounts of data distinguishes market-leading companies led to a recent report designed to help users understand the different large-scale data management techniques. Our report on Big Data Technologies was the result of interviews with over thirty experts, including research scientists, (open-source) hackers, vendors, data analysts, and entrepreneurs. I recently sat down with my co-author, Roger Magoulas (Director of Research at O’Reilly), who agreed to talk about our report and Big Data in general.
