Andy Konwinski

Andy Konwinski is a postdoc in the AMPLab at UC Berkeley focused on large-scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project, which Twitter has adopted as its private cloud platform. He also worked with systems engineers and researchers at Google on Omega, their next-generation cluster scheduling system. More recently, he led the AMP Camp Big Data Bootcamp and has been contributing to the Spark project.

The future of big data with BDAS, the Berkeley Data Analytics Stack

Preview of an upcoming tutorial at Strata Santa Clara 2013

By Andy Konwinski, Ion Stoica, and Matei Zaharia

This month at Strata, the U.C. Berkeley AMPLab will be running a full day of big data tutorials. In this post, we present the motivation and vision for the Berkeley Data Analytics Stack (BDAS), along with an overview of several BDAS components we have released over the past two years, including Mesos, Spark, Spark Streaming, and Shark.

While batch processing systems like Hadoop MapReduce paved the way for organizations to ask questions about big datasets, they represent only the beginning of what users need to do with big data. More and more, users wish to move from periodically building reports about datasets to continuously using new data to make informed business decisions in real time. Achieving these goals imposes three key requirements on big data processing:

  • Low latency queries: Interactive ad-hoc queries allow data scientists to find valuable inferences faster, or to explore a larger solution space and make better decisions. Furthermore, there is an increasing need for stream processing, which allows organizations to make decisions in real time, such as detecting an SLA violation and fixing the problem before users notice, or deciding which ads to show based on users' live tweets.
  • Sophisticated analysis: People increasingly want to apply state-of-the-art algorithms, such as predictive machine learning algorithms, to make better forecasts and decisions.
  • Unification of existing data computation models: Users want to integrate interactive queries, batch processing, and stream processing to handle the ever-increasing requirements of their processing pipelines. For example, detecting anomalies in user behavior may require (1) stream processing to compare the behavior of users in real time across different segments (e.g., genre, age, location, device), (2) interactive queries to detect differences in users' daily (or weekly) behavior, and (3) batch processing to build sophisticated predictive models.
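To make the stream processing requirement concrete, here is a minimal plain-Python sketch of a sliding-window count per user segment, the kind of continuous computation a stream processor maintains for you at scale. The function name, the event stream, and the window size are all hypothetical illustrations, not part of any BDAS API:

```python
from collections import deque, Counter

def sliding_window_counts(events, window_size):
    """Yield per-key counts over the last `window_size` events --
    a toy stand-in for the windowed operators a stream processor
    (e.g., Spark Streaming) provides in fault-tolerant form."""
    window = deque(maxlen=window_size)  # old events fall off automatically
    for key in events:
        window.append(key)
        yield Counter(window)

# Hypothetical stream of ad impressions tagged by device segment.
stream = ["mobile", "desktop", "mobile", "mobile", "desktop"]
snapshots = list(sliding_window_counts(stream, window_size=3))
```

A real streaming system runs this kind of logic continuously over unbounded input and across many machines; the sketch only shows the shape of the computation.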

In response to the above requirements, more than three years ago we began building BDAS.

One of the key technology trends that has driven the design and implementation of BDAS is the dramatic impact of Moore's law on memory capacities and prices. During the past five years alone (between February 2007 and February 2012), memory prices have dropped by 16x. Today, servers with hundreds of GB of RAM are not uncommon. This trend led us to aggressively pursue the design of in-memory processing systems to speed up data analysis and enable complex computations. In addition, BDAS can also be substantially faster even when the data resides on disk.

We are still in the early days of building out BDAS, but already we’ve taken large strides towards our vision. We’ve released several key open source components of the stack:

  • Spark, a computation engine that runs on top of the popular HDFS and efficiently supports iterative processing (e.g., machine learning algorithms) and interactive queries. Spark provides an easy-to-program interface available in Java, Python, and Scala.
  • Spark Streaming, a new component of Spark that provides highly scalable, fault-tolerant stream processing. With this functionality, Spark provides integrated support for all major computation models: batch, interactive, and streaming.
  • Shark, a large-scale data warehouse system that runs on top of Spark and is backward-compatible with Apache Hive, allowing users to run unmodified Hive queries on existing Hive warehouses. Shark can run Hive queries up to 100x faster when the data fits in memory, and 5-10x faster when the data is stored on disk.
  • Mesos, a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.
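To give a feel for Spark's programming model, here is the classic word count written in the shape of Spark's operator chain (flatMap, then map, then reduceByKey), but simulated on an in-memory Python list so it runs without a cluster. The input lines are made up for illustration; only the operator names mirror Spark's actual API:

```python
from collections import Counter

# Hypothetical input; in Spark this would come from e.g. a file in HDFS.
lines = ["big data with BDAS", "big data at scale"]

# flatMap: split each line into words.
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey(+): sum the counts for each word.
counts = Counter()
for w, n in pairs:
    counts[w] += n
```

In real Spark code the same pipeline is a short chain of calls on a distributed dataset, and the engine partitions the work across the cluster; this sketch only shows the dataflow.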

How are people using BDAS?

So far, we have seen cluster operators using Mesos to run multiple frameworks (including Hadoop, Storm, and Spark) on the same cluster, and analysts and developers using Spark and Shark for a variety of analytics and statistical learning applications, mostly on Hadoop and Hive data. Spark has enabled new applications that the Hadoop ecosystem does not support well, such as interactive dashboards where users can drill into data. Companies that have talked publicly about their use of BDAS components include Twitter, ClearStory Data, Conviva, Quantifind, and Yahoo!

At the upcoming Strata conference in Santa Clara, we will run two tutorials on the BDAS stack. The first will be an introduction to BDAS featuring Spark, Spark Streaming, and Shark. In the second, attendees will get hands-on experience using Spark, Spark Streaming, and Shark to do real data analysis. We look forward to seeing you there!
