ENTRIES TAGGED "Berkeley Data Analytics Stack"

Four steps to analyzing big data with Spark

By Andy Konwinski, Ion Stoica, and Matei Zaharia

In the UC Berkeley AMPLab, we have embarked on a six year project to build a powerful next generation big data analytics platform: the Berkeley Data Analytics Stack (BDAS). We have already released several components of BDAS including Spark, a fast distributed in-memory analytics engine, and in February we ran a sold out tutorial at the Strata conference in Santa Clara teaching attendees how to use Spark and other components of the BDAS stack.

In this blog post we will walk through four steps to getting hands-on using Spark to analyze real data. For an overview of the motivation and key components of BDAS, check out our previous Strata blog post.

Read more…

Comment: 1 |

The future of big data with BDAS, the Berkeley Data Analytics Stack

Preview of an upcoming tutorial at Strata Santa Clara 2013

By Andy KonwinskiIon Stoica, and Matei Zaharia

This month at Strata, the U.C. Berkeley AMPLab will be running a full day of big data tutorials.In this post, we present the motivation and vision for the Berkeley Data Analytics Stack (BDAS), and an overview of several BDAS components that we released over the past two years, including Mesos, Spark, Spark Streaming, and Shark.

While batch processing systems like Hadoop MapReduce paved the way for organizations to ask questions about big datasets, they represent only the beginning of what users need to do with big data. More and more, users wish to move from periodically building reports about datasets to continuously using new data to make informed business decisions in real-time. Achieving these goals imposes three key requirements on big data processing:

  • Low latency queries: Interactive ad-hoc queries allows data scientists to find valuable inferences faster, or explore a larger solution space to make better decisions. Furthermore, there is an increasing need for stream processing, as this allows organizations to make decisions in real-time, such as detecting an SLA violation and fixing the problem before the users notice, or deciding what ads to show based on user’s live tweets.
  • Sophisticated analysis: People are increasingly looking to use new state of art algorithms, such as predictive machine learning algorithms, to make better forecasts and decisions.
  • Unification of existing data computation models: Users want to integrate interactive queries, batch, and streaming processing to handle the ever increasing requirements of their processing pipelines. For example, detecting anomalies in user behavior may require (1) stream processing to compare the behavior of users in real-time across different segments (e.g., genre, ages, location, device), (2) interactive queries to detect differences in user’s daily (or weekly) behavior, and (3) batch processing to build sophisticated predictive models.

In response to the above requirements, more than three years ago we began building BDAS.

Read more…

Comment |