Bridging the Divide Between Big Data and (Big) Algorithms

Strata SC 2014 Session Postmortem

By Alice Zheng

In February, GraphLab took a road trip to Strata, a Big Data conference organized by O’Reilly. It was a gathering of close to 3100 people–engineers, business folks, industry evangelists, and data scientists. We had a lot of fun meeting and socializing with our peers and customers. Amidst all the conference excitement, we presented two talks. Carlos Guestrin, our intrepid CEO, held a tutorial on large-scale machine learning. I gave a talk in the Hardcore Data Science track.

Given the diversity of the audience, this was a difficult talk to pin down. After banging my head against the wall for some time, I decided to go with what interests *me*. As a machine learning researcher and an industry observer, I’ve always puzzled over these questions: What exactly is Big Data? What kind of tools do we really need to build? How? Big Data discussions often span a bewildering spectrum of topics. At one end of the spectrum, people talk about Big Data, data processing, data cleaning, and simple analytics. At the other end, people talk about complex machine learning models. There is a disconnect. There is something in between that is seldom talked about, and yet is crucial for efficient analysis: data structures.

Photo provided courtesy of Alice Zheng.

Data structures are the glue between data and algorithms. Raw data must be turned into data structures–whether in memory or on disk–before they can be operated on. Algorithms depend on the underlying data structures to support their computation needs. An efficient implementation of the right data structure can be the key to efficient analysis. GraphLab is known for its distributed graphs. But graphs are not the whole story. Many algorithms are indeed naturally situated on top of graphs: PageRank, label propagation, and Gibbs sampling are but a few examples. But many other algorithms, such as stochastic gradient descent and decision tree learning, are more amenable to flat tables. Furthermore, raw data often comes in the form of logs, which can be easily translated into flat tables. With GraphLab’s upcoming offering of SFrames, we are now handling large-scale flat tables as well as graphs.
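To make the contrast concrete, here is a minimal sketch in plain Python. It is not GraphLab or SFrame code; the function names `pagerank` and `sgd_linear`, the damping factor, the learning rate, and the toy data are all assumptions made for illustration. The first function needs a graph (an adjacency list) because every update asks "who links to whom?". The second only needs to scan a flat table one row at a time.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency-list graph (illustrative)."""
    out_links = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out_links[src].append(dst)
        nodes.update((src, dst))

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each node passes its rank along its outgoing edges.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def sgd_linear(rows, learning_rate=0.05, epochs=500):
    """Fit y ~ w * x + b by stochastic gradient descent over table rows."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in rows:
            # One row is all the algorithm needs to take a step.
            error = (w * x + b) - y
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Graph-shaped input: a list of (source, destination) edges.
toy_edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(toy_edges)

# Table-shaped input: rows of (feature, label), here generated from y = 2x + 1.
toy_rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = sgd_linear(toy_rows)
```

The access patterns are the point. PageRank repeatedly follows edges from a node to its neighbors, so it wants a structure that makes neighbor lookup cheap. Stochastic gradient descent touches rows independently and in sequence, so a flat table that can be streamed from memory or disk suits it well.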

So that was my talk. I talked about data, I talked about algorithms, and I talked about what it takes to go from data to analysis using algorithms. It felt supremely satisfying to unite the two ends of the spectrum. Apparently I wasn’t the only one. The talk struck a chord with the audience. Many people came up afterwards, eager to learn more. What algorithms are more suitable for graphs? How should one pick between the two? What metrics might one use? It was great to see people becoming interested in the messy details of tool building.

To be honest, data structures was one of my least favorite subjects in college. It seemed so dry and abstract … and complicated! But when we take the perspective of the interplay of raw data and algorithms, the subject comes alive. One person came up to me afterwards and said, “I’m just getting started with data science. Thanks for making a difficult subject accessible!” That comment alone made it all worth the effort. At GraphLab, this is the kind of stuff that we live and breathe every day. For each algorithm and each data set, we weigh the alternatives and implement the most suitable data structures. We do the dirty work so that others don’t have to.

Editor’s Note: A version of this post appeared previously on the GraphLab Blog
