At our successful Strata + Hadoop World conference (including successfully avoiding Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases. Others will have a different take on their own key takeaways from the conference.
1. In-memory data storage for faster queries and visualization
Interactive or real-time query of large datasets is seen as a key to analyst productivity (“real-time” meaning query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). Existing large-scale data management schemes aren’t fast enough, and they reduce analytical effectiveness when users can’t explore the data by quickly iterating through queries. We see companies with large data stores building out their own in-memory tools (e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix), along with new tools such as Impala, which Cloudera announced at the conference; Spark, from UC Berkeley’s AMPLab; SAP HANA; and Platfora.
We saw this coming a few years ago, when analysts we pay attention to started building their own in-memory data store sandboxes, often in key/value data management tools like Redis, to make sense of new, large-scale data stores. I know from my own work that there’s no better way to explore a new or unstructured dataset than to quickly run a series of iterative queries, each informed by the last.
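The iterative sandbox pattern can be sketched in a few lines. This is a minimal illustration, with a plain Python dict standing in for a key/value store like Redis; the event records and field names are invented for the example.

```python
from collections import Counter

# Hypothetical sample of semi-structured event records pulled from a
# larger store; the fields ("user", "action", "ms") are invented.
events = [
    {"user": "a", "action": "view", "ms": 120},
    {"user": "b", "action": "click", "ms": 340},
    {"user": "a", "action": "click", "ms": 95},
    {"user": "c", "action": "view", "ms": 210},
]

# A plain dict stands in for an in-memory key/value sandbox:
# index the sample by whatever key you want to iterate queries over.
by_action = {}
for e in events:
    by_action.setdefault(e["action"], []).append(e)

# First pass: how common is each action?
print(Counter(e["action"] for e in events))

# A follow-up query, informed by the first: latency distribution of clicks.
clicks = sorted(e["ms"] for e in by_action["click"])
print(clicks)
```

Each query here runs in milliseconds against the in-memory index, which is what keeps the analyst in the flow of exploration.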
2. SQL and SQL-like tools matter
We see Strata attendees maturing their large-scale analysis infrastructures and democratizing access to data via high-level SQL and SQL-like tools. As with in-memory data storage, both high-functioning data companies and tool vendors are working to build more SQL and SQL-like access to large-scale data stores. A common architecture mentioned at the conference couples Hadoop for data ingestion and preparation with a relational or SQL-like interface (e.g., Hive) that provides widespread access to the data. On the vendor side, we see Cloudera’s Impala, a distributed parallel SQL query engine; Hadapt, which integrates Hadoop and SQL; and the AMPLab’s Shark, a Hive-compatible interface to Spark.
There’s still a need for constructs like MapReduce and Scala to support parallel programming algorithms, and HDFS seems the likely foundation for all manner of distributed data processing and tools for the next few years. However, there is too much existing investment in staff who know SQL and in SQL-oriented tools for the trend toward SQL-like access to large-scale, distributed data stores to be blunted.
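The appeal of the prep-then-query architecture is that once data lands in a SQL-like interface, analysts can ask aggregate questions without writing parallel jobs. As a minimal stand-in for a Hive-style interface, here is the same pattern against an in-memory SQLite table; the table and rows are invented for illustration.

```python
import sqlite3

# An in-memory SQLite table stands in for a SQL-like interface (e.g., Hive)
# over batch-prepped data; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("home", 120), ("docs", 45), ("home", 80), ("blog", 60)],
)

# Analysts who already know SQL can ask aggregate questions directly,
# without writing a MapReduce job for each one.
rows = conn.execute(
    "SELECT page, SUM(views) FROM pageviews "
    "GROUP BY page ORDER BY SUM(views) DESC"
).fetchall()
print(rows)  # home first, with 200 total views
```

The point is not SQLite itself but the division of labor: distributed tools handle ingestion and preparation, while a familiar declarative query layer serves the wider analyst population.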
3. The 80% rule for data preparation
Echoing a theme DJ Patil highlights in “Data Jujitsu,” many of the data analysts at Strata emphasized that 80% of analysis is data preparation — a ratio we see in our own data work. By data preparation, we mean acquiring, cleaning, transforming — including standardizing and normalizing values — and organizing data, as well as preparing training data for machine learning and other algorithms. It’s hard work, and it isn’t the sexy part of the data science ecosystem, but these efforts are necessary to get reliable and effective results. In his “Of Rocket Ships and Washing Machines” conference keynote, Joe Hellerstein used the analogy of washing machines’ contribution to productivity, compared to rocket ships, to nicely illustrate the importance of data prep to the data space.
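A small sketch of what the standardizing and normalizing step looks like in practice; the records, field names, and cleanup rules below are invented for illustration.

```python
import re

# Hypothetical messy records from three different sources.
raw = [
    {"name": "  Alice ", "city": "NYC", "revenue": "$1,200"},
    {"name": "BOB", "city": "new york", "revenue": "950"},
    {"name": "carol", "city": "N.Y.C.", "revenue": "$2,000"},
]

# Map inconsistent spellings to one canonical value.
CITY_ALIASES = {"nyc": "New York", "new york": "New York", "n.y.c.": "New York"}

def clean(record):
    return {
        # Normalize whitespace and letter case.
        "name": record["name"].strip().title(),
        # Standardize city names via the alias table.
        "city": CITY_ALIASES.get(record["city"].lower(), record["city"]),
        # Strip currency formatting so values are comparable numbers.
        "revenue": float(re.sub(r"[$,]", "", record["revenue"])),
    }

cleaned = [clean(r) for r in raw]
print(cleaned[0])  # {'name': 'Alice', 'city': 'New York', 'revenue': 1200.0}
```

Multiply this by dozens of fields and sources, each with its own quirks, and the 80% figure stops looking surprising.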
The more completely organizations understand the role and requirements of data prep as a key component of the analysis workflow, the more realistic they can be about what to expect from a data group and why they should invest in data prep productivity. We expect to hear more about better tools and techniques for improving data prep productivity over the next few years.
4. Asking the right question
Effective analysis depends more on asking the right question or designing a good experiment than on tools and techniques. Large datasets provide the opportunity to take advantage of the “Unreasonable Effectiveness of Data” (Halevy, Norvig, Pereira), i.e., effective results from coupling large datasets with relatively simple algorithms. We think organizations that want to improve their ability to deploy data as an asset are best served by emphasizing the “art” of asking good questions and designing experiments. Effective analysis is difficult to “buy”; it typically requires adapting the culture toward learning, experimenting, and quantitative understanding.
A conference sub-theme of asking the right question raised another issue: how do we train and enable data talent? While no easy answers were offered, there seemed to be some agreement that:
- Curious folks can make great use of, and build on, simple tools and techniques to become more effective data analysts.
- Storytelling is a key capability for making good analysis useful to an organization.
Weaving the themes together
Looking across all four themes and other topics from Strata, we see the data ecosystem maturing and coalescing around analytic productivity as a prime driver of change: more focus on better, faster access to data; more effective analysis; and improved communication and sharing of results.
Let us know via the comments or email what themes you noticed and what you found most intriguing about the conference.
Strata + Hadoop World sessions of note
While there were many outstanding keynotes and sessions at Strata, here’s a list of a few that best informed the themes described above (full keynote presentations are posted below; session videos are available in the Strata + Hadoop World complete video compilation):
Big Answers — Mike Olson (Cloudera)
A look at the Impala real-time distributed SQL query engine and addressing big social and technology challenges with data.
Beyond Hadoop: Fast Ad Hoc Queries on Big Data — Mike Driscoll and Eric Tschetter (MetaMarkets)
An introduction to Druid, an in-memory distributed data store for fast queries.
Of Rocket Ships and Washing Machines: Data Technology for the People — Joe Hellerstein (Trifacta)
Using the analogy of washing machines as having a bigger productivity and cultural impact than rocket ships, Hellerstein explained the importance of increasing data prep productivity for data scientists.
Netflix Evolving Data Science Architecture — Kurt Brown (Netflix)
By showing how Netflix pragmatically builds, learns and adapts its analytic infrastructure, Brown encapsulated how the data space has matured over the last few years — toward faster queries and more widespread access to data.
Creative Thinking and Data Science — Mike Stringer (Datascope Analytics)
Stringer provided an insightful, data-driven look at how asking the right questions is more important than tools and techniques for effective analysis.
Breeding Data Scientists — Amy O’Connor and Danielle Dean (Nokia)
A mother/daughter presentation focused on how Nokia built its internal data team — with curiosity as primary. I was inspired by how Danielle Dean (the daughter) earnestly described her self-taught immersion into tools and techniques to pragmatically improve her ability to make sense of data.
Roger Magoulas further explores these data themes in the following video: