Strata SC 2014 Session Postmortem
In February, GraphLab took a road trip to Strata, a Big Data conference organized by O’Reilly. It was a gathering of close to 3100 people–engineers, business folks, industry evangelists, and data scientists. We had a lot of fun meeting and socializing with our peers and customers. Amidst all the conference excitement, we presented two talks. Carlos Guestrin, our intrepid CEO, held a tutorial on large-scale machine learning. I gave a talk in the Hardcore Data Science track.
Given the diversity of the audience, this was a difficult talk to pin down. After banging my head against the wall for some time, I decided to go with what interests *me*. As a machine learning researcher and an industry observer, I’ve always puzzled over these questions: What exactly is Big Data? What kind of tools do we really need to build? How? Big Data discussions often span a bewildering spectrum of topics. At one end of the spectrum, people talk about Big Data, data processing, data cleaning, and simple analytics. At the other end, people talk about complex machine learning models. There is a disconnect. There is something in between that is seldom talked about, and yet is crucial for efficient analysis: data structures.
Data structures are the glue between data and algorithms. Raw data must be turned into data structures–whether in memory or on disk–before they can be operated on. Algorithms depend on the underlying data structures to support their computation needs. An efficient implementation of the right data structure can be the key to efficient analysis. GraphLab is known for its distributed graphs. But graphs are not the whole story. Many algorithms are indeed naturally situated on top of graphs: PageRank, label propagation, and Gibbs sampling are but a few examples. But many other algorithms, such as stochastic gradient descent and decision tree learning, are more amenable to flat tables. Furthermore, raw data often comes in the form of logs, which can be easily translated into flat tables. With GraphLab’s upcoming offering of SFrames, we are now handling large-scale flat tables as well as graphs.
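To make the contrast concrete, here is a minimal sketch of the same raw log turned into both a flat table and a graph. It uses plain Python rather than GraphLab's API; the log rows and every function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical rating log: each row is (user, item, rating).
log_rows = [
    ("alice", "item1", 5.0),
    ("alice", "item2", 3.0),
    ("bob", "item1", 4.0),
    ("carol", "item2", 1.0),
]

def to_flat_table(rows):
    """Build a column-oriented flat table: one list per column."""
    users, items, ratings = zip(*rows)
    return {"user": list(users), "item": list(items), "rating": list(ratings)}

def to_graph(rows):
    """Build an undirected bipartite graph as an adjacency list."""
    adjacency = defaultdict(set)
    for user, item, _ in rows:
        adjacency[user].add(item)
        adjacency[item].add(user)
    return adjacency

def mean_rating(table):
    """Table-style computation: a single scan over one column."""
    return sum(table["rating"]) / len(table["rating"])

def neighbors_of_neighbors(graph, node):
    """Graph-style computation: follow edges two hops out from a node."""
    two_hop = set()
    for neighbor in graph[node]:
        two_hop |= graph[neighbor]
    two_hop.discard(node)
    return two_hop

table = to_flat_table(log_rows)
graph = to_graph(log_rows)
```

The column scan in `mean_rating` is the kind of access pattern a flat table serves well, while `neighbors_of_neighbors` depends on cheap edge traversal, which is what a graph structure provides.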
So that was my talk. I talked about data, I talked about algorithms, and I talked about what it takes to go from data to analysis using algorithms. It felt supremely satisfying to unite the two ends of the spectrum. Apparently I wasn’t the only one. The talk struck a chord with the audience. Many people came up afterwards, eager to learn more. What algorithms are more suitable for graphs? How should one pick between the two? What metrics might one use? It was great to see people becoming interested in the messy details of tool building.
To be honest, data structures was one of my least favorite subjects in college. It seemed so dry and abstract … and complicated! But when we take the perspective of the interplay of raw data and algorithms, the subject comes alive. One person came up to me afterwards and said “I’m just getting started with data science. Thanks for making a difficult subject accessible!” That comment alone made it all worth the effort. At GraphLab, this is the kind of stuff that we live and breathe every day. For each algorithm and each data set, we weigh the alternatives and implement the most suitable data structures. We do the dirty work so that others don’t have to.
Editor’s Note: A version of this post appeared previously on the GraphLab Blog
Insight from a Strata Santa Clara 2014 session
When you think about what goes into winning a Nobel Prize in a field like economics, it’s a lot like machine learning. In order to make a breakthrough, you need to identify an interesting theory for explaining the world, test your theory in practice to see if it holds up, and if it does, you’ve got a potential winner. The bigger and more significant the issue addressed by your theory, the more likely you are to win the prize.
In the world of business, there’s no bigger issue than helping a company be more successful, and that usually hinges on helping it deliver its products to those who need them. This is why I like to describe my company, SalesPredict, as helping our customers win the Nobel Prize in business, if such a thing existed.
Focusing attention on the present lets organizations pursue existing opportunities as opposed to projected ones
Slow and Unaware
It was 2005. The war in Iraq was raging. Many of us in the national security R&D community were developing responses to the deadliest threat facing U.S. soldiers: the improvised explosive device (IED). From the perspective of the U.S. military, the unthinkable was happening each and every day. The world’s most technologically advanced military was being dealt significant blows by insurgents making crude weapons from limited resources. How was this even possible?
The war exposed the limits of our unwavering faith in technology. We depended heavily on technology to provide us the advantage in an environment we did not understand. When that failed, we were slow to learn. Meanwhile the losses continued. We were being disrupted by a patient, persistent organization that rapidly experimented and adapted to conditions on the ground.
To regain the advantage, we needed to start by asking different questions. We needed to shift our focus from the devices that were destroying U.S. armored vehicles to the people responsible for building and deploying the weapons. This motivated new approaches to collect data that could expose elements of the insurgent network.
New organizations and modes of operation were also required to act swiftly when discoveries were made. By integrating intelligence and special operations capabilities into a single organization with crisp objectives and responsive leadership, the U.S. dramatically accelerated its ability to disrupt insurgent operations. Rapid orientation and action were key in this dynamic environment where opportunities persisted for an often unknown and very limited period of time.
This story holds important and underappreciated lessons that apply to the challenges numerous organizations face today. The ability to collect, store, and process large volumes of data doesn’t confer advantage by default. It’s still common to fixate on the wrong questions and fail to recover quickly when mistakes are made. To accelerate organizational learning with data, we need to think carefully about our objectives and have realistic expectations about what insights we can derive from measurement and analysis.
Insights from a business executive and law professor
If you develop software or manage databases, you’re probably at the point now where the phrase “Big Data” makes you roll your eyes. Yes, it’s hyped quite a lot these days. But, overexposed or not, the Big Data revolution raises a bunch of ethical issues related to privacy, confidentiality, transparency and identity. Who owns all that data that you’re analyzing? Are there limits to what kinds of inferences you can make, or what decisions can be made about people based on those inferences? Perhaps you’ve wondered about this yourself.
We’re obsessed with these questions. We’re a business executive and a law professor who’ve written about them a lot, though our audience is usually lawyers. Because engineers are the ones who confront these questions on a daily basis, we think it’s essential to talk about these issues in the context of software development.
While there’s nothing particularly new about the analytics conducted in big data, the scale and ease with which it can all be done today change the ethical framework of data analysis. Developers today can tap into remarkably varied and far-flung data sources. Just a few years ago, this kind of access would have been hard to imagine. The problem is that our ability to reveal patterns and new knowledge from previously unexamined troves of data is moving faster than our current legal and ethical guidelines can manage. We can now do things that were impossible a few years ago, and we’ve driven off the existing ethical and legal maps. If we fail to preserve the values we care about in our new digital society, then our big data capabilities risk abandoning these values for the sake of innovation and expediency.
By David Andrzejewski of SumoLogic
A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions; two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data that is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.
More data, more problems
This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the decoupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data”. In particular, two common motifs among the talks at Strata were the difficulties around:
- mechanics: the technical details of data collection, storage, and analysis
- semantics: extracting understandable and actionable information from the data deluge
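As a small illustration of the two motifs, the sketch below parses raw log lines into structured records (mechanics) and then reduces them to an error count per component (semantics). The log format, sample lines, and function names are all assumptions made for this example; they are not taken from any talk in the track.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<component>[\w.-]+): (?P<message>.*)$"
)

# Hypothetical sample of machine-generated log lines.
sample_lines = [
    "2014-02-11T10:00:01Z INFO web: request served in 12ms",
    "2014-02-11T10:00:02Z ERROR db: connection timed out",
    "2014-02-11T10:00:03Z ERROR db: connection timed out",
    "garbage line",
]

def parse_lines(lines):
    """Mechanics: turn raw text into structured records, skipping malformed lines."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records

def count_errors_by_component(records):
    """Semantics: summarize the records as error counts per component."""
    return Counter(r["component"] for r in records if r["level"] == "ERROR")

records = parse_lines(sample_lines)
error_counts = count_errors_by_component(records)
```

At real machine-data volumes the same two steps would run on a distributed pipeline, but the division of labor stays the same.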
Human-centered design techniques from an ideation workshop
By Bo Peng and Aaron Wolf of Datascope Analytics
At Datascope Analytics, our ideation workshop combines elements of human-centered design to develop innovative and valuable ideas, solutions, and strategies for our clients. From our workshop experience, we’ve developed a few key techniques that have enabled successful communication and collaboration. The workshop moves through several milestones: the departure point, the dream view, and curation with gold star voting, among others. In addition, we strive to support cultural goals throughout the workshop’s duration: creating an environment that spurs creativity and encourages wild ideas, and maintaining a mediator role. These techniques have thus far proven successful in providing innovative and actionable solutions for our clients.
Today, it’s shocking (and honestly exciting) how much of my daily experience is determined by a recommender system. These systems drive amazing experiences everywhere, telling me where to eat, what to listen to, what to watch, what to read, and even who I should be friends with. Furthermore, information overload is making recommender systems indispensable, since I can’t find what I want on the web simply using keyword search tools. Recommenders are behind the success of industry leaders like Netflix, Google, Pandora, eHarmony, Facebook, and Amazon. It’s no surprise companies want to integrate recommender systems with their own online experiences. However, as I talk to team after team of smart industry engineers, it has become clear that building and managing these systems is usually a bit out of reach, especially given all the other demands on the team’s time.
Tools, Trends, What Pays (and What Doesn't) for Data Professionals
There is no shortage of news about the importance of data or the career opportunities within data. Yet a discussion of modern data tools can help us understand what the current data evolution is all about, and it can also be used as a guide for those considering stepping into the data space or progressing within it.
In our report, 2013 Data Science Salary Survey, we make our own data-driven contribution to the conversation. We surveyed attendees of the Strata Conference in New York and Santa Clara, California, about tool usage and salary.
Strata attendees span a wide spectrum within the data world: Hadoop experts and business leaders, software developers and analysts. By no means does everyone use data on a “Big” scale, but almost all attendees have some technical aspect to their role. Strata attendees may not represent a random sample of all professionals working with data, but they do represent a broad slice of the population. If there is a bias, it is likely toward the forefront of the data space, with attendees using the newest tools (or being very interested in learning about them).
A Sneak Peek
Data products are the driving force behind new multibillion-dollar companies, and many of the things we do on a day-to-day basis have machine learning algorithms behind them. Unfortunately, even though data science is a concept invented in the 21st century, in practice the state of data science is more similar to software engineering in the mid-20th century.
The pioneers of data science did a great job of making it very accessible and fairly easy to pick up, but since its beginning circa 2005, not much effort has been made to bring it up to par with modern software engineering practices. Machine learning code is still code, and like any software that reaches production environments it should follow standard software engineering practices, like modularity, maintainability and quality (among many others).
Sneak peek at an upcoming tutorial at Strata Santa Clara 2014
Apache Hadoop 2.0 represents a generational shift in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a significantly more powerful platform – one that takes Hadoop beyond merely batch applications to taking its position as a ‘data operating system’ where HDFS is the file system and YARN is the operating system.
YARN is a re-architecture of Hadoop that allows multiple applications to run on the same platform. With YARN, applications run “in” Hadoop, instead of “on” Hadoop: