By David Andrzejewski of SumoLogic
A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions; two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data that is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.
More data, more problems
This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the decoupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data”. In particular, two common motifs among the talks at Strata were the difficulties around:
- mechanics: the technical details of data collection, storage, and analysis
- semantics: extracting understandable and actionable information from the data deluge
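To make the second point concrete, here is a minimal sketch of turning a raw, machine-generated log line into structured, queryable fields. The log format, field names, and sample line are hypothetical, chosen purely for illustration:

```python
# A minimal sketch of the "semantics" challenge: extracting structured fields
# from a raw log line. The format and field names here are hypothetical.
import re

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)"
)

line = "2014-02-11T09:15:02Z web-01 ERROR disk usage at 91%"
match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()  # a dict of named fields, ready for indexing or querying
    print(record["timestamp"], record["host"], record["level"], record["message"])
```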
Human-centered design techniques from an ideation workshop
By Bo Peng and Aaron Wolf of Datascope Analytics
At Datascope Analytics, our ideation workshop combines elements of human-centered design to develop innovative and valuable ideas, solutions, and strategies for our clients. From our workshop experience, we’ve developed a few key techniques that enable successful communication and collaboration. We complete certain milestones during the workshop, among them the departure point, the dream view, and curation with gold star voting. In addition, we strive to support cultural goals throughout the workshop: creating an environment that spurs creativity and encourages wild ideas, and maintaining a mediator role. These techniques have so far proven successful in providing innovative and actionable solutions for our clients.
Today, it’s shocking (and honestly exciting) how much of my daily experience is determined by a recommender system. These systems drive amazing experiences everywhere, telling me where to eat, what to listen to, what to watch, what to read, and even who I should be friends with. Furthermore, information overload is making recommender systems indispensable, since I can’t find what I want on the web simply using keyword search tools. Recommenders are behind the success of industry leaders like Netflix, Google, Pandora, eHarmony, Facebook, and Amazon. It’s no surprise companies want to integrate recommender systems with their own online experiences. However, as I talk to team after team of smart industry engineers, it has become clear that building and managing these systems is usually a bit out of reach, especially given all the other demands on the team’s time.
Tools, Trends, What Pays (and What Doesn't) for Data Professionals
There is no shortage of news about the importance of data or the career opportunities within data. Yet a discussion of modern data tools can help us understand what the current data evolution is all about, and it can also serve as a guide for those considering stepping into the data space or advancing within it.
In our report, 2013 Data Science Salary Survey, we make our own data-driven contribution to the conversation. We collected survey responses from attendees of the Strata Conference in New York and Santa Clara, California, about tool usage and salary.
Strata attendees span a wide spectrum within the data world: Hadoop experts and business leaders, software developers and analysts. By no means does everyone use data on a “Big” scale, but almost all attendees have some technical aspect to their role. Strata attendees may not represent a random sample of all professionals working with data, but they do represent a broad slice of the population. If there is a bias, it is likely toward the forefront of the data space, with attendees using the newest tools (or being very interested in learning about them).
A Sneak Peek
Data products are the driving force behind new multi-billion-dollar companies, and many of the things we do on a day-to-day basis have machine learning algorithms behind them. Unfortunately, even though data science is a concept invented in the 21st century, in practice the state of data science today looks more like software engineering in the mid-20th century.
The pioneers of data science did a great job of making it accessible and fairly easy to pick up, but since its beginning circa 2005, not much effort has been made to bring it up to par with modern software engineering practices. Machine learning code is still code, and like any software that reaches production environments, it should follow standard software engineering practices such as modularity, maintainability, and quality (among many others).
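As a minimal sketch of what that might look like in practice (assuming scikit-learn; the feature data and test case are hypothetical), the model below is wrapped in a reusable, testable function rather than an ad hoc script:

```python
# A minimal sketch of treating machine learning code as regular software:
# the model lives in a reusable factory function and has a unit test.
# The library choice (scikit-learn) and the toy data are assumptions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def build_model():
    """Return an untrained, fully configured model pipeline."""
    return Pipeline([
        ("scale", StandardScaler()),    # preprocessing lives with the model
        ("clf", LogisticRegression()),  # so training and serving share one code path
    ])

def test_build_model_learns_a_simple_rule():
    """A tiny unit test: the pipeline should fit and recover an obvious split."""
    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0, 0, 1, 1]
    model = build_model()
    model.fit(X, y)
    assert list(model.predict([[0.0], [3.0]])) == [0, 1]
```

Keeping preprocessing inside the pipeline and covering it with even a trivial test is exactly the kind of modularity and maintainability that standard software engineering expects.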
Sneak peek at an upcoming tutorial at Strata Santa Clara 2014
Apache Hadoop 2.0 represents a generational shift in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a significantly more powerful platform – one that moves Hadoop beyond merely batch applications and positions it as a ‘data operating system’, where HDFS is the file system and YARN is the operating system.
YARN is a re-architecture of Hadoop that allows multiple applications to run on the same platform. With YARN, applications run “in” Hadoop, instead of “on” Hadoop.
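One way to see this multi-application view is through the ResourceManager REST API, which lists every application running in the cluster regardless of framework. The sketch below assumes a Python client and a hypothetical ResourceManager address:

```python
# A small illustration of YARN as a shared platform: applications of many
# types (MapReduce, Tez, Spark, ...) run "in" the same cluster, and the
# ResourceManager REST API exposes them all. The host/port is an assumption;
# point it at your own ResourceManager.
import json
from urllib.request import urlopen

RM_URL = "http://resourcemanager.example.com:8088"  # hypothetical address

def list_yarn_applications():
    """Fetch every application YARN knows about, regardless of framework."""
    with urlopen(RM_URL + "/ws/v1/cluster/apps") as resp:
        payload = json.load(resp)
    apps = (payload.get("apps") or {}).get("app", [])
    for app in apps:
        print(app["id"], app["applicationType"], app["state"], app["name"])

if __name__ == "__main__":
    list_yarn_applications()
```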
Skills of the Agile Data Wrangler
As data processing has become more sophisticated, there has been little progress on improving the most time-consuming and tedious parts of the pipeline: Data Transformation tasks, including discovery, structuring, and content cleaning. In standard practice, this kind of “data wrangling” requires writing idiosyncratic scripts in programming languages such as Python or R, or extensive manual editing using interactive tools such as Microsoft Excel. This standard practice has two significant negative outcomes. First, people with highly specialized skills (e.g., statistics, molecular biology, micro-economics) spend far more time on tedious data wrangling tasks than they do exercising their specialty. Second, less technical users are often unable to wrangle their own data. In both cases, significant data is often left unused because of the hurdle of transforming it into a usable shape. Sadly, when it comes to standard practice in modern data analysis, “the tedium is the message.” In our upcoming tutorial at Strata, we will survey both the sources of and solutions to the problems of Data Transformation.
Analysts must regularly transform data to make it palatable to databases, statistics packages, and visualization tools. Data sets also regularly contain missing, extreme, duplicate or erroneous values that can undermine the results of analysis. These anomalies come from various sources, including human data entry error, inconsistencies between integrated data sets, and sensor interference. Our own interviews with data analysts have found that these types of transforms constitute the most tedious component of their analytic process. Flawed analyses due to dirty data are estimated to cost billions of dollars each year. Discovering and correcting data quality issues can also be costly: some estimate cleaning dirty data to account for 80 percent of the cost of data warehousing projects.
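As a rough illustration of these routine transforms (assuming pandas; the input file and column names are hypothetical), a typical cleaning pass might drop duplicates, coerce and remove unparseable values, and flag extreme readings:

```python
# A minimal data-wrangling sketch. "readings.csv" and the column names are
# hypothetical; the steps mirror the transforms described above.
import pandas as pd

df = pd.read_csv("readings.csv")               # hypothetical input file

df = df.drop_duplicates()                      # remove duplicate rows
df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")
df = df.dropna(subset=["temperature"])         # drop missing / unparseable values

# Flag extreme values (here: more than 3 standard deviations from the mean)
mean, std = df["temperature"].mean(), df["temperature"].std()
df["is_outlier"] = (df["temperature"] - mean).abs() > 3 * std

df.to_csv("readings_clean.csv", index=False)   # ready for stats and viz tools
```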
Human judgment is at the center of successful data analysis. This statement might initially seem at odds with the current Big Data frenzy and its focus on data management and machine learning methods. But while these tools provide immense value, it is important to remember that they are just that: tools. A hammer does not a carpenter make — though it certainly helps.
Consider the words of John Tukey [1], possibly the greatest statistician of the last half-century: “Nothing — not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers — nothing can substitute here for the flexibility of the informed human mind. Accordingly, both approaches and techniques need to be structured so as to facilitate human involvement and intervention.” Tukey goes on to write: “Some implications for effective data analysis are: (1) that it is essential to have convenience of interaction of people and intermediate results and (2) that at all stages of data analysis the nature and detail of output need to be matched to the capabilities of the people who use it and want it.” Though Tukey and colleagues voiced these sentiments nearly 50 years ago, they ring even more true today. The interested analyst is at the heart of the Big Data question: how well do our tools help users ask better questions, formulate hypotheses, spot anomalies, correct errors and create improved models and visualizations? To “facilitate human involvement” across “all stages of data analysis” is a grand challenge for our age.
Making Machine Learning Accessible & Usable
Big Data may seem like a familiar concept to those working in IT, but for most executives it’s difficult to imagine just how much Big Data impacts business on a daily basis. Most companies already collect customer data, ranging from purchase habits to social media interactions, but few translate their data into actionable business insights. By applying advanced analytics to Big Data, companies can identify patterns and make predictions from huge amounts of information that a single human analyst could never see, let alone understand.
Machine Learning – the core technology behind this type of Big Data analytics – involves a collection of algorithms that are designed to uncover patterns that classical statistical algorithms often fail to detect. Procedures like k-means clustering, support vector machines, Bayes nets, and decision trees are flexible and adapt themselves to nonlinear and high-dimensional data structures. This flexibility comes with a price, however. Expert users must decide in advance on a host of parameter settings – kernel types, cluster numbers, prior probabilities, and so on. The complexity of these decisions necessarily eludes the average analyst. Furthermore, Machine Learning algorithms rest on certain assumptions that are similar to those required for classical statistical analysis. Outliers, missing values, and unusual distributions can invalidate the conclusions drawn from Machine Learning applications.
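A small sketch (using scikit-learn and synthetic data purely for illustration) shows the kind of up-front choices involved: the number of clusters for k-means, and the kernel and regularization for a support vector machine, are set by the analyst rather than learned from the data:

```python
# An illustration of the parameter decisions mentioned above. The library
# (scikit-learn) and the synthetic dataset are assumptions for this sketch.
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# k-means: the number of clusters is not learned, it is chosen in advance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Support vector machine: kernel type and regularization are chosen in advance.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(kmeans.inertia_, svm.score(X, y))
```

Each of these settings can change the conclusions drawn, which is why expertise (or careful automation) is needed before the results can be trusted.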
My earliest introduction to Apache Hadoop, several years ago, was using Hadoop with one of the popular NoSQL databases to build data acquisition pipelines for a semantic search engine for a potential customer. Originally, I had used classic ETL and database tools, but the resulting data acquisition, cleansing, and entity extraction pipeline took days to run over a workload of several million medium-to-large XML documents. Like many others adopting NoSQL, we could have scaled up our relational approach with expensive server and software purchases, but budget constraints meant that we had to look at other alternatives.
I started to experiment with a mix of Hadoop components, open source ETL tools, XSLT (as the source data was archives of XML documents), and NoSQL technologies, along with custom Java components for entity detection and custom plugins for the particular ETL software. The resulting prototype solution performed the same corpus preparation in hours, even on a small proof-of-concept cluster.
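The original pipeline used custom Java components and ETL plugins; as a greatly simplified sketch of the idea, a Hadoop Streaming mapper could extract candidate entities from XML records. This sketch assumes, purely for illustration, one small XML document per input line and a trivial capitalized-token rule for “entities”:

```python
#!/usr/bin/env python
# A simplified Hadoop Streaming mapper for corpus preparation. It is not the
# original Java/ETL implementation: it assumes one XML document per input
# line and uses a toy "capitalized token" rule to stand in for real entity
# detection.
import sys
import re
import xml.etree.ElementTree as ET

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        doc = ET.fromstring(line)
    except ET.ParseError:
        continue  # skip malformed records rather than failing the task
    text = " ".join(doc.itertext())
    for token in re.findall(r"\b[A-Z][a-zA-Z]+\b", text):
        # Emit (entity, 1) pairs; a reducer sums the counts per entity.
        print(f"{token}\t1")
```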
For many approaching Hadoop solutions for the first time, the natural tendency is to view Hadoop through the lens of just one facet of existing technologies, such as databases, machine learning, cloud computing, distributed storage, or distributed computing. While this can be very useful as a learning tool, it can lead to misconceptions about the Hadoop components and ecosystem, or about the applicable use cases.