Data products are the driving force behind new multi-billion-dollar companies, and many of the things we do on a day-to-day basis have machine learning algorithms behind them. Unfortunately, even though data science is a concept invented in the 21st century, in practice the state of data science today is closer to that of software engineering in the mid-20th century.
The pioneers of data science did a great job of making it accessible and fairly easy to pick up, but since its beginning circa 2005, not much effort has been made to bring it up to par with modern software engineering practices. Machine learning code is still code, and like any software that reaches production, it should follow standard software engineering practices such as modularity, maintainability, and quality (among many others).
The first thing to understand is that data science is mostly manipulation of data. The data is usually large-scale and complex, but the manipulations themselves are commonly found in non-data-science code bases as well. Moreover, since a data scientist often doesn't know in advance which functionality will achieve the desired result (for example, whether the right metric is the mean or the median), the argument for modular code becomes much stronger.
The second proposition is that the current tooling data scientists use is inadequate for the type of systems that end up in production. Real-world data processing is in many cases at least as complicated as regular software tasks such as fetching data from a database, passing messages between mobile devices, or throttling the bit rate of a streamed video, yet the tools used for it sit at two unsatisfying extremes. At one end of the spectrum are the SQL-like Hive and the very basic scripting language Pig; although these languages can be extended with user-defined functions, those functions are ad hoc in nature and very difficult to reuse. At the other end is vanilla Java MapReduce, which requires an enormous amount of boilerplate code that has very little to do with the actual desired functionality.
In my upcoming tutorial session at Strata Santa Clara, Effective Data Science with Scalding, I will propose a more modern alternative that combines the best of both worlds: as concise and high-level as Pig and Hive, yet with the full power of a programming language like Java. The session will focus on Scalding, a high-level abstraction framework over MapReduce written in Scala. It has all of the functionality of Pig and Hive, often achieved with fewer lines of code. The fact that it is written in Scala, a modern language on the Java Virtual Machine, is a great advantage, since all existing Java libraries can be reused.
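To give a flavor of the conciseness I mean, here is the canonical word-count pipeline sketched with plain Scala collections. This is not Scalding itself (a real Scalding job runs the same shape of map/group/reduce pipeline distributed over Hadoop), just an illustration of how the functional collection style expresses in a few lines what takes pages of boilerplate in vanilla Java MapReduce:

```scala
// A sketch only: plain Scala collections standing in for a distributed
// pipeline. The object and method names here are hypothetical.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // "map" phase: tokenize each line
      .filter(_.nonEmpty)                   // drop empty tokens
      .groupBy(identity)                    // "shuffle": group identical words
      .map { case (word, occurrences) =>    // "reduce" phase: count per word
        (word, occurrences.size)
      }
}
```

The same flatMap/groupBy/reduce vocabulary carries over almost verbatim to Scalding's API, which is a large part of why it is so quick to pick up.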
This tutorial will benefit aspiring data scientists who want to learn the ropes of data wrangling using one of the fastest-growing MapReduce frameworks, practicing data scientists who want to improve their craft and pick up functional programming concepts such as monoids and monads, and even technical managers who want to understand the tradeoffs between the various Hadoop technologies. It will be a very practical session: you will learn to read and write data from different sources, apply filtering and various transformations, aggregate data using simple yet powerful data design patterns, and apply sophisticated machine learning models through interoperability with existing Java machine learning libraries.
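As a small taste of why monoids matter for aggregation: a monoid is just an associative combine operation with an identity element, and that associativity is exactly what makes it safe to aggregate data in shards and merge the partial results in any order, as a map-side combiner does. A minimal sketch (the trait and helper below are hypothetical, not Scalding's own API):

```scala
// Hypothetical minimal Monoid: an identity element plus an associative
// combine operation.
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

object Monoid {
  // Integer addition forms a monoid with identity 0.
  val intSum: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(a: Int, b: Int): Int = a + b
  }

  // Fold any collection with a monoid. Because plus is associative,
  // this could equally be computed per shard and the shard totals merged,
  // which is what makes monoid aggregation parallelizable.
  def sum[T](xs: Seq[T])(m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)
}
```

Summing a whole sequence and merging the sums of two halves give the same answer, which is the property distributed aggregation relies on.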
Hope to see you there!