ENTRIES TAGGED "data science"
It's easier to "discover" features with tools that have broad coverage of the data science workflow
Interface languages: Python, R, SQL (and Scala)
This is a great time to be a data scientist or data engineer who relies on Python or R. For starters there are developer tools that simplify setup, package installation, and provide user interfaces designed to boost productivity (RStudio, Continuum, Enthought, Sense).
Increasingly, Python and R users can write the same code and run it against many different execution1 engines. Over time the interface languages will remain constant but the execution engines will evolve or even get replaced. Specifically there are now many tools that target Python and R users interested in implementations of algorithms that scale to large data sets (e.g., GraphLab, wise.io, Adatao, H20, Skytree, Revolution R). Interfaces for popular engines like Hadoop and Apache Spark are also available – PySpark users can access algorithms in MLlib, SparkR users can use existing R packages.
In addition many of these new frameworks go out of their way to ease the transition for Python and R users. wise.io “… bindings follow the Scikit-Learn conventions”, and as I noted in a recent post, with SFrames and Notebooks GraphLab, Inc. built components2 that are easy for Python users to learn.
Hardcore Data Science speakers provided many practical suggestions and tips
One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and make the video of the talks available as a bundle. In the meantime here are some notes and highlights from a day packed with great talks.
We’ve come to think of analytics as being comprised primarily of data and algorithms. Once data has been collected, “wrangled”, and stored, algorithms are unleashed to unlock its value. Longtime machine-learning researcher Alice Zheng of GraphLab, reminded attendees that data structures are critical to scaling machine-learning algorithms. Unfortunately there is a disconnect between machine-learning research and implementation (so much so, that some recent advances in large-scale ML are “rediscoveries” of known data structures):
While there are many data structures that arise in computer science, Alice devoted her talk to two data structures1 that are widely used in machine-learning:
It has roots in academic scientific computing, but has features that appeal to many data scientists
As I noted in a recent post on reproducing data projects, notebooks have become popular tools for maintaining, sharing, and replicating long data science workflows. Much of that is due to the popularity of IPython1. In development since 2001, IPython grew out of the scientific computing community and has slowly added features that appeal to data scientists.
Roots in academic scientific computing
As IPython creator Fernando Perez noted in his “historical retrospective”, exploratory analysis in a scientific setting requires a solid interactive environment. After years of development IPython has become a great tool for interacting with data. IPython also addresses other important pain points for scientists – reproducibility and collaboration – issues that are equally important to data scientists working in industry.
IPython is more than just Python
With an interactive widget architecture that’s 100% language-agnostic, these days IPython is used by many other programming language communities2, including Julia, Haskell, F#, Ruby, Go, and Scala. If you’re a data scientist who likes to mix-and-match languages, you can create, maintain, and share multi-language data projects in IPython:
In the summer of 2012, Accel Partners hosted an invitation-only Big Data conference at Stanford. Ping Li stood near the exit with a checkbook, ready to invest $1MM in pitches for real-time analytics on clusters. However, real-time means many different things. For MetaScale working on the Sears turnaround, real-time means shrinking a 6 hour window on a mainframe to 6 minutes on Hadoop. For a hedge fund, real-time means compiling Python to run on GPUs where milliseconds matter, or running on FPGA hardware for microsecond response.
With much emphasis on Hadoop circa 2012, one might think that no other clusters existed. Nothing could be further from the truth: Memcached, Ruby on Rails, Cassandra, Anaconda, Redis, Node.js, etc. – all in large-scale production use for mission critical apps, much closer to revenue than the batch jobs. Google emphasizes a related point in their Omega paper: scheduling batch jobs is not difficult, while scheduling services on a cluster is a hard problem, and that translates to lots of money.
Open source, distributed computing tools speedup an important processing pipeline for genomics data
As open source, big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact. I recently came across a compelling open source project from UC Berkeley’s AMPLab: ADAM is a processing engine and set of formats for genomics data.
Second-generation sequencing machines produce more detailed and thus much larger files for analysis (250+ GB file for each person). Existing data formats and tools are optimized for single-server processing and do not easily scale out. ADAM uses distributed computing tools and techniques to speedup key stages of the variant processing pipeline (including sorting and deduping):
Very early on the designers of ADAM realized that a well-designed data schema (that specifies the representation of data when it is accessed) was key to having a system that could leverage existing big data tools. The ADAM format uses the Apache Avro data serialization system and comes with a human-readable schema that can be accessed using many programming languages (including C/C++/C#, Java/Scala, php, Python, Ruby). ADAM also includes a data format/access API implemented on top of Apache Avro and Parquet, and a data transformation API implemented on top of Apache Spark. Because it’s built with widely adopted tools, ADAM users can leverage components of the Hadoop (Impala, Hive, MapReduce) and BDAS (Shark, Spark, GraphX, MLbase) stacks for interactive and advanced analytics.
A Sneak Peek
Data products are the driving force behind new multi-billion dollar companies and a lot of the things we do today on a day-to-day basis have machine learning algorithms behind them. But unfortunately, even though data science is a concept invented in the 21st century, in practice the state of data science is more similar to software engineering in mid 20th century.
The pioneers of data science did a great job of making it very accessible and fairly easy to pick up, but since it’s beginning circa 2005, not much effort has been made to bring it up to par with modern software engineering practices. Machine learning code is still code, and as any software that reaches production environments it should follow standard software engineering practices, like modularity, maintainability and quality (among many others).
The Delite framework has produced high-performance languages that target data scientists
An important reason why pydata tools and Spark appeal to data scientists is that they both cover many data science tasks and workloads (Spark users can move seamlessly between batch and streaming). Being able to use the same programming style and syntax for workflows that span a variety of tasks is a huge productivity boost. In the case of Spark (and Hadoop), the emergence of a variety of scalable analytic engines have made distributed computing applications much easier to build.
Delite: a framework for embedded, parallel, and high-performance DSLs
Another way to boost productivity is to use a family of high-performance languages that cover many data science tasks. Ideally you want languages that allow programmers to focus on applications (not on low-level details of parallel programming) and that can run efficiently on different machines and architectures1 (CPU, GPU). And just like pydata and Spark, syntax and context-switching shouldn’t get in the way of tackling complex data science workflows.
The Delite framework from Stanford’s Pervasive Parallelism Lab (PPL) has been used to produce a family of high-performance domain specific languages (DSLs) that target different data analysis tasks. DSLs are programming languages2 with restricted expressiveness (for a particular domain) and tend to be high-level in nature (they are often declarative and deterministic). Delite is a compiler and runtime infrastructure that allows language designers to use aggressive, domain-specific optimizations to deliver high-performance DSLs. Using Delite, the team at Stanford produced DSLs embedded in a functional language (Scala) with performance results comparable to hand-optimized implementations (e.g. MATLAB, LINQ) across different domains.
Analytic services are tailoring their solutions for specific problems and domains
In relatively short order Amazon’s internal computing services has become the world’s most successful cloud computing platform. Conceived in 2003 and launched in 2006, AWS grew quickly and is now the largest web hosting company in the world. With the recent addition of Kinesis (for stream processing), AWS continues to add services and features that make it an attractive platform for many enterprises.
A few other companies have followed a similar playbook: technology investments that benefit a firm’s core business, is leased out to other companies, some of whom may operate in the same industry. An important (but not well-known) example comes from finance. A widely used service provides users with clean, curated data sets and sophisticated algorithms with which to analyze them. It turns out that the world’s largest asset manager makes its investment and risk management systems available to over 150 pension funds, banks, and other institutions. In addition to the $4 trillion managed by BlackRock, the company’s Aladdin Investment Management system is used to manage1 $11 trillion in additional assets from external managers.
New tools make it easier for companies to process and mine streaming data sources
Stream processing was in the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks, are behind the recent chatter. Through a combination of simpler deployment tools, programming interfaces, and libraries, recently released tools make it easier for companies to process and mine streaming data sources.
Of the distributed stream processing systems that are part of the Hadoop ecosystem0, Storm is by far the most widely used (more on Storm below). I’ve written about Samza, a new framework from the team that developed Kafka (an extremely popular messaging system). Many companies who use Spark express interest in using Spark Streaming (many have already done so). Spark Streaming is distributed, fault-tolerant, stateful, and boosts programmer productivity (the same code used for batch processing can, with minor tweaks, be used for realtime computations). But it targets applications that are in the “second-scale latencies”. Both Spark Streaming and Samza have their share of adherents and I expect that they’ll both start gaining deployments in 2014.
Popular approaches for reproducing, managing, and deploying complex data projects
As I talk to people and companies building the next generation of tools for data scientists, collaboration and reproducibility keep popping up. Collaboration is baked into many of the newer tools I’ve seen (including ones that have yet to be released). Reproducibility is a different story. Many data science projects involve a series of interdependent steps, making auditing or reproducing1 them a challenge. How data scientists and engineers reproduce long data workflows depends on the mix of tools they use.
The default approach is to create a set of well-documented programs and scripts. Documentation is particularly important if several tools and programming languages are involved in a data science project. It’s worth pointing out that the generation of scripts need not be limited to programmers: some tools that rely on users executing tasks through a GUI also generate scripts for recreating data analysis and processing steps. A recent example is the DataWrangler project, but this goes back to Excel users recording VBA macros.