Reproducing Data Projects

Popular approaches for reproducing, managing, and deploying complex data projects

As I talk to people and companies building the next generation of tools for data scientists, collaboration and reproducibility keep popping up. Collaboration is baked into many of the newer tools I’ve seen (including ones that have yet to be released). Reproducibility is a different story. Many data science projects involve a series of interdependent steps, making auditing or reproducing (1) them a challenge. How data scientists and engineers reproduce long data workflows depends on the mix of tools they use.

Scripts
The default approach is to create a set of well-documented programs and scripts. Documentation is particularly important if several tools and programming languages are involved in a data science project. It’s worth pointing out that the generation of scripts need not be limited to programmers: some tools that rely on users executing tasks through a GUI also generate scripts for recreating data analysis and processing steps. A recent example is the DataWrangler project, but this goes back to Excel users recording VBA macros.
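The kind of well-documented script described above might look like the following minimal sketch. The function names, docstrings, and toy data are illustrative only; the point is that each step of the analysis is a named, documented unit, so the file itself records how the result was produced and can be rerun end to end.

```python
"""A hypothetical self-documenting pipeline script (names are illustrative)."""

def load(raw_lines):
    """Parse raw CSV-style lines into lists of fields."""
    return [line.split(",") for line in raw_lines]

def clean(records):
    """Drop malformed rows and coerce values to float."""
    cleaned = []
    for rec in records:
        if len(rec) == 2:
            try:
                cleaned.append((rec[0].strip(), float(rec[1])))
            except ValueError:
                pass  # skip rows whose value field isn't numeric
    return cleaned

def summarize(records):
    """Compute the mean value -- the quantity this analysis reports."""
    values = [v for _, v in records]
    return sum(values) / len(values)

if __name__ == "__main__":
    # Toy input standing in for a raw data file.
    raw = ["a,1.0", "b,2.0", "bad row", "c,3.0"]
    print(summarize(clean(load(raw))))
```

Because every step is an ordinary function, the same file doubles as documentation and as the reproducible artifact: rerunning it replays the whole analysis.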

Workflow tools
Scripts for executing interdependent data science tasks are usually tied together by workflow tools. Tools like Chronos and Ambrose simplify the creation, maintenance, and monitoring of long data workflows. And at least within Airbnb, Chronos is heavily used by business analysts.

By having users focus on discrete tasks that are available through a UI, there’s no reason why business users can’t stitch together data science projects that involve advanced analysis. Startup Alpine Data Labs uses a “workflow interface” to enable business users to execute, reproduce, and audit complex analytic projects.
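Under the hood, what these workflow tools manage is a DAG of tasks executed in dependency order. A minimal sketch of that idea, using Python's standard-library topological sorter (the task names are illustrative, not any particular tool's API):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A toy workflow: each task maps to the set of tasks it depends on.
deps = {
    "ingest": set(),
    "clean": {"ingest"},
    "features": {"clean"},
    "report": {"features", "clean"},
}

def run_workflow(deps):
    """Execute tasks in dependency order; return the order they ran in."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        # A real workflow tool would launch the task here, handle
        # scheduling, retries, and monitoring -- we just record the order.
        pass
    return order
```

Rerunning the same DAG over the same inputs is what makes a workflow auditable and reproducible; the zoomable DAG viewers mentioned below are essentially visualizations of a structure like `deps`.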

Notebooks and Workbooks
Notebooks and workbooks are increasingly (2) being used to reproduce, audit, and maintain data science workflows. I wrote about IPython notebooks in an earlier post, and I continue to see and hear about them from many users. Notebooks mix text (documentation), code, and graphics in one file, making them natural tools for maintaining complex data projects. Along the same lines, many tools aimed at business users have some notion of a workbook: a place where users can save their series (3) of visual analysis, data import, and wrangling steps. These workbooks can then be viewed and copied by others, and they also serve as a place where many users can collaborate.

What next
Notebooks, workbooks, and workflow tools are the most popular methods for managing complex data projects. There are bits that I like from each of them. Notebooks are cool because narrative and documentation are extremely easy to incorporate (after all, the original Mathematica notebooks were used for instructional purposes). Workflow tools can be used to peruse data projects composed of many steps (nowadays they usually come with a zoomable DAG viewer). Chronos has tools designed to make it easy to manage production workflows (scheduling, retries, dependencies). Many workbooks (associated with commercial software products) have collaboration, security, and sharing built in.

Beyond a mashup of features from these three popular approaches, are there completely different ways of thinking about reproducibility, lineage, sharing, and collaboration in the data science and engineering context? I certainly hope that there are startups rethinking these problems. The good news is that academic researchers are developing tools for lineage and provenance for computational tasks. So maybe a new wave of startups and open source projects is on the way.

(1) Reproducibility includes redoing an analysis at a later date, or copying and forking portions of a long data science project.
(2) I would even venture to guess that notebooks and workbooks are the most popular means of handling data science workflows.
(3) In essence, a workbook is a place where users save all the data analysis and processing they’ve performed.

  • M. Edward (Ed) Borasky (http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/)

    Reproducibility is something you have to design into a project from the very beginning. It isn’t something you can ‘tack on later when the tools get better’. Project managers especially have to allocate resources to assuring reproducibility, or it won’t get baked into the project.

    • Ben Lorica

      I agree! The post was a summary of *some* common tools and strategies for tackling it.

  • Symeon Papadopoulos

    Another challenge with reproducibility stems from the fact that the data sources themselves are not always reproducible (available). For instance, if the data source for your project is the Twitter stream, then you are simply not allowed to share the data used as input (sharing the tweet IDs is one way to deal with this, though it comes with its own problems, e.g., the long time it takes to reconstruct the dataset, many missing/deleted tweets, etc.).

  • Bruno Aziza

    Great points here Ben.

    The process of data science is long overdue for reusability, code-free software, and collaborative approaches. Your comment on business analysts and executives is spot on here too, because, if they are included and understand how it all works, they can then help evangelize advanced analytics across the organization.

    Communication is key here – to this point, our Lead Data Scientist recently authored a piece on how data scientists can develop these soft skills. I call it ‘the soft stuff is the hard stuff’. See it @ http://bit.ly/1hD2yOZ

    Hope this helps,
    Bruno Aziza
    Alpine Data Labs

  • Raymie Stata

    Nice post. One fun tool that sits somewhere between scripting and workflow is Drake (https://github.com/Factual/drake). It’s “make” for data files.