Data Wrangling gets a fresh look

We are in the early days of productivity technology in data science

Data analysts have long lamented the amount of time they spend on data wrangling. Rightfully so, as some estimates suggest they spend a majority of their time on it. The problem is compounded by the fact that these days, data scientists are encouraged to cast their nets wide, and investigate alternative (unstructured) data sources. The general perception is that data wrangling is the province of programmers and data scientists. Spend time around Excel users and you’ll learn that they do quite a bit of data wrangling too!

In my work I tend to write scripts and small programs to do data wrangling. That usually means some combination1 of SQL, Python, and Spark2. I’ve played with Google Refine (now called OpenRefine) in the past, but I found the UI hard to get used to. Part of the problem may have been that I didn’t use the tool often3 enough to become comfortable.

For most users data wrangling still tends to mean a series of steps that usually involves different tools (e.g., you often need to draw charts to spot outliers and anomalies). As I’ve pointed out in previous posts, workflows that involve many different tools require a lot of context-switching, which in turn affects productivity and impedes reproducability.

We are washing our data at the side of the river on stones. We are really in the early, early ages of productivity technology in data science.
Joe Hellerstein (Strata-NYC 2012), co-founder and CEO of Trifacta

Startups to the rescue
If data scientists spend a majority of time on data wrangling, the right set of tools can significantly boost their productivity (and thus their companies bottom lines). In addition, by empowering business users to do many routine data wrangling tasks, they are able to “play” with many more data sources without overburdening their IT departments.

Visual Analysis startups Platfora, Datameer, and ClearStory all offer some tools for data wrangling. ClearStory offers a catalog of external data sets that users can easily combine (“harmonize”) with their own data. SiSense has a set of data transformation functions, along with tools for combining data sets (“join recommendations”).

While Visual Analysis tools are recognizing the need to simplify data wrangling, there are many data sets that are beyond the scope of their simple data preparation4 utilities. Luckily some startups are focused on data preparation and data wrangling. With Paxata’s data preparation tools, analysts are able to quickly focus on analytics (instead of data wrangling). In particular users of Paxata5 are able to do many of the data preparation tasks they previously asked their IT colleagues to do. Dataiku6 offers a product called Data Science Studio that comes with statistical analysis tools for data cleansing. Trifacta is a SF-based startup that is building tools to simplify data transformation and preparation. The founders of Trifacta are the academic team behind the highly regarded DataWrangler project.

Related content:

(1) I rarely use stats tools like R for data wrangling, although years ago I remember sometimes using SAS (proc sql and the data step).
(2) Data processing and wrangling is one of the common uses of Spark.
(3) Your need a certain number or “reps” (repetitions) to begin appreciating a tool like OpenRefine.
(4) Preparation prior to data analysis.
(5) In the case of Paxata this includes filter, transform, cluster, and merge.
(6) Update (11/24/2013): Hat tip to Sam Bessalah.

Strata SC 2014 Image

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.

Strata in Santa Clara: February 11-13 | Santa Clara, CA

tags: , , ,
  • Sam Bessalah

    Hi Ben, nice post. Just wanted to mention Dataiku as one startup working in that direction as well. With a full data wrangling cycle studio on a sleek web interface. There’s clearly a lot of demand for those kind of tools, and it’s nice to startup focus on this, and not only on big processing tools or machine learning compilation of algorithms.

    • DataFoo

      Dataiku looks very much like Google Refine – is this a clone?

      • Pedro Magalhaes

        Google Refine and Data Wrangler from Stanford…

      • Olivier Grisel

        It’s more like an integrated platform that features:

        - a spreadsheet UI a la Open Refine / Data Wrangler for introspection and plotting of in-memory subsamples of large HDFS / SQL data sources,

        - a dependency dataset manager a la Luigi from Spotify that can schedule processing on a Hadoop cluster (via Hive, Pig, Python scripts and others),

        - a predictive modeling tool and interactive data modeling studio that leverages pandas, scikit-learn, IPython notebook and R.

        and more.

        It does not reuse Luigi nor Open Refine / Data Wrangler.

  • George K. Mathew

    Good post Ben. Check out Alteryx when you get a chance. It crushes every data wrangling problem thrown at & been serving data analysts at scale for years.

  • Sonal

    Thanks for this informative post Ben. I wanted to add Reifier as a large scale data deduplication and record linkage tool aiming to ease the lives of data analysts. Please check