Data Science tools: Are you “all in” or do you “mix and match”?

Sticking with an integrated stack helps reduce context-switching during long data science workflows.

An integrated data stack boosts productivity
As I noted in my previous post, Python programmers willing to go “all in” have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning a non-trivial set of tools1. I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. But being able to stick with the same programming language and environment is a definite productivity boost: less “setup time” is needed to explore data using different techniques (viz, stats, ML).
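To make the “less setup time” point concrete, here is a minimal sketch of staying in one stack — pandas for wrangling and quick stats, scikit-learn for modeling — with no switch between tools or languages. The toy data and column names are my own illustration, not from any particular project.

```python
# One session, one language: wrangle, summarize, and model
# without leaving the Python data stack.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Wrangle: build a small frame and derive a feature, all in pandas.
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5],
                   "score": [52, 55, 61, 64, 70]})
df["hours_sq"] = df["hours"] ** 2

# Stats: summary statistics come for free.
print(df["score"].describe())

# ML: fit a regression in the same session, on the same frame.
model = LinearRegression().fit(df[["hours", "hours_sq"]], df["score"])
pred = model.predict(pd.DataFrame({"hours": [6], "hours_sq": [36]}))
print(pred)

# Viz would be one more line in the same session, e.g.:
# df.plot.scatter(x="hours", y="score")
```

The point isn’t the model — it’s that exploring the same data with viz, stats, and ML requires no change of environment, which is where the productivity boost comes from.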

Multiple tools and languages can impede reproducibility and flow
On the other end of the spectrum are data scientists who mix and match tools, drawing on packages and frameworks from several languages. Depending on the task, they can avail themselves of tools that are scalable, performant, require less2 code, and contain a lot of features. On the other hand, this approach requires much more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce3 analysis projects, and impedes knowledge transfer4 within a team of data scientists. Frequent context-switching also makes it harder to stay in a state of flow, as one has to think about implementation/package details instead of exploring data. It can be harder to discover interesting stories in your data if you’re constantly having to think about what you’re doing. (It’s still possible, you just have to concentrate a bit harder.)

Some tools that cover a range of data science tasks
More tools that integrate different data science tasks are starting to appear. SAS has long provided tools for data management and wrangling, business intelligence, visualization, statistics, and machine-learning. For massive5 data sets, a new alternative to SAS is ScaleR from Revolution Analytics. Within ScaleR, programmers use R for data wrangling (rxDataStep), data visualization (basic viz functions for big data), and statistical analysis (it ships with a variety of scalable statistical algorithms).

Startup Alpine Data Labs lets users connect to a variety of data sources, manage their data science workflows, and access a limited set of advanced algorithms. Upstart BI vendors Datameer and Platfora provide data wrangling and visualization tools. Datameer also provides easy data integration with a variety of structured/unstructured data sources, along with analytic functions and PMML support for executing predictive analytics. The release of MLbase this summer adds machine-learning to the BDAS/Spark stack, which currently covers data processing, interactive (SQL) analysis, and streaming analysis.

What does your data science toolkit look like? Do you mainly use one stack or do you tend to “mix and match”?



(1) This usually includes matplotlib or Bokeh, Scikit-learn, Pandas, SciPy, and NumPy. And since Python is a general-purpose language, you can even use it for data acquisition (e.g., web crawlers or web services).
(2) An example would be using R for viz or stats.
(3) This pertains to all data scientists, but is particularly important for those of us who use a wide variety of tools. Unless you document things properly, when you’re using many different tools even the results of very recent analysis projects can be hard to reproduce.
(4) Regardless of the tools you use, everything starts with knowing something about the lineage and provenance of your data set – something Loom attempts to address.
(5) A quick and fun tool for exploring smaller data sets is the just released SkyTree Adviser. After users perform data processing and wrangling in another tool, SkyTree Adviser exposes machine-learning, statistics, and statistical graphics through an interface that is accessible to business analysts.


  • http://twitter.com/AnthonyNystrom Anthony Nystrom

    I just recently gave a presentation on “Data Science in the NOW, it takes an army of tools”… Interestingly, you really DO NEED to mix and match, even for rather non-complex data work, transformations, etc. Intridea is soon releasing a Data Science Workstation, a Vagrant VM that comes loaded with most of the tools and languages one would need to get started. It will be structured much like my Python development Vagrant bootstrap. -> http://anthonynystrom.github.com/python-dev-bootstrap/

    • http://gplus.to/cflynn Charles Flynn

      Anthony, the data science workstation sounds very interesting. Where can I find more information?

      • http://twitter.com/AnthonyNystrom Anthony Nystrom

        Hi Charles! Thanks for asking… We will be releasing it soon. However, if you want more detail and want to try it out, just send me a connection request on LinkedIn… Cool?

        • http://twitter.com/flynn_cc Charles Flynn

          Thanks Anthony! Invitation sent, looking forward to learning more.

  • http://twitter.com/kkrugler Ken Krugler

    I think you need to separate out the tools used during exploration/early analysis from what’s used in production. Almost always you’ll want to use one set (e.g. R) during the initial phase, and a different set (e.g. Cascading) in production.

    But once you’re in production, having one consistent tool chain does become important. Switching between Python and Java in an ETL workflow (or Pig/Hive and some arbitrary UDF) makes things less consistent & stable, in my experience.