ENTRIES TAGGED "statistics"
The importance of data science tools that let organizations easily combine, deploy, and maintain algorithms
Data science often depends on data pipelines, that involve acquiring, transforming, and loading data. (If you’re fortunate most of the data you need is already in usable form.) Data needs to be assembled and wrangled, before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie), who manage1 pipelines for data scientists and analysts.
A workflow tool for data analysts: Chronos from airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant2, and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within airbnb, it’s heavily used by analysts.
Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job), and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts3 define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs – a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived4 experiments (e.g. A/B tests or multi-armed bandits).
Quickly perform and interpret the results of routine Small Data analysis
With so much focus on Big Data, the needs of many analysts who work with Small Data tend to get ignored. The default tool for many of these users remains spreadsheets1 and/or statistical packages which come with a lot of features and options. However many analysts need a very small subset of what these tools have to offer.
Enter Statwing, a software-as-a-service provider for routine statistical analysis. While the tool is still in the early stages, it can already do many basic “data analysis” tasks.
Consider the following example of a pivot table constructed in Excel: this required 8 mouse-clicks, if you do everything perfectly, and about 5 decisions (what variables to include, what metric to use, …)
The same task in Statwing required 4 mouse-clicks and 0 decisions! Plus it comes with visuals:
The lack of clutter and the addition of a simple “headline” (“Female tends to have much higher values for satisfaction than Male“), makes the result much easier to interpret. The advanced tab contains detailed statistical analysis (in this case the p-value, counts, values). Many users get confused by the output/results produced by traditional statistical software. Let’s face it, many analysts have had little training in statistics. I welcome a tool that produces readily interpretable results.
The company hopes to replicate the above example across a wide variety of routine data analysis tasks. Their initial focus is on tools for (consumer) survey analysis, a potentially huge market given that online companies have made surveys so much easier to conduct. Users of Statwing pay a small monthly subscription, making it cheaper than most2 statistical packages. For a small monthly fee, their intuitive UI lets analysts get their tasks done quickly. More importantly Statwing may nurture aspiring data scientists in your organization.
(1) As this recent Strata presentation points out: Spreadsheets are the glue that keeps many organizations together.
(2) Open source tools like OpenOffice, R and Octave are free. So is the use of Google spreadsheets.
IA Ventures success, MathJax display engine, statistical literacy, and making big data more human
IA Ventures raises a huge first-time fund; MathJax provides an open source mathematical display engine; Kevin Drum shares 10 statistics pitfalls; and Paul Bradshaw explains how to bring big data down to a human scale.
Trading platforms, truth in graphs, European financial stats, and Mandelbrot's passing.
In this edition of Strata Week: The London Stock Exchange moves from .Net to open source; learn how graphical scales can lie; the Euroean Central Bank president calls for better financial statistics; and we bid farewell to the father of fractals.
If you haven’t seen it, I recommend you watch Andy Bechtolsheim’s keynote at the recent Mysqlconf. We covered SSD’s in our just published report on Big Data management technologies. Since then, we’ve gotten additional signals from our network of alpha geeks and our interest in them remains high. I had a chance to visit with Dataspora founder and blogger Mike Driscoll, an enthusiastic advocate for the use of the open source statistical computing language, R.