Human judgment is at the center of successful data analysis. This statement might initially seem at odds with the current Big Data frenzy and its focus on data management and machine learning methods. But while these tools provide immense value, it is important to remember that they are just that: tools. A hammer does not a carpenter make — though it certainly helps.
Consider the words of John Tukey [1], possibly the greatest statistician of the last half-century: “Nothing — not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers — nothing can substitute here for the flexibility of the informed human mind. Accordingly, both approaches and techniques need to be structured so as to facilitate human involvement and intervention.” Tukey goes on to write: “Some implications for effective data analysis are: (1) that it is essential to have convenience of interaction of people and intermediate results and (2) that at all stages of data analysis the nature and detail of output need to be matched to the capabilities of the people who use it and want it.” Though Tukey and colleagues voiced these sentiments nearly 50 years ago, they ring even more true today. The interested analyst is at the heart of the Big Data question: how well do our tools help users ask better questions, formulate hypotheses, spot anomalies, correct errors and create improved models and visualizations? To “facilitate human involvement” across “all stages of data analysis” is a grand challenge for our age.
At Trifacta, a start-up based on collaborative research between Stanford and Berkeley, we’re particularly focused on striking a novel balance of visual interfaces, algorithmic support and scalable processing to radically accelerate data transformation. In interviews with analysts, we’ve observed that their work is highly iterative, but also heavily constrained by existing tools. Anomalies uncovered using visualization, for instance, may require redoubled efforts around data acquisition and cleaning. However, many analysts find it difficult to scale their visualization tools to the volumes of data they are working with.
Similarly, others note that the high latency of batch-oriented processing models like MapReduce stymies iterative exploration. Significant delays or unnecessarily complex user interfaces may impede not only the pace of analysis, but also its breadth and quality. To build more effective analysis tools, deep-seated systems choices must go hand-in-hand with user interface design.
Scalable, Interactive Visualization
As one step towards better analyst-centered tools, consider the problem of scaling interactive visualization to ever-larger databases. Even with only thousands — never mind billions — of data points, typical statistical graphics can become cluttered and hard to read. A common workaround is to subsample the data to produce a less cluttered view. For exploratory analysis, however, binned aggregation — subdividing the data domain into discrete bins and computing corresponding summaries — can provide much better overviews to aid human reasoning. Consider the following figures, which visualize over 4 million check-ins on a location-based social network service. The figure on the left shows the data visualized in Google Fusion Tables, which uses a stratified sampling scheme to reduce the data volume. The figure on the right shows the result of using binned aggregation to form a heatmap. We now see much more structure in the data, including check-ins along the interstate highway system and the activity of users in Mexico and Canada. In addition, we see an interesting chain of check-ins over the Gulf of Mexico, which represents an account created to track the progress of Hurricane Ike. Appropriate visualizations help us discover these types of otherwise subtle patterns and outliers.
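Binned aggregation is simple to sketch in code. The following toy Python snippet (with synthetic coordinates, not the actual check-in data behind the figures) shows the core idea: bucket every point into a fixed grid of bins and count, so that no observation is discarded the way it would be under subsampling.

```python
import numpy as np

# Hypothetical check-in coordinates: longitude/latitude pairs
# drawn uniformly over a bounding box around the continental U.S.
rng = np.random.default_rng(0)
lon = rng.uniform(-125, -65, 100_000)
lat = rng.uniform(25, 50, 100_000)

# Binned aggregation: subdivide the spatial domain into a discrete
# grid of bins and count the points falling in each bin. The result
# is a small, fixed-size summary (here 200 x 100) regardless of how
# many raw points went in -- exactly what a heatmap needs.
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=(200, 100),
    range=[[-125, -65], [25, 50]])

# Unlike subsampling, every point is accounted for, so sparse but
# meaningful regions (an interstate, a hurricane track) stay visible.
assert counts.sum() == len(lon)
```

The key design point is that the summary's size depends on the bin grid, not the data volume, which is what lets such overviews scale.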
Of course, visual representation is only one half of the equation. Users must also interact with the data in order to explore: to pan, zoom, and cross-filter to understand patterns spanning multiple dimensions. For example, how does check-in behavior vary over seasons or time of day? Querying big data to answer such questions can involve expensive backend operations, with high latencies that discourage exploration.
In response, our research group at Stanford (recently moved to the University of Washington) has developed methods for rapid interactive querying of visualizations, all within modern web browsers. We took inspiration from online mapping systems, which allow users to pan and zoom through the world by stitching together a patchwork of small image tiles. Similarly, we can chop up large data sets into much more manageable subsets.
Our approach differs from existing tile services in two important ways. First, rather than sending pre-rendered images, we send raw subsets of data that can be both rendered and queried. Second, rather than sending “flat” 2D data, we create tiles that contain multiple dimensions of data to support cross-filtering of variables (for example, to visualize geographic activity for all check-ins in the month of January). These “multivariate data tiles” allow us to send just the data we need over the network, helping manage big data sets. We then use fast query processing in the browser, currently implemented as WebGL shader programs for highly parallel processing. The resulting system, called imMens, provides scalable client-side visualization of billions of data points, including real-time interactive queries at a rate of 50 frames per second. To learn more, see the 2013 Strata NYC talk that Sean Kandel and I gave, or the academic research paper on imMens.
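The intuition behind multivariate data tiles can be sketched as a dense multidimensional count array. The Python sketch below is a toy illustration only: imMens itself partitions and queries tiles with WebGL shaders, and the dimensions and bin counts here are invented for the example.

```python
import numpy as np

# Hypothetical pre-binned check-ins: each record reduced to
# (longitude bin, latitude bin, month) indices.
rng = np.random.default_rng(1)
n = 50_000
lon_bin = rng.integers(0, 40, n)
lat_bin = rng.integers(0, 20, n)
month   = rng.integers(0, 12, n)

# A "data tile" as a dense count array with one cell per
# (lon_bin, lat_bin, month) combination. Shipping these counts
# instead of raw records bounds the payload by the grid size,
# not by the number of check-ins.
tile = np.zeros((40, 20, 12), dtype=np.int64)
np.add.at(tile, (lon_bin, lat_bin, month), 1)

# Cross-filtering on the client is then just slicing and summing:
january_heatmap = tile[:, :, 0]          # geographic activity in January
monthly_totals  = tile.sum(axis=(0, 1))  # check-ins per month, all locations

assert tile.sum() == n
```

Because filtering and aggregation reduce to array slicing and summation, these queries parallelize naturally, which is what makes a shader-based client implementation effective.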
The imMens system is just one example of applying an analyst-centered perspective to big data. Others have also been following suit: the BigVis package for R enables construction of binned visualizations, while the Nanocubes, MapD, and ScalaR projects use various database techniques to provide fast queries for interactive exploration.
Column-oriented databases and in-memory systems (e.g., Shark) can further reduce server-side latency, as can sketching, incremental updates (as in online aggregation), and sampling methods (as in BlinkDB) that provide fast, albeit approximate, aggregation queries.
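The sampling idea can be sketched in a few lines of Python. This is a toy uniform-sampling example with synthetic data, not BlinkDB's actual machinery (which relies on stratified samples and error bounds); it simply shows why aggregating a small sample and scaling up yields a fast, approximately correct answer.

```python
import numpy as np

# Hypothetical table of 10 million check-in timestamps (hour of day).
rng = np.random.default_rng(2)
hours = rng.integers(0, 24, 10_000_000)

# Exact aggregate: count of evening check-ins (hour >= 18).
exact = int((hours >= 18).sum())

# Approximate aggregate: scan only a 1% uniform sample (with
# replacement, for simplicity) and scale the count by 1/rate.
rate = 0.01
sample = hours[rng.integers(0, len(hours), int(len(hours) * rate))]
approx = int((sample >= 18).sum() / rate)

# The estimate lands within a small relative error of the exact
# answer while touching a hundredth of the data.
rel_err = abs(approx - exact) / exact
assert rel_err < 0.05
```

The trade-off is explicit: latency drops roughly in proportion to the sampling rate, at the cost of a quantifiable approximation error.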
Looking further, we must expand our focus to all stages of data analysis. Data discovery, data transformation, feature selection and model assessment are all tasks that could benefit from improved methods for “human involvement and intervention”. As the diversity, size and availability of relevant data sources continue to grow, so will the need for interactive tools that better match the capabilities of the analysts seeking to make sense of it all.
1. From John W. Tukey and Martin B. Wilk (1966). Data Analysis and Statistics: An Expository Overview. Included in The Collected Works of John W. Tukey, Volume IV: Philosophy and Principles of Data Analysis, 1965-1986.