Measuring a world-shaking trend with feet planted in every area of human endeavor cannot be achieved in a popular book of 200 pages, but one has to start somewhere. I am happy to recommend the adept efforts of Viktor Mayer-Schönberger and Kenneth Cukier as a starting point. Their recent book Big Data: A Revolution That Will Transform How We Live, Work, and Think (recently featured in a video interview on the O’Reilly Strata site) does not quite unravel the mystery of the zeal for recording and measurement that is taking over governments and business, but it does what a good popularization should: alert us to what’s happening, provide some frameworks for talking about it, and provide a launchpad for us to debate the movement’s good and evil.
Because readers of this blog have been grappling with these concerns for some time, I’ll provide the barest summary of topics covered in Mayer-Schönberger and Cukier’s extensive overview, then offer some complementary ideas of my own.
Summary of book topics
Some of the themes of Big Data that grabbed my interest include:
- New tools for measuring the world and people’s activities provide data sets that are many orders of magnitude larger than we are used to having, and computers tied together in clusters run novel techniques to find insights never available to us before.
- Data crunchers are finding correlations that provide useful guidance for actions. Mere correlation cannot tell us why something is happening, but often it doesn’t matter. The authors cite numerous examples where correlations by themselves suggested valuable actions.
- Because big data opens up efficiencies to those savvy enough to use it, the future of business belongs to huge organizations (including middlemen and aggregators) with the resources to collect both data and experts to manipulate it, or to smaller organizations who are nimble enough to make hay from open or cheaply available data sets.
- Control over our own lives may slip more and more from our hands as institutions use statistical insights to determine not only what we are doing, but what we are likely to do in the future. One chapter in Big Data is devoted to policy-related remedies.
- Old-timers’ intuitions are challenged by the findings of big data, just as Deep Blue’s brute-force processing of chess moves can overcome the world’s best human chess masters. Nevertheless, the authors end affirming the importance of human insight and choice.
These represent a grand agenda for one book (nor have I exhausted all its topics), but I’d like to jump ahead to ideas that Big Data stimulated for me, leaving it up to readers to get the book for themselves if they want to study all its conclusions. In the interest of full disclosure, I’ll mention that one author–Cukier, the data editor of The Economist–helped me get an article published several years ago.
Other aspects of big data
Mayer-Schönberger and Cukier’s view of traditional statistical techniques deserves a bit of examination. They tend to place these in opposition to newer techniques of crunching big data. According to their thesis, the old techniques were developed to deal with small samples and all the uncertainties they presented about representing the whole population. Those outdated assumptions compromise their applicability to a new age, where computers just iterate over the whole population. The authors even recount a suspicion of traditional statisticians voiced by one of their big data experts, New York City’s Mike Flowers, who was put off by statisticians’ interest in “arcane concerns about mathematical models” (p. 186).
Certainly, the authors say, there is a place for traditional statistics. It can even be used to run traditional experiments in order to validate suggestions made by big data crunching. But I think the relationship between old and new techniques is much tighter. This question has an important bearing on the power exerted by big data, because I believe proper techniques will be harder to learn and accurately apply than Mayer-Schönberger and Cukier suggest. While they expect the skills soon to become “commonplace” (pp. 125-126), I think there will be a crying shortage for some time, allowing a few large institutions with deep pockets to corner the market.
Let’s take the common big-data task of clustering, which might help in such situations as an art dealer trying to determine that Leonardo da Vinci is closer to El Greco in style than to Andy Warhol. Clustering algorithms can take a very long time to run, and choosing good starting points is important to reduce compute time. In fact, characteristics of the data can help a data scientist choose which algorithm to run in the first place. So what can provide a starting point for the big data venture? A traditional statistical analysis of a random sample could be a good choice.
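To make the idea concrete, here is a minimal sketch of my own (not something from the book), assuming synthetic one-dimensional data and a plain k-means loop. The data, the 5% sample size, and the function name kmeans are all hypothetical choices for illustration. Simple quantiles of a small random sample supply the starting centroids for the run over the whole population, and a naive start is run alongside so the two iteration counts can be compared.

```python
import random

def kmeans(points, start_centroids, max_iter=100):
    """Plain Lloyd's algorithm on 1-D points.

    Returns (centroids, iterations_used).
    """
    centroids = list(start_centroids)
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments stopped changing
            return centroids, iteration
        centroids = new_centroids
    return centroids, max_iter

rng = random.Random(0)

# Hypothetical "whole population": three groups centered at 0, 10, and 20.
population = [rng.gauss(mu, 1.0)
              for mu in (0.0, 10.0, 20.0)
              for _ in range(2000)]

# Traditional step: describe a small random sample (5% here) by its quantiles.
sample = sorted(rng.sample(population, 300))
seeds = [sample[len(sample) // 6],
         sample[len(sample) // 2],
         sample[5 * len(sample) // 6]]

# Big data step: run over the full population from those starting points.
final_centroids, iterations = kmeans(population, seeds)

# For contrast, a naive start that simply takes the first three records.
naive_centroids, naive_iterations = kmeans(population, population[:3])

print("sample-seeded:", [round(c, 2) for c in final_centroids], iterations)
print("naive start:  ", [round(c, 2) for c in naive_centroids], naive_iterations)
```

The sample costs almost nothing to analyze, yet it tells the full run roughly where the groups lie. Real clustering jobs work in many dimensions and call for more careful seeding, but the dependence on what the data looks like is the same.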
This extends throughout the field of data. Even the choice of the best sorting technique–a common exercise in the first classes for programming students–can vary depending on the characteristics of the data being sorted.
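A small sketch can illustrate the sorting point, assuming insertion sort as the example technique and two hypothetical inputs of 500 integers each: one almost in order, one thoroughly shuffled. The function name and the inputs are mine, chosen only for illustration. Counting comparisons shows how the same algorithm behaves very differently depending on the shape of its input.

```python
import random

def insertion_sort_comparisons(items):
    """Sort a copy of items with insertion sort.

    Returns (sorted_list, number_of_comparisons).
    """
    data = list(items)
    comparisons = 0
    for i in range(1, len(data)):
        current = data[i]
        j = i - 1
        # Shift larger elements right until current's slot is found.
        while j >= 0:
            comparisons += 1
            if data[j] <= current:
                break
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = current
    return data, comparisons

rng = random.Random(1)
n = 500

# Nearly sorted input: an ordered list with five adjacent pairs swapped.
nearly_sorted = list(range(n))
for _ in range(5):
    i = rng.randrange(n - 1)
    nearly_sorted[i], nearly_sorted[i + 1] = nearly_sorted[i + 1], nearly_sorted[i]

# Thoroughly shuffled input of the same size.
shuffled = list(range(n))
rng.shuffle(shuffled)

sorted_a, cost_nearly = insertion_sort_comparisons(nearly_sorted)
sorted_b, cost_shuffled = insertion_sort_comparisons(shuffled)

print("comparisons on nearly sorted input:", cost_nearly)
print("comparisons on shuffled input:     ", cost_shuffled)
```

On the nearly sorted list the number of comparisons stays close to the number of items, while on the shuffled list it grows roughly with the square of that number. Someone who knows in advance that the data arrives almost in order can pick a very different technique from someone who knows nothing about it.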
Big Data is not oriented toward this sort of technical discussion. The authors chose quite reasonably to avoid equations or other accoutrements of a mathematical explanation, which I’m sure would have scared off readers. And yet without some such background (to be sure, I’m no mathematician or statistician), one can’t determine the real strengths and weaknesses of the big data movement.
Let’s turn to the critical question of transparency, which Mayer-Schönberger and Cukier consider a necessity to help people challenge the decisions that others derive from data analysis. Transparency is no panacea, in my view. First, algorithms are incredibly complex. Second, as we’ve seen, the choice of algorithm (as well as the data to be analyzed) requires some subjective judgment, which is hard to challenge.
Worse still, any calculation affecting humans has winners and losers, and therefore makes some people eager to game the system. Big Data mentions the trivial example of orange used cars being found to be in better shape than other used cars (pp. 66-67), and points out how ridiculous it would be for car owners to paint the cars orange before selling them. This kind of dueling incentive applies across the board. It’s one reason Google doesn’t publish its search rank algorithm–and in fact, one reason it changes that algorithm on a daily basis.
Thus, instead of asking banks or insurers to reveal their decision-making processes, it may be better to give individuals access to data about themselves, and a process such as the “external algorithmist” proposed by Mayer-Schönberger and Cukier (p. 181) to allow individuals to present exculpatory evidence.
Both the external algorithmist and the internal algorithmist (similar to an ombudsman) envisioned by the authors are good additions to an organization. Dr. Brigitte Piniewski (whose work with an NIH-funded experiment in the community collection of health data was covered in a Strata article) suggests that the role of an algorithmist is a very difficult one, and in fact too big for a single person to fill. The algorithmist, in her assessment, must have some basic knowledge not only of statistics but of real-world disciplines such as physics and biology, in order to have a sense of what is possible and what analytic results are absurd on their face.
Even more important for algorithmists is a correct attitude: an innate skepticism that may be an inborn trait rather than something teachable. And she says this willingness to constantly challenge accepted beliefs must run throughout the organization, which cannot rely on a single expert to provide this corrective.
Because many institutional decisions take place in the background where the affected individuals never find out they took place at all (when have you ever learned that a marketer decided not to offer you a great deal?), the reactive approach has grave limitations. Furthermore, it’s hard to trust institutions to take self-corrective actions to preserve privacy and individual autonomy at this historical moment when we’re reeling from revelations that the IRS targeted institutions based on their political positions and the Justice Department gathered phone records across the board from Associated Press reporters.
My last cluster of concerns relates to the role of human intuition and creativity in an age of big data, where the authors end their book on a high note. It’s important to recognize that big data analysis consists of applying what happened in the past to what will happen in the future. Had there been no recent influenza outbreaks, Google could not have run the tests that produced its famous flu prediction algorithm.
And thus the data we collected in the past hangs over us. Suppose we analyze arrest and conviction records to determine whom we should target most heavily for policing. Guess what? African-Americans and Hispanics will get the bulk of police scrutiny, because the police have targeted them disproportionately for generations. In short, correlations could turn into textbook cases of self-fulfilling prophecies. As Mayer-Schönberger and Cukier say, we still need human intervention to think outside the box.
I think big data will accentuate today’s trend to differentiate between the commoditized and the innovative. As in manufacturing, we will see more and more decision-making calling on big data–but with crucial human correctives, as noted before.
The relation between invention and standardization is a bit like the promise of 3D printers such as the Makerbot. On the one hand, they allow flights of invention by clever tinkerers like never before. But the printers depend on microchips made in sterile labs by the millions, and other materials from large manufacturers. (The biggest US producer of polylactide filament is a subsidiary of Cargill.) Used properly, big data could similarly be the greatest contributor in history to personal innovation.