Data skepticism

If data scientists aren't skeptical about how they use and analyze data, who will be?

A couple of months ago, I wrote that “big data” is heading toward the trough of its hype curve, a victim of oversized expectations and promises. That’s certainly proving true: I see more expressions of skepticism about the value of data every day. Some of that skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the long history of science denial, from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.
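
To make Reiley’s point concrete, here’s a minimal simulation sketch in Python. The numbers and the targeting rule are my own toy assumptions, not anything from Reiley’s work: if ad targeting preferentially reaches people who were already likely to buy, a naive comparison of exposed and unexposed users badly overstates the ad’s true effect.

```python
import random

random.seed(0)

# Toy assumption: each user has a baseline purchase propensity, and the
# ad-targeting system is more likely to show the ad to likely buyers.
TRUE_AD_EFFECT = 0.02  # the ad's actual lift in purchase probability

exposed, unexposed = [], []
for _ in range(100_000):
    propensity = random.uniform(0.0, 0.2)       # baseline chance of buying
    saw_ad = random.random() < propensity * 4   # targeting favors likely buyers
    bought = random.random() < propensity + (TRUE_AD_EFFECT if saw_ad else 0.0)
    (exposed if saw_ad else unexposed).append(bought)

naive_lift = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
print(f"naive measured lift: {naive_lift:.3f} (true effect: {TRUE_AD_EFFECT})")
# Prints a lift several times larger than the true effect: the naive
# comparison attributes the targeting bias to the ad itself.
```

Randomized holdouts, the kind of experimental design Reiley works on, exist precisely to break this link between who sees the ad and who was going to buy anyway.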

Skepticism about data is normal, and it’s a good thing. If I had to give a one-line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So if we really want to do data science, skepticism has to be baked in. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. They have to be skeptical about models, skeptical about overfitting, and skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.
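
Here’s one small, self-contained illustration of that skepticism about models and overfitting (a toy sketch with made-up data, not from any real study): a high-degree polynomial fit to a handful of noisy points achieves near-zero training error, yet does worse than a simple line on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a simple linear signal (y = 2x) plus noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + rng.normal(0, 0.2, size=x_test.shape)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
# The degree-9 polynomial threads every training point but oscillates
# between them; low training error alone is no reason to trust a model.
```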

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.

This post originally appeared on O’Reilly Radar. It’s republished with permission.

  • Erik Larson (http://twitter.com/erikdlarson)

    This is a great article, and I think it would benefit from one small edit to the last sentence: “They have to be skeptical about how data is collected, _how_ the data is _biased_, and whether that data…is sufficient to give you a meaningful result.”

    Perhaps I am splitting unimportant hairs because I am only a dabbler in data sciences, but the idea of unbiased data seems almost impossible to me. Don’t all of the large data sets people are excited about have significant biases in them?

    • Shri

      I think it’s more useful to think of all data as biased by the manner in which it’s collected, and correcting for that bias is something we’re trying to do much of the time, like how drug trials apply randomization and double blinding to isolate the effect of the drug. We should be thinking about the biases we want, the ones we don’t, the ones we know about, and the ones we don’t know about.

  • @slatemine

    If “data scientists aren’t skeptical,” they aren’t scientists. Scientists always have a null hypothesis and consider all the evidence before jumping to conclusions. If scientists start jumping to conclusions, what is the marketing department going to do? :-)