I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!
That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.
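The two-headed-arrow point is easy to demonstrate numerically. Here's a minimal sketch using NumPy; the variables and numbers are invented for illustration:

```python
import numpy as np

# Illustrative series (the variables and values are made up):
# weekly hours of exercise and resting heart rate for five people.
exercise = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
heart_rate = np.array([72.0, 70.0, 65.0, 63.0, 60.0])

# Pearson correlation is symmetric: corr(x, y) == corr(y, x).
# It tells you the variables move together, but not which one
# (if either) is doing the causing.
r_xy = np.corrcoef(exercise, heart_rate)[0, 1]
r_yx = np.corrcoef(heart_rate, exercise)[0, 1]
```

Swap the two variables and you get exactly the same number; nothing in the statistic points the arrow one way or the other.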
In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.
Our starting point was remote: Q told me about Tyler Brûlé’s travel writing for the Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than Hotel B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on its club sandwiches, it’s probably cutting corners in other places.
If data scientists aren't skeptical about how they use and analyze data, who will be?
A couple of months ago, I wrote that “big data” is heading toward the trough of a hype curve as a result of oversized hype and promises. That’s certainly true. I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.
But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.
The biggest problems will almost always be those for which the size of the data is part of the problem.
A recent VentureBeat article argues that “Big Data” is dead. It’s been killed by marketers. That’s an understandable frustration (and a little ironic to read about it in that particular venue). As I said sarcastically the other day, “Put your Big Data in the Cloud with a Hadoop.”
You don’t have to read much industry news to get the sense that “big data” is sliding into the trough of Gartner’s hype cycle. That’s natural. Regardless of the technology, the trough is fed by a familiar set of causes: over-aggressive marketing, the longing for a silver bullet that doesn’t exist, and the desire to spout the newest buzzwords. All of these phenomena breed cynicism. Perhaps the most dangerous is the technologist who never understands the limitations of data, never understands what data isn’t telling you, or never understands that if you ask the wrong questions, you’ll certainly get the wrong answers.
Big data is not a term I’m particularly fond of. It’s just data, regardless of the size. But I do like Roger Magoulas’ definition of “big data”: big data is when the size of the data becomes part of the problem. I like that definition because it scales. It was meaningful in 1960, when “big data” was a couple of megabytes. It will be meaningful in 2030, when we all have petabyte laptops, or eyeglasses connected directly to Google’s yottabyte cloud. It’s not convenient for marketing, I admit; today’s “Big Data!!! With Hadoop And Other Essential Nutrients Added” is tomorrow’s “not so big data, small data actually.” Marketing, for better or for worse, will deal.
Michael Flowers on how New York City uses its data and Sinan Aral on the nature of online influence.
A few weeks ago, I attended DataGotham, the first conference celebrating the New York data community. It was a great short conference: good people, great speakers, great party at the Tribeca Rooftop. (Guess what? The CIA is hiring. Or so we’re told.)
My two favorite speakers were Michael Flowers and Sinan Aral. It was great to hear a New Yorker who really sounded like a New Yorker, and who clearly knew the streets. Mike directs New York’s Policy and Strategic Planning Analytics team, and talked about using data to optimize New York’s operations. A theme that I’ve seen repeatedly is that many organizations have lots of data that they don’t know how to use. In many cases, they don’t even know that the data is valuable. Mike talked about how New York City is putting that data to use. For example, his team is using tax records to optimize building inspections. Buildings on which taxes are owed are much more likely to have a fire, and it’s much, much more likely that a firefighter will be injured in one of those fires. So once you know where the tax problems are, you’ve found the most dangerous buildings, and can prioritize your inspections.
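The prioritization Flowers describes boils down to a risk-ranking pass over records the city already had. Here's a hypothetical sketch; the field names, IDs, and single-signal scoring rule are invented for illustration, and the city's actual model surely combines many more signals:

```python
# Hypothetical building records: (building_id, taxes_owed_in_dollars).
buildings = [
    ("bldg-001", 0.0),
    ("bldg-002", 12500.0),
    ("bldg-003", 430.0),
    ("bldg-004", 98000.0),
]

# Rank buildings by a crude risk proxy (taxes owed), so inspectors
# visit the likeliest trouble spots first. This shows only the idea
# of turning existing records into an inspection queue.
inspection_queue = sorted(buildings, key=lambda b: b[1], reverse=True)
```

The point isn't the sorting; it's that a signal collected for one purpose (tax collection) turns out to predict something else entirely (fire risk).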
Sinan’s talk was the proverbial “drinking from a firehose”: fast and furious, with more insight packed into 20 minutes than most people can get into a full day. His research is on the nature of online influence, and started with the idea that Ashton Kutcher has millions of Twitter followers, but if he told his followers to do something, very few of them would actually do it. Twitter followers and Facebook friends are self-selecting, and are likely to self-organize around similar behaviors. If somebody tells you to do something that you were already likely to do, is that influence? With careful analysis on Facebook’s huge dataset, Sinan has been able to tease out the real influence relationships.
If you missed out in September, you can attend the DataGotham Reprise that’s part of NYC Data Week. All of the events in Data Week are free and open to the public. If you can’t make the DataGotham Reprise, you can watch all the talks on their YouTube channel. And if you like that, Mike Flowers will be keynoting at O’Reilly’s Strata Conference + Hadoop World in New York, October 23-25.
A call for data scientists, technologists, health professionals, and business leaders to convene.
We are launching a conference at the intersection of health, health care, and data. Why?
Our health care system is in crisis. We are experiencing epidemic levels of obesity, diabetes, and other preventable conditions while at the same time our health care system costs are spiraling higher. Most of us have experienced increasing health care costs in our businesses or have seen our personal share of insurance premiums rise rapidly. Worse, we may be living with a chronic or life-threatening disease while struggling to obtain effective therapies and interventions — finding ourselves lumped in with “average patients” instead of receiving effective care designed to work for our specific situation.
In short, particularly in the United States, we are paying too much for too much care of the wrong kind and getting poor results. All the while our diet and lifestyle failures are demanding even more from the system. In the past few decades we’ve dropped from the world’s best health care system to the 37th, and we seem likely to drop further if things don’t change.
The very public fight over the Affordable Care Act (ACA) has brought this to the fore, but the situation has been brewing for a long time. With the ACA’s arrival, rising costs and poor outcomes become, at least in part, the responsibility of the federal government. The fiscal outlook for that responsibility doesn’t look good, and solving this crisis is no longer optional; it’s urgent.
There are many reasons for the crisis, and there’s no silver bullet. Health and health care live at the confluence of diet and exercise norms, destructive business incentives, antiquated care models, and a system that has severe learning disabilities. We aren’t preventing the preventable; once we’re sick, we pay for procedures and tests rather than results, and because those interventions were designed for some non-existent average patient, much of that spending is wasted. Afterward, we mostly ignore the data that could help the system learn and adapt.
It’s all too easy to be gloomy about the outlook for health and health care, but this is also a moment of great opportunity. We face this crisis armed with vast new data sources, the emerging tools and techniques to analyze them, an ACA policy framework that emphasizes outcomes over procedures, and a growing recognition that these are problems worth solving.
In a world of big, open data, "privacy by design" will become even more important.
A few weeks ago, Tom Slee published “Seeing Like a Geek,” a thoughtful article on the dark side of open data. He starts with the story of a Dalit community in India, whose land was transferred to a group of higher-caste Mudaliars through bureaucratic manipulation under the guise of standardizing and digitizing property records. While this sounds like a good idea, it gave a wealthier, more powerful group a chance to erase older, traditional records that hadn’t been properly codified. One effect of passing laws requiring standardized, digital data is to marginalize all data that can’t be standardized or digitized, and to marginalize the people who don’t control the process of standardization.
That’s a serious problem. It’s sad to see oppression and property theft riding in under the guise of transparency and openness. But the problem isn’t open data itself; it’s how the data is used.
How to think about choosing a database.
A relational database is no longer the default choice. Mike Loukides charts the rise of the NoSQL movement and explains how to choose the right database for your application.
Trying to crack a tough data problem? Submit it to the KDD Cup challenge.
Organizers of this year's KDD Cup data mining challenge are looking for data problems in areas such as medicine, education, the environment, or anything that leads to a social good. Submissions are due by November 15.
Oracle's NoSQL Database is more than a product. It's also an acknowledgement.
Oracle's announcement of a NoSQL product isn't just a validation of key-value stores, but of the entire discussion of database architecture.