ENTRIES TAGGED "data"
I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!
That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.
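To make the two-headed arrow concrete, here is a minimal sketch in plain Python. The data is synthetic, and the names `pearson` and `simulate` are made up for this illustration. By construction, A causes B, yet the correlation coefficient comes out the same whichever variable you list first, so the number alone can't reveal the direction.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n=1000, seed=0):
    """Synthetic data in which A causes B (the factor 2 is arbitrary)."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [2 * x + rng.gauss(0, 1) for x in a]  # B depends on A, not vice versa
    return a, b

a, b = simulate()
r_ab = pearson(a, b)  # "correlation of A with B"
r_ba = pearson(b, a)  # "correlation of B with A"
print(r_ab, r_ba)     # the two values agree; the causal direction is invisible
```

Collecting more rows makes the estimate tighter, but it doesn't add an arrowhead.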
Our tools should make common cases easy and safe, but that's not the reality today.
Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.
Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …
In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.
Our starting point was remote: Q told me about Tyler Brulé’s travel writing for the Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on its club sandwiches, it’s probably cutting corners in other places.
Strata Community Profile on Jon Higbie, Managing Partner and Chief Scientist of Revenue Analytics
In his role as chief scientist at Atlanta-based consulting firm Revenue Analytics, Jon Higbie helps clients make sound pricing decisions for everything from hotel rooms, to movie theater popcorn, to that carton of OJ in the fridge.
And in the ever-growing field of data science where start-ups dominate much of the conversation, the 7-year-old company has a longevity that few others can claim just yet. They’ve been around the block a few times, and count behemoth companies like Coca-Cola and IHG among their clients.
We spoke recently about how revenue and pricing strategies have changed in recent years in response to the greater transparency of the internet, and the complex data algorithms that go into creating a simple glass of orange juice.
Opportunity to share your data stories with Brett Goldstein and Q. Ethan McCallum
On Goldstein, McCallum, and their upcoming book, Making Analytics Work: Case by Case
By Alex Howard
People have been crunching numbers to understand government since the first time an official used an abacus to compare one season’s grain harvest against another. Tracking and comparing data is part of how we’ve been understanding our world for millennia. In the 21st century, organizations in all sectors are transitioning from paper records to massive databases. Instead of inscribing tablets, we’re browsing real-time data dashboards on them. Using modern data analytics to make sense of all of those numbers is now the task of scientists, journalists and, intriguingly, public officials. That’s the context in which I first encountered Brett Goldstein, when I talked with him about his work as Chicago’s chief data officer. Goldstein has been a key part of Chicago’s data-driven approach to open government since Mayor Rahm Emanuel was elected in February 2011. He and Chicago CTO John Tolva have been breaking new ground in an emerging global discussion around how cities understand, govern and regulate themselves.
I saw Goldstein share his ideas for data analytics in person at last year’s Strata Conference in New York City, where he and Q Ethan McCallum, the author of the Bad Data Handbook, talked about text mining and civic engagement. Their thinking on big data in the public sector is helping to inform other cities that want to follow in Chicago’s footsteps. Urban predictive analytics are making sense of what residents are doing, where and when, and what they want from their governments. Both men have steadily built excellent reputations, as a public servant and as a trusted authority in the field, respectively.
Twitter has hired Guardian Data editor Simon Rogers as its first data editor.
Twitter has hired its first data editor. Simon Rogers, one of the leading practitioners of data journalism in the world, will join Twitter in May. He will be moving his family from London to San Francisco and applying his skills to telling data-driven stories using tweets. James Ball will replace him as the Guardian’s new data editor.
As a data editor, will Rogers keep editing and producing something that we’ll recognize as journalism? Will his work at Twitter be different from what Google Think or Facebook Stories deliver? Different in terms of how he tells stories with data? Or is the difference that Twitter has a lot more revenue coming in or sees data-driven storytelling as core to driving more business? (Rogers wouldn’t comment on those counts.)
Probabilistic languages can free developers from the complexities of high-performance probabilistic inference.
Probabilistic programming languages are in the spotlight, thanks to the announcement of a new DARPA program to support fundamental research on them. But what is probabilistic programming? What can we expect from this research? Will this effort pay off? How long will it take?
A probabilistic programming language is a high-level language that makes it easy for a developer to define probability models and then “solve” these models automatically. These languages incorporate random events as primitives, and their runtime environments handle inference. Building a model then becomes a matter of programming, with a clean separation between modeling and inference. This can vastly reduce the time and effort associated with implementing new models and understanding data. Just as high-level programming languages transformed developer productivity by abstracting away the details of the processor and memory architecture, probabilistic languages promise to free the developer from the complexities of high-performance probabilistic inference.
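The separation between modeling and inference can be sketched in a few lines of plain Python. This is a toy illustration, not a real probabilistic programming language: `coin_model` and `infer` are hypothetical names, and the naive rejection sampler stands in for the far more sophisticated inference a real runtime would provide. The point is only that the model is an ordinary program with random draws, while the inference routine knows nothing about what is inside it.

```python
import random

def coin_model(rng):
    """Generative model: draw an unknown coin bias, then simulate 10 flips."""
    bias = rng.random()                                  # uniform prior on the bias
    heads = sum(rng.random() < bias for _ in range(10))  # number of heads in 10 flips
    return bias, heads                                   # (latent value, observable value)

def infer(model, observed, n_samples=50_000, seed=0):
    """Generic rejection sampler: run the model many times and keep only
    the runs whose simulated observation matches the real one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        latent, simulated = model(rng)
        if simulated == observed:
            accepted.append(latent)
    if not accepted:
        raise ValueError("no simulated run matched the observation")
    return sum(accepted) / len(accepted)  # posterior mean of the latent value

# We saw 8 heads in 10 flips: what bias should we believe in?
posterior_mean = infer(coin_model, observed=8)
print(posterior_mean)
```

For this particular model the answer can be checked by hand: a uniform prior and 8 heads in 10 flips give a Beta(9, 3) posterior, whose mean is 0.75, and the sampler's estimate should land close to that. Swapping in a different model requires no change to `infer`, which is the productivity argument in miniature.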
In the UC Berkeley AMPLab, we have embarked on a six-year project to build a powerful next-generation big data analytics platform: the Berkeley Data Analytics Stack (BDAS). We have already released several components of BDAS, including Spark, a fast distributed in-memory analytics engine, and in February we ran a sold-out tutorial at the Strata conference in Santa Clara teaching attendees how to use Spark and other components of BDAS.
In this blog post we will walk through four steps to getting hands-on using Spark to analyze real data. For an overview of the motivation and key components of BDAS, check out our previous Strata blog post.
Arguments are the glue that connects data to decisions
Data is key to decision making. Yet we are rarely faced with a situation where things can be put into such a clear logical form that we have no choice but to accept the force of the evidence before us. In practice, we should always be weighing alternatives, looking for missed possibilities, and considering what else we need to figure out before we can proceed.
Arguments are the glue that connects data to decisions. And if we want good decisions to prevail, both as decision makers and as data scientists, we need to better understand how arguments function. We need to understand the best ways that arguments and data interact. The statistical tools we learn in classrooms are not sufficient alone to deal with the messiness of practical decision-making.
Examples of this fill the headlines. You can see evidence of rigid decision-making in how the American medical establishment decides what constitutes a valid study result. By custom and regulation, there is an official statistical breaking point for all studies. Below this point, a result will be acted upon. Above, it won’t be. Cut and dried, but dangerously brittle.
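A small sketch shows how brittle a hard cutoff can be. The excerpt doesn't name the breaking point, so the conventional p < 0.05 threshold is assumed here, and the two z-scores are invented for illustration. The function names `two_sided_p` and `decide` are likewise hypothetical.

```python
import math

ALPHA = 0.05  # assumed cutoff; the text above does not state the actual value

def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def decide(z, alpha=ALPHA):
    """Rigid rule: act only if the p-value falls below the cutoff."""
    return "act" if two_sided_p(z) < alpha else "ignore"

# Two invented studies with almost identical evidence
study_a = 1.97
study_b = 1.95

for name, z in [("A", study_a), ("B", study_b)]:
    print(name, round(two_sided_p(z), 4), decide(z))
```

The two p-values differ by only a few thousandths, yet one study is acted upon and the other is discarded. Nothing in the arithmetic is wrong; the rule simply leaves no room for the weighing of alternatives that an argument would supply.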
If data scientists aren't skeptical about how they use and analyze data, who will be?
A couple of months ago, I wrote that “big data” is heading toward the trough of a hype curve as a result of oversized hype and promises. That’s certainly true. I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.
But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data, that is, about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.