ENTRIES TAGGED "data scientists"
Leading Indicators
In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.
Our starting point was remote: Q told me about Tyler Brulé’s travel writing for the Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on its club sandwiches, it’s probably cutting corners in other places.
Data’s missing ingredient? Rhetoric.
Arguments are the glue that connects data to decisions
Data is key to decision making. Yet we are rarely faced with a situation where things can be put into such a clear logical form that we have no choice but to accept the force of evidence before us. In practice, we should always be weighing alternatives, looking for missed possibilities, and considering what else we need to figure out before we can proceed.
Arguments are the glue that connects data to decisions. And if we want good decisions to prevail, both as decision makers and as data scientists, we need to better understand how arguments function. We need to understand the best ways that arguments and data interact. The statistical tools we learn in classrooms are not sufficient alone to deal with the messiness of practical decision-making.
Examples of this fill the headlines. You can see evidence of rigid decision-making in how the American medical establishment decides what constitutes a valid study result. By custom and regulation, there is an official statistical breaking point for all studies. Below this point, a result will be acted upon. Above, it won’t be. Cut and dried, but dangerously brittle.
How do you become a data scientist? Well, it depends
My obsession with data and user needs is now focused on the many paths toward data science.
Over Thanksgiving, Richie and Violet asked me if I preferred the iPhone or the Galaxy SIII. I have both. It is a long story. My response was, “It depends.” Richie, who would probably bleed Apple if you cut him, was very unsatisfied with my answer. Violet was more diplomatic. Yet, it does depend. It depends on what the user wants to use the device for.
I say, “It depends” a lot in my life.
Both in the personal life and the work life … well, because it really is all one life, isn’t it? With my work over the past decade or so, I have been obsessive about being user-focused. I spend a lot of time thinking about whom a product, feature, or service is for and how they will use it. Not how I want them to use it, but how they want to use it and what problem they are trying to solve with it.
Before I joined O’Reilly, I was obsessively focused on the audience for my data analysis. “C” level execs look for different kinds of insights than a director of engineering. A field sales rep looks for different insights than a software developer. Understanding more about who the user or audience was for a data project enabled me to map the insights to the user’s role, their priorities, and how they wanted to use the data. Because, you know what isn’t too great? When you spend a significant amount of time working on something that does not get used or is not what someone needed to help them in their job.
Data science in the natural sciences
Big data is shaping diverse fields, showing that past predictions from data-driven natural sciences are now coming to pass.
I find myself having conversations recently with people from increasingly diverse fields, both at Columbia and in local startups, about how their work is becoming “data-informed” or “data-driven,” and about the challenges posed by applied computational statistics or big data.
A view from health and biology in the 1990s
In discussions with, for example, New York City journalists, physicists, or even former students now working in advertising or social media analytics, I’ve been struck by how many of the technical challenges and lessons learned are reminiscent of those faced in the health and biology communities over the last 15 years, when these fields experienced their own data-driven revolutions and wrestled with many of the problems now faced by people in other fields of research or industry.
It was around then, as I was working on my PhD thesis, that sequencing technologies became sufficient to reveal the entire genomes of simple organisms and, not long thereafter, the first draft of the human genome. This advance in sequencing technologies made possible the “high throughput” quantification of, for example,
- the dynamic activity of all the genes in an organism; or
- the set of all protein-protein interactions in an organism; or even
- statistical comparative genomics revealing how small differences in genotype correlate with disease or other phenotypes.
These advances required formation of multidisciplinary collaborations, multi-departmental initiatives, advances in technologies for dealing with massive datasets, and advances in statistical and mathematical methods for making sense of copious natural data.
Now available: “Planning for Big Data”
A free handbook for anybody wanting to understand and use big data.
"Planning for Big Data" is a new book that helps you understand what big data is, why it matters, and where to get started.
Visualization of the Week: Visualizing the Strata Conference
The Information Lab visualizes the Strata Conference's attendees.
This week's visualization comes from The Information Lab and shows who was at the Strata Conference, how far they traveled, and the data their companies produce.
Strata Week: DataSift lets you mine two years of Twitter data
DataSift offers more access to the Twitter archive, and a proposal for a data school.
In this week's data news, DataSift will offer deeper access to old tweets, and P2PU and the Open Knowledge Foundation announce a School of Data.
Strata Week: Your personal automated data scientist
Wolfram releases a pro tool, protecting data during times of need, and new doubts about dating services.
Wolfram|Alpha launches a pro version of its computational knowledge engine, guidelines emerge for protecting the data of people in crisis, and researchers cast doubt on dating sites' matchmaking algorithms.
Embracing the chaos of data
Pete Warden on the upside of unstructured data.
Data scientists, it's time to welcome errors and uncertainty into your data projects. In this interview, Jetpac CTO Pete Warden discusses the advantages of unstructured data.
Strata Week: The looming data science talent shortage
EMC study looks at the state of data science, Carrier IQ and big data, and the welcome return of old tweets.
In this week's data news: EMC's new data science study predicts a data scientist shortage, why Carrier IQ is part of a "bizarre big-data triangle," and DataSift will soon offer access to an archive of old tweets.






