Why? Why? Why!

A lesson for data science teams

By Dean Malmgren and Mike Stringer

The other day we had a conversation with a bespectacled senior data scientist at another organization, whom we'll call X to protect the innocent. The conversation went something like this:

[Image: facepalm]

Many of us have had similar conversations with people like X, and many of us have even been X before. Data scientists, being curious individuals, enjoy working on problems because they are interesting, fun, or technically challenging, or because their boss heard about “big data” in the Wall Street Journal. These reasons are all distinctly different from trying to solve an important problem.

Focusing on important problems can be daunting for data scientists, because some important problems don’t actually need a data scientist to solve them. It is increasingly the case, however, that data can be an extraordinarily valuable resource for solving age-old, time-tested business problems in innovative ways. Operations? Product Development? Strategy? Human Resources? Chances are that data that exist now, or that you can collect, can help change your organization or drive an exciting new product.

To tap this increasingly abundant “natural” resource, however, a data science team must:

  1. learn from business domain experts about real problems,
  2. think creatively about if and how data can be used as part of a solution, and
  3. focus on problems whose solutions actually improve the business.

Taking these steps in any other order is a recipe for disillusionment about big data’s true potential. Starting with a real problem, rather than with some interesting dataset, often leads data scientists down a completely different and much more fruitful path.

Case in point

As an example from our work at Datascope Analytics, in 2010, Brian Uzzi introduced us to Daegis, a leading e-discovery services provider. Our initial conversations centered on social network analysis and how we could use connections between people to further Daegis’ business. Daegis’ clients store tons of email records, and we could have made some sexy network diagrams or any number of other exciting things. It would have been interesting! It would have been fun! It would have been, well, mostly worthless had we not asked one important question first:

[Image: Datascope conversation 1]

This is not necessarily a social network analysis problem. This is a classification problem where the goal is to accurately identify the small set of documents that are relevant to a lawsuit.

So we focused the first phase of our project with Daegis on building a quick prototype, using data from the Text Retrieval Conference (TREC), to demonstrate that our transductive learning algorithms could reduce the number of documents that needed to be reviewed by 80-99%. This was huge! We were going to help Daegis gain a tremendous advantage and Daegis’ clients would be able to defend themselves against frivolous lawsuits. +1 for the good guys.
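To make the idea of "classify first, then review less" concrete, here is a minimal sketch in Python. It is not the transductive learning approach described above, and it does not use TREC data. It swaps in a much simpler supervised technique (naive-Bayes-style log-odds word weights) and a handful of made-up documents, purely for illustration. The function names (`train`, `score`, `rank_for_review`) and the toy emails are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Learn one log-odds weight per word from (text, is_relevant) pairs.

    Positive weights suggest relevance; negative weights suggest irrelevance.
    Add-one smoothing keeps unseen word/label combinations from blowing up.
    """
    counts = {True: Counter(), False: Counter()}
    for text, is_relevant in labeled_docs:
        counts[is_relevant].update(tokenize(text))

    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(c.values()) for label, c in counts.items()}

    weights = {}
    for word in vocab:
        p_rel = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_irr = (counts[False][word] + 1) / (totals[False] + len(vocab))
        weights[word] = math.log(p_rel / p_irr)
    return weights


def score(weights, text):
    """Sum the weights of known words; unknown words contribute nothing."""
    return sum(weights.get(word, 0.0) for word in tokenize(text))


def rank_for_review(weights, unlabeled_docs):
    """Order documents so the most likely relevant ones are reviewed first."""
    return sorted(unlabeled_docs, key=lambda doc: score(weights, doc), reverse=True)


# Hypothetical documents that an attorney has already reviewed and labeled.
labeled = [
    ("contract breach payment dispute", True),
    ("breach of supply contract terms", True),
    ("lunch menu for friday", False),
    ("office party friday schedule", False),
]

# Hypothetical documents that nobody has reviewed yet.
unlabeled = [
    "friday lunch schedule",
    "payment terms in the contract",
    "parking garage closed",
]

weights = train(labeled)
ranked = rank_for_review(weights, unlabeled)
for doc in ranked:
    print(round(score(weights, doc), 2), doc)
```

In this toy run the email about contract payment terms rises to the top of the review queue and the lunch schedule sinks to the bottom. A real system would then stop reviewing once the remaining documents are unlikely enough to be relevant, which is where the savings in review effort would come from.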

[Image: Datascope conversation 2]

After several design iterations (see our Strata presentation or slides if you’re interested), we arrived at some insights: what we developed needed to be educational, transparent, and understandable. By the end, if you had to summarize the project, it would be closer to “educating attorneys about information retrieval” than “social network analysis.” The final result is a product that Daegis sells under the name Acumen (subtle hint for attorneys out there: you should use it!).

The take-home message

This case illustrates a lesson for data scientists: ask why first!

[Image: "why" cartoon]

Be ready. The answers to this deceptively simple question may surprise you, take you into challenging uncharted territory, and inspire you to think about problems in completely different ways.
