A data scientist and a former Apple engineer, Pete Warden (@petewarden) is now the CTO of the new travel photography startup Jetpac. Warden will be a keynote speaker at the upcoming Strata Conference, where he’ll explain why we should rethink our approach to data. Specifically, rather than pursue the perfection of structured information, Warden says we should instead embrace the chaos of unstructured data. He expands on that idea in the following interview.
What do you mean by asking data scientists to embrace the chaos of data?
Pete Warden: The heart of data science is designing instruments to turn signals from the real world into actionable information. Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all.
Are we wasting time trying to make unstructured data structured?
Pete Warden: Structured data is always better than unstructured, when you can get it. The trouble is that you can’t get it. Most structured data is the result of years of effort, so it is only available with a lot of strings attached, either financial or in the form of usage restrictions.
The first advantage of unstructured data is that it’s widely available because the producers don’t see much value in it. The second advantage is that because there’s no “structuring” work required, there’s usually a lot more of it, so you get much broader coverage.
A good comparison is Yahoo’s highly structured web directory versus Google’s search index built on unstructured HTML soup. If you were looking for something that was covered by Yahoo, its listing was almost always superior, but there were so many possible searches that Google’s broad coverage made it more useful. For example, I hear that 30% of search queries are “once in history” events: unique combinations of terms that never occur again.
Dealing with unstructured data puts the burden on the consuming application instead of the publisher of the information, so it’s harder to get started, but the potential rewards are much greater.
How do you see data tools developing over the next few years? Will they become more accessible to more people?
Pete Warden: One of the key trends is the emergence of open-source projects that deal with common patterns of unstructured input data. This is important because it allows one team to solve an unstructured-to-structured conversion problem once, and then the entire world can benefit from the same solution. For example, turning street addresses into latitude/longitude positions is a tough problem that involves a lot of fuzzy textual parsing, but open-source solutions are starting to emerge.
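To give a rough sense of the fuzzy textual parsing involved in the street-address example, here is a minimal sketch in Python. It is not any particular open-source geocoder: the `GAZETTEER` table, its placeholder coordinates, the `ABBREVIATIONS` map, and the `normalize` and `geocode` helpers are all hypothetical names invented for illustration, and the matching relies only on the standard library's `difflib.get_close_matches`.

```python
import difflib
import re

# Hypothetical gazetteer: normalized address -> (latitude, longitude).
# The entries and coordinates are made-up placeholders, not real data.
GAZETTEER = {
    "100 main street springfield": (10.0, 20.0),
    "200 oak avenue springfield": (10.5, 20.5),
    "300 pine road shelbyville": (11.0, 21.0),
}

# A few common abbreviations; a real system would need far more.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(address):
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def geocode(address, cutoff=0.8):
    """Return (lat, lon) for the closest gazetteer entry, or None."""
    matches = difflib.get_close_matches(
        normalize(address), list(GAZETTEER), n=1, cutoff=cutoff
    )
    return GAZETTEER[matches[0]] if matches else None


if __name__ == "__main__":
    # Messy input: abbreviation, punctuation, and a misspelling.
    print(geocode("100 Main St., Springfield"))
    print(geocode("200 Oak Avnue, Springfield"))
```

The `cutoff` parameter is where the "error and uncertainty" trade-off shows up: lowering it resolves more messy inputs but raises the chance of matching the wrong address, while a value of `None` is returned when nothing is close enough.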
Associated photo on home and category pages: “mess with graphviz” by Toms Bauģis, on Flickr