More data was transmitted over the Internet in 2010 than in all other years combined. That’s one reason why this year’s Web 2.0 Summit used the “data frame” to explore the new landscape of digital business — from mobile to social to location to government.
Microsoft is part of this conversation about big data, given the immense resources and technical talent the Redmond-based software giant continues to hold. During Web 2.0 Summit, I interviewed Microsoft Fellow David Campbell about his big data work and thinking. A video of our interview is below, with key excerpts added afterward.
What’s Microsoft’s role in the present and future of big data?
David Campbell: I’ve been a data geek for 25-plus years. If you go back five to seven years, it was kind of hard to get some of the younger kids to see the data space as an interesting place to solve problems. Databases are kind of boring stuff, but the data space is amazingly exciting right now.
It’s a neat thing to have one part of the company that’s processing petabytes of data on tens or hundreds of thousands of servers and another part that’s a commercial business. In the last couple of years, what’s been interesting is to see them come together, with techniques that scale even on the commercial side. That’s the cool part about it, and the cool part of being at Microsoft now.
What’s happening now seems like it wasn’t technically possible a few years ago. Is that the case?
David Campbell: Yes, for a variety of reasons. If you think about the cost just to acquire data, you can still pay people to type stuff in. It’s roughly $1 per kilobyte. But go back 25 or 30 years, and virtually all of the data we were working with had come off human fingertips. Now it’s just out there. Even inherently analog things like phone calls and pictures are simply born digital. To store it, we’ve gone from $1,000 per megabyte 25 years ago to $40 per terabyte for raw storage. That’s an incredible shift.
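To put the cost shift Campbell cites into perspective, here is a quick back-of-the-envelope sketch. The dollar figures are the ones he quotes; the decimal byte conversions are an approximation for readability:

```python
# Back-of-the-envelope check of the storage-cost shift Campbell describes:
# ~$1,000 per megabyte 25 years earlier vs. ~$40 per terabyte for raw storage.
cost_then_per_byte = 1000 / 1e6   # $1,000 for 1 MB, in dollars per byte
cost_now_per_byte = 40 / 1e12     # $40 for 1 TB, in dollars per byte

ratio = cost_then_per_byte / cost_now_per_byte
print(f"Raw storage is roughly {ratio:,.0f}x cheaper per byte")
# → roughly a 25,000,000-fold drop in cost per byte
```

A twenty-five-million-fold drop per byte is what turns "get rid of data as soon as possible" into "keep everything," the shift Campbell returns to later in the interview.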
How is Microsoft working with data startups?
David Campbell: The interesting thing about the data space is that we’re talking about a lot of people with machine learning experience. They know a particular domain, but it’s really hard for them to find customers. Let’s say they’ve got an algorithm or a model that might be relevant to 5,000 people. Going out and finding those 5,000 people is the hard part.
We built this thing a couple of years ago called the DataMarket. The idea is to change the route to market. So, people can take their model and place it on the DataMarket and then others can go find it.
Here’s the example I use inside the company, for those old enough to remember: when people were building Visual Basic controls, it was way harder to write one than it was to consume one. The guys writing the controls didn’t have to go find the guy who was building the dentist app. They just published it in a catalog called "Programmer’s Paradise" (way back then, it was actually on paper), and the guy writing the dentist’s app would go there to find what he needed.
It’s the same sort of thing here. How do you connect those people, those data scientists, who are going to remain a rare commodity with the set of people who can make use of the models they have?
How are the tools of data science changing?
David Campbell: Tooling is going to be a big challenge and a big opportunity here. We announced a tool recently that we call the Data Explorer, which lets people discover other forms of data — some in the DataMarket, some that they have. They can mash it up, turn it around and then republish it.
One of the things we looked at when we started building the tools is that people tend to do mashups today in what I call a "last-mile tool." They might use Access or Excel or some other tool. When they were done, they could only share the result with someone else who had the same tool. The idea of the Data Explorer is to back up one step and produce something that is itself a data source, which is then consumable by a large number of last-mile tools. You can also program against the service itself to produce applications and whatnot.
How should companies collect and use data? What strategic advice would you offer?
David Campbell: From the data side, we’ve lived in what we’d call a world of scarcity. We thought that data was expensive to store, so we had to get rid of it as soon as possible. You didn’t want it unless you had a good use for it. Now we think about data from a perspective of abundance.
Part of the challenge, 10 or 15 years ago, was: where do I go get the data? Where do I tap in? But in today’s world, everything is so interconnected that it’s just a matter of teeing into it. The phrase I’ve used instead of big data is "ambient data." It’s just out there and available.
The recommendation would be to stop and think about the latent value in all that data that’s there to be collected and that’s fairly easy to store now. That’s the challenge and the opportunity for all of us.
This interview was edited and condensed.