A traditional view of data analysis involves precision, preparation, and methodical examination of defined datasets. Philipp Janert, author of “Data Analysis with Open Source Tools,” has a somewhat different perspective. Those traditional elements are still important, but Janert also thinks simplicity, experimentation, action, and natural curiosity all shape effective data work. He expands on these ideas in the following interview.
Is data analysis inherently complicated?
Philipp Janert: I observe a tendency to do something complicated and fancy; to bring in a statistical concept and other “sophisticated” stuff. The problem is that the sophisticated stuff isn’t that easy to understand.
Why not just look at the data set? Just look at it in an editor. Maybe you’ll see something. Or, draw some graphs. Graphs don’t require any sort of formal analytical training. These simple methods can be illuminating precisely because you don’t need anything complicated, and nothing is hidden.
Why do analysts shy from simplicity?
PJ: I often perceive a great sense of insecurity in my co-workers when it comes to math. Because of that, I get the sense people are trying to almost hide behind complicated methods.
The classic case for me is that usually within the first three minutes of a conversation, people start talking about standard deviations. It’s the one concept from classical statistics that everyone has heard of. But contextually, it’s not clear what “standard deviation” really means. Are they talking about what’s being measured by the standard deviation, namely the width of the distribution? Are they referring to one particular measure and how it’s being calculated? Do they mean the conclusions that can be drawn from standard deviations in the Normal case?
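As a rough illustration of that ambiguity, the sketch below uses a small made-up, deliberately skewed data set (the values are hypothetical, not from the interview). It shows that "the" standard deviation depends on which formula is used, and that the familiar Normal-case rule of thumb (about 68% of values within one standard deviation of the mean) need not hold for other data.

```python
import statistics

# Hypothetical, deliberately skewed sample (for example, response times in seconds).
data = [1, 1, 1, 2, 2, 3, 4, 20]

# Two common calculations that both get called "the standard deviation".
population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

# Fraction of points within one (sample) standard deviation of the mean.
# For Normal data this would be about 68%; nothing guarantees that here.
mean = statistics.fmean(data)
within_one_sd = sum(abs(x - mean) <= sample_sd for x in data) / len(data)

print(f"population sd: {population_sd:.2f}")
print(f"sample sd:     {sample_sd:.2f}")
print(f"share within one sd of the mean: {within_one_sd:.1%}")
```

The two numbers differ, and a single outlier dominates both, which is easier to notice by looking at the raw values than by quoting either figure.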
We need to keep it simple and not get sucked into abstract concepts that may or may not be fully understood.
What tool or method offers the best starting point for data analysis?
PJ: Start by plotting the data set. Plot all of the data points and look at them. Don’t try to calculate indicator quantities or summary statistics. Just look at what you see in the plot. Almost anything worthwhile can be seen in a good graph.
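A minimal sketch of "plot every point and look", assuming nothing beyond the Python standard library. The `strip_plot` helper and the `readings` values are hypothetical stand-ins; any plotting tool would serve the same purpose.

```python
def strip_plot(values, width=60):
    """Return one text row per data point, with a '*' at its scaled position."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    rows = []
    for v in values:
        col = round((v - lo) / span * (width - 1))
        rows.append(f"{v:>8} |" + " " * col + "*")
    return rows

# Hypothetical measurements, in the order they were recorded.
readings = [12, 14, 13, 15, 14, 41, 13, 12, 16, 15]

for row in strip_plot(readings):
    print(row)
```

Every point is drawn, no summary statistic is computed, and the one unusual reading stands out immediately.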
Is there a defined career path for people who want to become data scientists?
PJ: The stunning development over the 12 months I was writing this book is that “big data” became the thing that’s on everybody’s mind. All of a sudden, people are really concerned about very large datasets. Of course, this seems to be mostly driven by the social networking phenomenon. But the question is: What do we do with that data?
I know that for my purposes, I never need big data. When I ask people what they do with big data, I’ve found that it’s not what I would call “analysis” at all, because it does not involve the development of conceptual models. It does not involve the inductive/deductive cycle of scientific reasoning.
It falls into one of two camps. The first is reporting. For instance, if a company is being paid based on the number of pages they serve, then counting the number of served pages is important. The resulting log files tend to be huge, so that’s technically big data. But it’s a very straightforward counting and reporting game.
The other camp is what I consider “generalized search.” These are scenarios like: If User A likes movies B, C, and D, what other specific movie might User A want? That’s a form of searching because you’re not actually trying to create a conceptual model of user behavior. You’re comparing individual data points; you’re trying to find the movie that has the greatest similarity to a very specific other set of predefined movies. For this kind of generalized, exhaustive search, you need a lot of data because you look for the individual data points. But that’s not really analysis as I understand it, either.
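To make the "generalized search" idea concrete, here is a minimal sketch under stated assumptions: the `likes` table is invented, and Jaccard overlap is just one possible similarity measure, not one named in the interview. The code builds no model of user behavior; it exhaustively compares individual data points and returns the closest match.

```python
def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical data: movie -> set of users who liked it.
likes = {
    "B": {"u1", "u2", "u3"},
    "C": {"u1", "u2"},
    "D": {"u2", "u3"},
    "E": {"u1", "u2", "u3", "u4"},
    "F": {"u4", "u5"},
}

def recommend(liked_movies, likes):
    """Return the unseen movie most similar, on average, to the liked ones."""
    candidates = [m for m in likes if m not in liked_movies]

    def score(movie):
        total = sum(jaccard(likes[movie], likes[m]) for m in liked_movies)
        return total / len(liked_movies)

    return max(candidates, key=score)

# User A likes movies B, C, and D; search the rest for the nearest match.
print(recommend(["B", "C", "D"], likes))
```

The result gets better only as more individual data points are available to compare against, which is why this style of work calls for large data sets.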
So coming back to your original question (is there a path to becoming a “data scientist”?), we need to first find out what data science might be. It will encompass different things: the kind of big data I mentioned; reporting and business intelligence; hopefully the kind of conceptual modeling that I do. But depending on what you’re trying to accomplish, you could require very different skills.
For what I do (and this is really the only data analysis I can speak about with any sense of confidence), the most important skill is curiosity. This sounds a little tacky, but I mean it. Are you curious why the grass is green? Are you curious why the sky is blue? I’m talking about questions of this sort. These are representative of the inquisitive mind of a scientist. If you have that, you’re in good shape and you can start anywhere.
Besides curiosity, are there other traits or skills that benefit data analysts?
PJ: You need experience with empirical work. And by that I mean being someone who looks at the “idiot lights” on a router to make sure the cable is plugged in before they troubleshoot. We’ve all been in the situation where you reinstall the IP stack because you can’t get network connectivity, and only later realize the router wasn’t plugged in. These failures of empirical work are critical because empirical skills can be learned.
It’s also nice, but not essential, to have taken a college math class and retained a bit.
You should learn a programming language as well because you need to know how to manipulate data on your own. Any of the current scripting languages will do.
The last thing is that you need to actually do the work. Find a dataset that you’re interested in and work on it. It doesn’t have to be fancy, but you have to get started. You can’t just sit there and expect it to happen. Experience and practice are really important.
It sounds like the “just start” mindset you find in the Maker/DIY community also applies to data. Is that right?
PJ: I don’t know about other people, but I do this because it’s fun. And that’s a similar mentality to the Make space. They’re more about creating something as opposed to understanding something, but the mentality is very much the same.
It’s about curiosity followed by action. You look at the dataset and then go deeper to discover something. And this process isn’t defined by tools. Personally, I’m interested in what somebody’s trying to find rather than if they’re using all the right statistical methods.