We talk a lot about the ways in which data science affects various businesses, organizations, and professions, but how are we actually preparing future data scientists? What training, if any, do university students get in this area? The answer may be obvious if students focus on math, statistics or hard science majors, but what about other disciplines?
I recently spoke with Drew Conway (@drewconway) about data science and academia, particularly with regard to the social sciences. Conway, a PhD candidate in political science at New York University, will expand on some of these topics during a session at next month’s Strata Conference in New York.
Our interview follows.
How has the work of academia — particularly political science — been affected by technology, open data, and open source?
Drew Conway: There are fundamentally two separate questions in here, so I will try to address both of them. First is the question of how academic research has changed as a result of these technologies. And for my part, I can only really speak for how they have affected social science research. The open data movement has impacted research most notably in compressing the amount of time a researcher goes from the moment of inception (“hmm, that would be interesting to look at!”) to actually looking at data and searching for interesting patterns. This is especially true of the open data movement happening at the local, state and federal government levels.
Only a few years ago, the task of identifying, collecting, and normalizing these data would have taken months, if not years. This meant that a researcher could have spent all of that time and effort only to find out that their hypothesis was wrong and that — in fact — there was nothing to be found in a given dataset. The richness of data made available through open data allows for a much more rapid research cycle, and hopefully a greater breadth of topics being researched.
Open source has also had a tremendous impact on how academics do research. First, open source tools for performing statistical analysis, such as R and Python, have robust communities around them. Academics can develop and share code within their niche research area, and as a result the entire community benefits from their effort. Moreover, the philosophy of open source has started to enter into the framework of research. That is, academics are becoming much more open to the idea of sharing data and code at early stages of a research project. Also, many journals in the social sciences are now requiring that authors provide replication code and data.
The second piece of the question is how these technologies affect the dissemination of research. In this case, blogs have become the de facto source for early access to new research and scientific debate. In my own discipline, The Monkey Cage is most political scientists’ first source for new research. What is fantastic about the Monkey Cage, and other academic blogs, is that they are not only read by other academics. Journalists, policy makers, and engaged citizens can also interact with academics in this way — something that was not possible before these academic blogs became mainstream.
Let’s sidestep the history of the discipline and debates about what constitutes a hard or soft science. But as its name suggests, “political science” has long been interested in models, statistics, quantifiable data and so on. Has the discipline been affected by the rise of data science and big data?
Drew Conway: The impact of big data has been slow, but there are a few champions who are doing really interesting work. Political science, at its core, is most interested in understanding how people collectively make decisions, and as researchers we attempt to build models and collect data to that end. As such, the massive data on social interactions being generated by social media services like Facebook and Twitter present unprecedented opportunities for research.
While some academics have been able to leverage this data for interesting work, there seems to be a clash between these services’ terms of service and the desire of scientists to collect data and generate reproducible findings from it. I wrote about my own experience using Twitter data for research, but many other researchers from all disciplines have run into similar problems.
With respect to how academics have been impacted by data science, I think the impact has mostly flowed in the other direction. One major component of data science is the ability to extract insight from data using tools from math, statistics and computer science. Most of this is informed by the work of academics, and not the other way around. That said, as more academic researchers become interested in examining large-scale datasets (on the order of Twitter or Facebook), many of the technical skills of data science will have to be acquired by academics.
How does data science change the work of the grad student — in terms of necessary skills but also in terms of access to information/informants?
Drew Conway: Unfortunately, sophisticated technical skills, i.e., those of a data scientist, are still undervalued in academia. Being involved in open-source projects or producing statistical software is not something that will help a graduate student land a high-profile academic job, or help a young faculty member get tenure. Publications are still the currency of success, and that — as I mentioned — clashes with the data-sharing policies of many large social media services.
Graduate students and faculty do themselves a disservice by not actively staying technically relevant. As so much more data gets pushed into the open, I believe basic data hacking skills — scraping, cleaning, and visualization — will be prerequisites to any academic research project. But, then again, I’ve always been a weird academic, double majoring in computer science and political science as an undergrad.
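To make the "cleaning" skill Conway mentions concrete, here is a minimal Python sketch of the kind of normalization an open-data export often needs. The sample data, field names, and cleaning rules are all hypothetical, invented for illustration; real portal exports will differ.

```python
import csv
import io

# Hypothetical raw export from an open-data portal: stray whitespace,
# mixed casing, and blank rows are typical artifacts of such files.
raw = """agency, complaints ,year
 Parks , 42,2010

TRANSIT,17 ,2010
parks,35,2011
"""

def clean_rows(text):
    """Normalize a messy CSV export: trim whitespace, lowercase the
    categorical field, cast numeric fields, and drop blank rows."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    rows = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines in the export
        record = dict(zip(header, (cell.strip() for cell in row)))
        record["agency"] = record["agency"].lower()
        record["complaints"] = int(record["complaints"])
        record["year"] = int(record["year"])
        rows.append(record)
    return rows

rows = clean_rows(raw)
```

After cleaning, "Parks" and "parks" collapse into one category and the counts are proper integers, so the records are ready for aggregation or plotting.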
How does the rise of data science and its spread beyond the realm of math and statistics change the world of technology, either from an academic or entrepreneurial perspective?
Drew Conway: From an entrepreneurial perspective I think it has dramatically changed the way new businesses think about building a team. Whether it is at Strata, or any of the other conferences in the same vein, you will see a glut of job openings or panels on how to “build a data team.” At present, people who have the blend of skills I associate with data science — hacking, math/stats, and substantive expertise — are a rare commodity. This dearth of talent, however, will be short-lived.
I see in my undergrads many more students who grew up with data and computing as ubiquitous parts of their lives. They’re interested in pursuing routes of study that provide them with data science skills, both in terms of technical competence, and also in creative outlets such as interactive design.
How does “human subjects compliance” work when you’re talking about “data” versus “people” — that’s an odd distinction, of course, and an inaccurate one at that. But I’m curious if some of the rules and regulations that govern research on humans account for research on humans’ data.
Drew Conway: I think it is an excellent question, and one that academe is still struggling to deal with. In some sense, mining social data that is freely available on the Internet provides researchers a way to sidestep traditional IRB regulation. I don’t think there’s anything ethically questionable about recording observations that are freely made public. That’s akin to observing the meanderings of people in a park.
Where things get interesting is when researchers use crowdsourcing technology, like Mechanical Turk, as a survey mechanism. This is much more of a gray area. I suppose, technically, Amazon's terms of service cover researchers, but ethically this is something that would seem to me to fall within the scope of an IRB. Unfortunately, the likely outcome is that institutions won’t attempt to understand the difference until some problem arises.
This interview was edited and condensed.