The world is experiencing an unprecedented data deluge, a reality that my colleague Edd Dumbill described as another “industrial revolution” at February’s Strata Conference. Many sectors of the global economy are waking up to the need to use data as a strategic resource, whether in media, medicine, or moving trucks. Open data has been a major focus of Gov 2.0, as federal and state governments move forward with creating new online platforms for open government data.
The explosion of data requires new tools and management strategies. These new approaches include more than technical evolution, as a recent conversation with Charlie Quinn, director of data integration technologies at the Benaroya Research Institute, revealed: they involve cultural changes that create greater value by sharing data between institutions. In Quinn's field, genomics, big data is far from a buzzword, with sequencing scans now measured on the terabyte scale.
In the interview below, Quinn shares insights about applying open source to data management and combining public data with experimental data. You can hear more about open data and open source in advancing personalized medicine from Quinn at the upcoming OSCON Conference.
How did you become involved in data science?
Charlie Quinn: I got into the field through a friend of mine. I had been doing data mining for credit card fraud, and the principal investigator I work with now was moving to Texas. We had a novel idea: to build tools for researchers, we should hire software people. What had happened in the past was that you had bioinformaticians writing scripts. The programs they found did about 80% of what they wanted, and they had a hard time getting the last 20%. So we had a talk way back when, saying, "If you really want proper software tools, you ought to hire software people to build them for you." He called me up to come on down and take a look. I did, and the rest is history.
You’ve said that there’s a “data explosion” in genomics research. What do you mean? What does this mean for your field?
Charlie Quinn: It's like the difference between analog and digital technology. The amount of data you'd have with analog is still substantial, but as we move toward digital, it grows exponentially. With gene expression values, which is what we've been focusing on in genomics, it's about a gigabyte per scan. As we move into targeted RNA sequencing, or even high-frequency sequencing, the raw output from the sequencer is terabytes per scan. It's orders of magnitude more data.
What that means from a practical perspective is that more data is being generated than any one request calls for, and more than a single researcher could ever hope to wrap their head around. Where the data explosion becomes interesting is how we engage researchers to take the data they're generating and share it with others, so that we can reuse data and other people might find something interesting in it.
What are the tools you’re using to organize and make sense of all that data?
Charlie Quinn: A lot of it's been homegrown so far, which is a bit of an issue as you start to integrate with other organizations, because everybody seems to have their own homegrown system. There's an open source group in Seattle called LabKey, which a lot of people have started to use. We're taking another look at them to see if we might be able to use some of their technology to help us organize the backend. A lot of this is so new that it's hard to keep up, and quite often we're outpacing it. It's a question of building homegrown tools and integrating with other applications as we can.
How does open source relate to that work?
Charlie Quinn: We try and use open source as much as we can. We try and contribute back where we can. We haven’t been contributing back anywhere near as much as we’d like to, but we’re going to try and get into that more.
We’re huge proponents not only of open source, but of open data. What we’ve been doing is going around and trying to convince people that we understand they have to keep data private up to a certain point, but let’s try and release as much data as we can as early as we can.
When we go back to talking about the explosion of data, if we’re looking at Gene X and we happen to see something that might be interesting on Y or Z, we can post a quick discovery note or a short blurb. In that way, you’re trying to push ideas out and take the data behind those ideas and make it public. That’s where I think we’re going to get traction: trying to share data earlier rather than later.
At OSCON, you’ll talk about how experimental data combines with public data. When did you start folding the two together?
Charlie Quinn: We've been playing with it for a while. What we're hoping to do is make more of it public, now that we're getting the institutional support for it. Years ago, we indexed all of the abstracts in PubMed by gene, so that when people went to a text search engine, you could type in your query and get back a list of genes, as opposed to a list of articles. That helped researchers find what they were looking for, and that's just leveraging openly available data. Now, with NIH's mandate for more people to publish their results back into repositories, we're downloading that data and combining it with the data we have internally. As we go across a project or across a disease trying to find how a gene or a protein is acting, it gives us a bigger dataset to work with.
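The gene indexing Quinn describes can be thought of as a small inverted index: for each gene symbol, record which abstracts mention it, then answer a text query with a list of genes rather than articles. A minimal sketch of that idea follows; the abstracts, gene symbols, and function names here are hypothetical illustrations, not Quinn's actual system.

```python
from collections import Counter, defaultdict

# Hypothetical abstracts keyed by article ID; in practice these would be
# pulled from a public repository such as PubMed.
abstracts = {
    "A1": "Expression of TNF and IL6 rises in inflamed tissue.",
    "A2": "IL6 signaling modulates the immune response.",
    "A3": "TNF inhibitors reduce joint inflammation.",
}

# Hypothetical set of gene symbols to index on.
gene_symbols = {"TNF", "IL6", "FOXP3"}

def build_gene_index(abstracts, genes):
    """Map each gene symbol to the set of article IDs that mention it."""
    index = defaultdict(set)
    for article_id, text in abstracts.items():
        tokens = {tok.strip(".,;()") for tok in text.split()}
        for gene in genes & tokens:
            index[gene].add(article_id)
    return index

def search_genes(abstracts, gene_index, query):
    """Return genes ranked by how many matching abstracts mention them."""
    q = query.lower()
    hits = {aid for aid, text in abstracts.items() if q in text.lower()}
    counts = Counter(
        {gene: len(aids & hits) for gene, aids in gene_index.items()}
    )
    return [gene for gene, n in counts.most_common() if n > 0]

index = build_gene_index(abstracts, gene_symbols)
print(search_genes(abstracts, index, "inflammation"))  # genes, not articles
```

A real implementation would need synonym handling (gene symbols are notoriously ambiguous) and a proper tokenizer, but the shape of the trick is the same: aggregate article-level matches up to the gene level before presenting results.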
What are some of the challenges you’ve encountered in your work?
Charlie Quinn: The issues we've had are with the quality of the datasets in the public repositories. You need to hire a curator to validate whether the data is usable, and to make sure it's comparable to the data we want to combine it with.
What’s the future of open data in research and personalized medicine?
Charlie Quinn: We're going to be seeing multiple tiers of data sharing. In the long run, you're going to have very well curated public repositories of data. We're a fair ways away from that in reality, because there's still a lot of inertia against it within the research community. The half-step to get there will be large project consortia where we start sharing data inter-institutionally. As people get more comfortable with that, we'll be able to open it up to a wider audience.
This interview was edited and condensed.