ENTRIES TAGGED "StrataRX"
How the field of genetics is using data within research and to evaluate researchers
Editor’s note: Earlier this week, Part 1 of this article described Sage Bionetworks, a recent Congress they held, and their way of promoting data sharing through a challenge.
Data sharing is not an unfamiliar practice in genetics. Plenty of cell lines and other data stores are publicly available from such places as the TCGA data set from the National Cancer Institute, Gene Expression Omnibus (GEO), and ArrayExpress (all of which can be accessed through Synapse). So to some extent the current revolution in sharing lies not in the data itself but in critical related areas.
First, many of the data sets are weakened by metadata problems. A Sage programmer told me that the famous TCGA set is enormous but poorly curated. For instance, different data sets in TCGA may refer to the same drug by different names, generic versus brand name. Provenance–a clear description of how the data was collected and prepared for use–is also weak in TCGA.
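The brand-versus-generic naming problem can be made concrete with a short Python sketch. It assumes a small hand-written synonym table (the three pairs below are only illustrative, and the sample records are hypothetical); a real curation effort would draw on a maintained drug vocabulary rather than a dictionary in code.

```python
# Illustrative synonym table: brand name -> generic name.
# A real project would load this from a curated drug vocabulary.
BRAND_TO_GENERIC = {
    "gleevec": "imatinib",
    "herceptin": "trastuzumab",
    "tarceva": "erlotinib",
}


def canonical_drug_name(name: str) -> str:
    """Return a lowercase generic name for a brand or generic drug name."""
    key = name.strip().lower()
    # Names that are not in the table are assumed to be generic already.
    return BRAND_TO_GENERIC.get(key, key)


def group_by_drug(records):
    """Group records (dicts with a 'drug' field) under one canonical name."""
    groups = {}
    for record in records:
        groups.setdefault(canonical_drug_name(record["drug"]), []).append(record)
    return groups


# Hypothetical sample annotations, as two data sets might record them.
samples = [
    {"sample_id": "A1", "drug": "Gleevec"},
    {"sample_id": "B7", "drug": "imatinib "},
    {"sample_id": "C3", "drug": "Herceptin"},
]
grouped = group_by_drug(samples)
```

Without the normalization step, the first two samples would look like treatments with two different drugs, which is the kind of aggregation error that weak curation invites.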
In contrast, GEO records tend to contain good provenance information (see an example), but only as free-form text, which presents the same barriers to searching and aggregation as free-form text in medical records. Synapse is developing a structured format for presenting provenance based on the W3C’s PROV standard. One researcher told me this was the most promising contribution of Synapse toward the shared use of genetic information.
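To show why structure matters, here is a rough Python sketch of a provenance record built from PROV's core terms (entity, activity, agent, and the relations used, wasGeneratedBy, and wasAssociatedWith). It is not Synapse's actual format, which the article does not describe; every identifier and label below is hypothetical.

```python
# A provenance record as plain dictionaries, using PROV's core terms.
# All identifiers (e.g. "ex:normalized-expression") are hypothetical.
provenance = {
    "entity": {
        "ex:raw-array-data": {"label": "raw microarray intensities"},
        "ex:normalized-expression": {"label": "normalized expression matrix"},
    },
    "activity": {
        "ex:normalization-run": {"label": "normalization step"},
    },
    "agent": {
        "ex:lab-analyst": {"label": "analyst who ran the pipeline"},
    },
    # Each relation is a list of (subject, object) pairs.
    "used": [("ex:normalization-run", "ex:raw-array-data")],
    "wasGeneratedBy": [("ex:normalized-expression", "ex:normalization-run")],
    "wasAssociatedWith": [("ex:normalization-run", "ex:lab-analyst")],
}


def generating_activities(record, entity_id):
    """Return the set of activities that generated the given entity."""
    return {act for ent, act in record["wasGeneratedBy"] if ent == entity_id}


def inputs_of(record, entity_id):
    """Return the entities used by the activities that generated entity_id."""
    activities = generating_activities(record, entity_id)
    return sorted(src for act, src in record["used"] if act in activities)


def agents_behind(record, entity_id):
    """Return the agents associated with the activities that generated entity_id."""
    activities = generating_activities(record, entity_id)
    return sorted(
        agent for act, agent in record["wasAssociatedWith"] if act in activities
    )


sources = inputs_of(provenance, "ex:normalized-expression")
people = agents_behind(provenance, "ex:normalized-expression")
```

A question such as "what was this matrix derived from, and who produced it?" becomes a lookup here, whereas the same facts buried in free-form text would need a human reader.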
Observations from Sage Congress and collaboration through its challenge
The glowing reports we read of biotech advances almost cause one’s brain to ache. They leave us thinking that medical researchers must command the latest in all technological tools. But the engines of genetic and pharmaceutical innovation are stuttering for lack of one key fuel: data. Here they are left with the equivalent of trying to build skyscrapers with lathes and screwdrivers.
Sage Congress, held this past week in San Francisco, investigated the multiple facets of data in these fields: gene sequences, models for finding pathways, patient behavior and symptoms (known as phenotypic data), and code to process all these inputs. A survey of efforts by the organizers, Sage Bionetworks, and other innovations in genetic data handling can show how genetics resembles and differs from other disciplines.
An intense lesson in code sharing
At last year’s Congress, Sage announced a challenge, together with the DREAM project, intended to galvanize researchers in genetics while showing off the growing capabilities of Sage’s Synapse platform. Synapse ties together a number of data sets in genetics and provides tools for researchers to upload new data, while searching other researchers’ data sets. Its challenge highlighted the industry’s need for better data sharing, and some ways to get there.
An interview with Fred Smith of the CDC on their open content APIs.
Health care data liquidity (the ability of data to move freely and securely through the system) is an increasingly crucial topic in the era of big data. Most conversations about data liquidity focus on patient data, but other kinds of information need to be able to move freely and securely, too. Enter several government initiatives, including efforts at agencies within the Department of Health and Human Services (HHS) to make their content more easily available.
Fred Smith is team lead for the Interactive Media Technology Team in the Division of News and Electronic Media in the Office of the Associate Director for Communication for the U.S. Centers for Disease Control and Prevention (CDC) in Atlanta. We recently spoke by phone to discuss ways in which the CDC is working to make their information more “liquid”: easier to access, easier to repurpose, and easier to combine with other data sources.
Which data is available from the CDC APIs?
Fred Smith: In essence, what we’re doing is taking our unstructured web content and turning it into a structured database, so we can call an API into it for reuse. It’s making our content available for our partners to build into their websites or applications or whatever they’re building.
Todd Park likes to talk about “liberating data” — well, this is liberating content. What is a more high-value dataset than our own public health messaging? It incorporates not only HTML-based text, but also we’re building this to include multimedia — whether it’s podcasts, images, web badges, or other content — and have all that content be aware of other content based on category or taxonomy. So it will be easy to query, for example: “What content does the CDC have on smoking prevention?”
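The idea of content that is "aware" of its category can be pictured with a small Python sketch. This is not the CDC's API, whose endpoints and fields are not described in this interview; the catalog entries, field names, and the find_content function are all hypothetical and only illustrate querying structured content by topic.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """One piece of structured content tagged with taxonomy topics."""
    title: str
    media_type: str  # e.g. "html", "podcast", "image"
    topics: set = field(default_factory=set)


# Hypothetical catalog entries; titles and topics are placeholders.
CATALOG = [
    ContentItem("Example article on quitting", "html", {"smoking prevention"}),
    ContentItem("Example podcast on quitting", "podcast", {"smoking prevention"}),
    ContentItem("Example flu badge", "image", {"influenza"}),
]


def find_content(catalog, topic, media_type=None):
    """Return items tagged with the topic, optionally limited to one media type."""
    wanted = topic.strip().lower()
    return [
        item
        for item in catalog
        if wanted in item.topics
        and (media_type is None or item.media_type == media_type)
    ]


smoking_items = find_content(CATALOG, "Smoking Prevention")
smoking_podcasts = find_content(CATALOG, "smoking prevention", media_type="podcast")
```

Because every item carries its topics as data rather than as page text, the smoking-prevention question from the interview turns into a single filter over the catalog.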
Five ways we can improve the information we collect to help us solve hard problems in health care.
I was honored to chair O’Reilly’s inaugural edition of Strata Rx, our conference on data science in health care, this past October along with Colin Hill. As we’re beginning to plan this year’s event, I find myself thinking a lot about a theme that emerged from some of the keynotes last fall: in order to solve the problems we’re facing in health care — to lower costs and provide more personal, targeted treatments to patients — we don’t just need more data; we need better data.
Much has been made about the era of big data we find ourselves in. But though the data we collect is straining the limits of our tools and models, we’re still not making the kind of headway we hoped for in areas like health care. So big data isn’t enough. We need better data.
What does it mean to have better data in health care? Here are some things on my list; perhaps you can think of others. Read more…
Which data formats should the DocGraph project support?
The DocGraph project has an interesting issue that I think will become a common one as the open data movement continues. For those that have not been keeping up, DocGraph was announced at Strata RX, described carefully on this blog, and will be featured again at Strata 2013. For those that do not care to click links, DocGraph is a crowdfunded open data set, which merges open data sources on doctors and hospitals.
As I recently described on the DocGraph mailing list, work is underway to acquire the data sets that we set out to merge. The issue deals with file formats.
The core identifier for doctors, hospitals and other healthcare entities is the National Provider Identifier (NPI). This is something like a Social Security number for doctors and hospitals. In fact it was created in part so that doctors would not need to use their Social Security numbers or other identifiers in order to participate in healthcare financial transactions (that is, being paid by insurance companies for their services). The NPI is the “one number to rule them” in healthcare and we want to map data from other sources accurately to that ID.
Each state releases none, one or several data files that can be purchased and also contain doctor data. But these file downloads are in “random file format X.” Of course we are not yet done with our full survey of the files and their formats, but I can assure you that they are mostly CSV files and a troubling number of PDF files. It is our job to take these files and merge them against the NPI, in order to provide a cohesive picture for data scientists.
But the data available from each state varies greatly. Sometimes they will have addresses, sometimes not. Sometimes they will have fax numbers, sometimes not. Sometimes they will include medical school information, sometimes not. Sometimes they will simply include the name of the medical school; sometimes they will use a code. Sometimes when they use codes they will make up their own …
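As a rough illustration of the merge step, the Python sketch below attaches whatever fields a state file happens to contain to an NPI-keyed registry and sets aside rows that cannot be matched. It is only a sketch under stated assumptions, not the DocGraph pipeline: the column names, the "XX" state code, the NPI values, and the registry layout are all made up.

```python
import csv
import io

# Hypothetical state file: column names and values are made up.
STATE_CSV = """npi,name,fax,med_school
1234567890,Dr. A Example,555-0100,
0987654321,Dr. B Example,,Example Medical College
,Dr. No Identifier,555-0101,
"""

# Hypothetical slice of an NPI-keyed registry.
npi_registry = {
    "1234567890": {"name": "A EXAMPLE"},
    "0987654321": {"name": "B EXAMPLE"},
}


def merge_state_rows(registry, rows, state):
    """Attach non-empty state fields to registry entries, keyed by NPI.

    Returns the rows that could not be matched so they can be reviewed.
    """
    unmatched = []
    for row in rows:
        npi = (row.get("npi") or "").strip()
        if npi not in registry:
            unmatched.append(row)
            continue
        # Keep only the fields this state actually filled in.
        extras = {
            key: value.strip()
            for key, value in row.items()
            if key not in ("npi", "name")
            and isinstance(value, str)
            and value.strip()
        }
        registry[npi].setdefault("state_data", {})[state] = extras
    return unmatched


rows = list(csv.DictReader(io.StringIO(STATE_CSV)))
unmatched = merge_state_rows(npi_registry, rows, "XX")
```

Storing each state's fields under its own key keeps the sparse, inconsistent columns from overwriting one another, and the unmatched list makes the rows without a usable NPI visible instead of silently dropping them. PDF sources and state-specific medical school codes would need their own extraction and mapping steps, which this sketch leaves out.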
I am not complaining here. We knew what we were getting ourselves into when we took on the DocGraph project. The community at large has paid us well to do this work! But now we have a question: What data formats should we support? Read more…
What to do when facing the stoic expressions that pop up during ethics discussions.
The other day I clicked on a message posted to the O’Reilly editors’ email list and the message text filled up almost the entire monitor screen. I must admit that I thought “Am I going to require another caffeine hit to read through this?”
I decided to take a chance, not take another break just then, and read the lengthy note. I didn’t need that caffeine hit after all. Apparently, neither did half a dozen other editors.
The note was about ethics.
In a previous life, I worked in the competitive intelligence field. I remember participating in a friendly confab at an industry event and then someone mentioned the word “e-t-h-i-c-s”. It was rather fascinating to see how that word elicited stoic faces. No one wanted to be the first person to say anything on that topic. Now when working at ORM, mention the word “ethics!” and folks are not shy about saying exactly what they think. Not. At. All.
During the discussion, Ethics of Big Data by Kord Davis came up. While I was not the editor on this book, I did read it when I was in New York. It made my list of recommended books for people looking to jump into the world of big data. Why? Because I remembered the stoic poker faces from my previous life in competitive intelligence. Read more…
An inside look at DocGraph, a data project that shows how the U.S. health care system delivers care.
At Strata RX in October I announced the availability of DocGraph. This is the first project of NotOnly Development, which is a Not Only For Profit Health IT micro-incubator.
The DocGraph dataset shows how doctors, hospitals, laboratories and other health care providers team together to treat Medicare patients. This data details how the health care system in the U.S. delivers care.
You can read about the basics of this data release, and you can read about my motivations for making the release. Most importantly, you can still participate in our efforts to crowdfund improvements to this dataset. We have already far surpassed our original $15,000 goal, but you can still get early and exclusive access to the data for a few more days. Once the crowdfunding has ended, the price will go up substantially.
This article will focus on this data from a technical perspective.
In a few days, the crowdfunding (hosted by Medstartr) will be over, and I will be delivering this social graph to all of the participants. We are offering a ransom license that we are calling “Open Source Eventually,” so participants in the crowdfunding will get exclusive access to the data for a full six months before the license to this dataset automatically converts to a Creative Commons license. The same data is available under a proprietary-friendly license for more money. For all of these “releases,” this article will be the go-to source for technical details about the specific contents of the file.
O'Reilly conference brings together health care and data
O’Reilly’s first conference devoted to health care, Strata Rx, wrapped up earlier this week. Despite competing with at least three other conferences held the same week around the country on various aspects of health care and technology, we drew a crowd that filled the ballroom during keynotes and spent the breaks networking more hungrily than they attacked the (healthy) food provided throughout.
Springing from O’Reilly’s Strata series about the use of data to change business and society, Strata Rx explored many other directions in health care, as a peek at the schedule will show. The keynotes were filmed and will soon appear online. The unique perspectives offered by expert speakers are evident, but what’s hard is making sense of the two days as a whole.
In this article I’ll try to show the underlying threads that tied together the many sessions about data analytics, electronic records, disruption in the health care industry, 21st-century genetics research, patient empowerment, and other themes. The essential message from the leading practitioners at Strata Rx is ultimately that no one in health care (doctors, administrators, researchers, regulators, patients) can practice their discipline in isolation any more. We are all going to have to work together.
We can’t wait for insights from others, expecting researchers to hand us ideal treatment plans or doctors to make oracular judgments. The systems are all interconnected now. And if we want healthy people, not to mention sustainable health care costs, we will have to play our roles in these systems with nuance and sophistication.
But I’ll get to this insight by steps. Let’s look at some major themes of Strata Rx. Read more…
Watch live keynotes from this week's Strata Rx Conference in San Francisco.
The intersection of big data and health care was explored at the O’Reilly Strata Rx Conference. The event has concluded, but you can still access an archive of videos, photos, and speaker slides. Read more…
Voice your support for a proposed federal rule that expands patients' access to test results.
I’m convinced that there’s a wave of innovation coming in healthcare, driven by new kinds of data, new ways of extracting meaning from that data, and new business models that data can enable. That’s one of the reasons why we launched our StrataRx Conference, which focuses on the importance of data science to the future of health care.
Unfortunately, much of the data that will enable an entrepreneurial explosion is still locked up — in paper records, in proprietary data formats, and by well-intentioned but conflicting privacy regulations.
We’re making progress towards open data in healthcare, but there are still so many obstacles! Ann Waldo recently introduced me to one of these.
A 2009 law modernized patient access rights by allowing individuals to get copies of their medical records in electronic format. Surprisingly, however, these access rights do not include lab test results – one of the types of medical records that people are most likely to find urgent and useful. Due to the interaction of HIPAA (the Federal medical privacy law), CLIA (a Federal laboratory regulatory law), and state laws, patients can only get direct access to their test results from labs in a handful of states.
A recent New York Times story highlighted just how much pain and suffering can be caused by this inability to get access to your own lab results.
In 2011, the Department of Health and Human Services put forward a proposed Rule that would give patients the right to get their test results directly from laboratories. This Rule is still waiting to be finalized. In hopes of breaking the logjam, O’Reilly Media and a variety of other players have written a consensus letter that voices our whole-hearted support for that proposed Rule and encourages the Federal government to finalize it promptly.
We’d love to invite you to join us in signing this letter.
Patients’ rights should include direct access to their lab results, just like all their other medical records!