Preferred structures for cleaned-up doctor data

Which data formats should the DocGraph project support?

The DocGraph project has an interesting issue that I think will become a common one as the open data movement continues. For those who have not been keeping up, DocGraph was announced at Strata Rx, described carefully on this blog, and will be featured again at Strata 2013. For those who do not care to click links: DocGraph is a crowdfunded open data set that merges open data sources on doctors and hospitals.

As I recently described on the DocGraph mailing list, work is underway to acquire the data sets that we set out to merge. The issue at hand is file formats.

The core identifier for doctors, hospitals and other healthcare entities is the National Provider Identifier (NPI). This is something like a Social Security number for doctors and hospitals. In fact, it was created in part so that doctors would not need to use their Social Security numbers or other identifiers in order to participate in healthcare financial transactions (i.e., be paid by insurance companies for their services). The NPI is the “one number to rule them all” in healthcare, and we want to map data from other sources accurately to that ID.

Each state releases zero, one, or several purchasable data files that contain doctor data. But these downloads arrive in “random file format X.” Of course, we are not yet done with our full survey of the files and their formats, but I can assure you that they are mostly CSV files, with a troubling number of PDF files. It is our job to take these files and merge them against the NPI in order to provide a cohesive picture for data scientists.
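
To make that concrete, here is a rough Python sketch of the kind of NPI join we are talking about. The file names and column headers are invented; every real state file would first need its own mapping onto something like this.

    import csv

    # Hypothetical file and column names -- the real NPI dump and state files
    # use different headers; this only sketches the join-on-NPI idea.
    def merge_state_file(npi_path, state_path, out_path):
        # Index the state file's rows by their NPI column.
        with open(state_path, newline="") as f:
            state_rows = {row["npi"]: row for row in csv.DictReader(f)}

        # Stream the big NPI file and append whatever the state file adds.
        with open(npi_path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["license_number", "medical_school"])
            writer.writeheader()
            for row in reader:
                extra = state_rows.get(row["npi"], {})
                row["license_number"] = extra.get("license_number", "")
                row["medical_school"] = extra.get("medical_school", "")
                writer.writerow(row)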

But the data available from each state varies greatly. Sometimes they will have addresses, sometimes not. Sometimes they will have fax numbers, sometimes not. Sometimes they will include medical school information, sometimes not. Sometimes they will simply include the name of the medical school; sometimes they will use a code. Sometimes when they use codes they will make up their own …
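
One way to tame that variation is a small per-state column map that translates whatever a state calls its fields into one canonical record, leaving gaps empty. A sketch, with entirely made-up state column names:

    # Canonical fields plus per-state column maps; the state column names
    # below are invented -- each real file would get its own map.
    CANONICAL_FIELDS = ["npi", "name", "address", "fax", "medical_school"]

    STATE_COLUMN_MAPS = {
        "TX": {"NPI_NUM": "npi", "PROVIDER_NAME": "name", "SCHOOL_CODE": "medical_school"},
        "OH": {"npi": "npi", "full_name": "name", "fax_no": "fax"},
    }

    def normalize(state, raw_row):
        record = {field: None for field in CANONICAL_FIELDS}  # missing fields stay None
        for source_col, canonical in STATE_COLUMN_MAPS[state].items():
            record[canonical] = raw_row.get(source_col)
        return record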

I am not complaining here. We knew what we were getting ourselves into when we took on the DocGraph project. The community at large has paid us well to do this work! But now we have a question: what data formats should we support?

The simple answer is that everyone can handle JSON, XML, or CSV, and we will probably end up supporting all of those formats to some degree. XML is famous for its capacity to express “constraints” that could solve some or all of the problems we are considering addressing. Of course, there is some work on similar constraint systems for JSON: this guy thinks more is needed; this guy thinks less is needed. Both of them seem to have thought about it more than I have.
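
For example, a JSON Schema (validated here with the third-party Python jsonschema library) can express the kinds of constraints we have in mind. The field names and rules below are illustrative only, not a proposed DocGraph schema:

    import jsonschema  # third-party: pip install jsonschema

    # Illustrative constraints for a merged doctor record -- not a proposed
    # DocGraph schema, just the kind of rules JSON Schema can express.
    DOCTOR_SCHEMA = {
        "type": "object",
        "properties": {
            "npi": {"type": "string", "pattern": "^[0-9]{10}$"},  # NPIs are 10 digits
            "name": {"type": "string"},
            "fax": {"type": ["string", "null"]},             # optional: many states omit it
            "medical_school": {"type": ["string", "null"]},
        },
        "required": ["npi", "name"],
    }

    # Raises jsonschema.ValidationError if the record breaks a constraint.
    jsonschema.validate(
        {"npi": "1234567893", "name": "Example Provider", "fax": None},
        DOCTOR_SCHEMA,
    )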

What I am concerned with is how to architect the schema(s) for these merges. As we merge the doctor data, we will be making guesses about what a particular field in a file download means. To what degree should we expose those guesses?

How do we merge data sources that will have overlapping and potentially redundant data (like addresses for doctors that come from five different files)? How do we handle metadata, when so much of the metadata involves which file/state a data point originated from? Some people will really want to know that, but most people want a slim data structure that is easy to work with.
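
One option we are weighing is a “full” view in which every field keeps its source file and our confidence in the guess about what the source column meant, plus a “slim” view that collapses each field to a single value. A rough sketch, with all names invented:

    # "Full" view: every observed value keeps the file it came from and how
    # confident we are in the guess about what the source column meant.
    # All names here are invented for illustration.
    full_record = {
        "npi": "1234567893",
        "address": [
            {"value": "100 Main St, Springfield", "source": "OH_license_file_2012.csv", "confidence": "high"},
            {"value": "100 Main Street", "source": "npi_core_dump.csv", "confidence": "medium"},
        ],
    }

    def slim(record):
        # "Slim" view: collapse each multi-valued field to one value. A real
        # merge would rank sources rather than take whatever comes first.
        out = {}
        for key, value in record.items():
            if isinstance(value, list):
                out[key] = value[0]["value"] if value else None
            else:
                out[key] = value
        return out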

I feel somewhat uneasy just asking the community of data scientists about these issues, but it ends up being a pretty big deal in the long run. We hope to maintain this as an open dataset, which means that we want people to be able to rely on these file formatting decisions for as long as possible (we will be versioning the format). Eventually people will start to merge this dataset with DNA and phenomic data, including geo-aware population health data. That is going to be a pretty complicated process, and I really want to save headaches for the people involved in that kind of work.

Also important is the size of this dataset. The core data file is an almost 5 GB CSV file. I expect the state files to be around 1 GB each. Add all of that up, add JSON/XML formatting, and you are talking about a lot of space. How can we choose a format that will make working with this dataset easier? So far, we have seen lots of cool stuff because we kept things simple for data scientists.
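
Whatever we pick, we will probably distribute compressed files that can be streamed rather than loaded whole. For instance, a gzipped CSV can be processed row by row in constant memory (the file name here is hypothetical):

    import csv
    import gzip

    # Sketch: gzip the big CSV and stream it row by row instead of loading
    # ~5 GB into memory. The file name is hypothetical.
    def iter_rows(path="docgraph_core.csv.gz"):
        with gzip.open(path, "rt", newline="") as f:
            for row in csv.DictReader(f):
                yield row

    # One pass, constant memory.
    row_count = sum(1 for _ in iter_rows())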

I would like to open the floor to comments (on this blog post), which the DocGraph project will be taking very seriously. When you comment, if you could mention other open projects that have solved similar problems and how they did it, that would be great. It would also be great to have links to any relevant SQL or document-based layout standards that might help us. If you are going to provide an academic reference, please also let us know whether you have experience working with those standards. While we really want to solve this problem in a parsimonious way, practical usability is our primary aim here. Thanks in advance for your help.

  • lazyspark

A combination of NPI and taxonomy code is going to help more than NPI alone, since a single NPI can point to an institution that may house different facilities (pediatrics, cancer care, etc.).

  • http://twitter.com/greenbacker Charlie Greenbacker

    I don’t think this is a problem you’re really going to be able to permanently solve. You’ll always be dealing with wild & wooly data to ingest, so you’ll always be working on ETL processes to harvest that data into your system. At its core, this is a bureaucracy problem at the state government level, not a technology problem. As soon as you have ETL processes set up for all 50 states, someone will change their format, alter their metrics, etc. and you’ll be scrambling to adapt.

    So, you’ll need to support any and all file formats used by the data sources you intend to ingest, and you’ll need to be prepared to constantly update your ETL processes to keep up with the dynamic data environment. This will all require a lot of manual effort to deconflict redundant data, etc.

    The format you use to store your merged data doesn’t really matter — just pick whatever makes your job easier. Ideally, though, you’ll provide export mechanisms for users to access this data in all the popular formats: CSV, XML, JSON, etc.

    For inspiration, I’d recommend checking out GeoNames.org which merges geographic information from dozens of gazetteers (in various formats with different data fields) into a single database.

  • http://twitter.com/leonardkish Leonard Kish

    I think you’re bringing up a larger issue of how to manage open data sets. I suspect there are open data communities that have solved these issues before. A data wiki?

    What’s the UNIX philosophy say about the universal interface?

    • http://twitter.com/leonardkish Leonard Kish

      I could be way off here, but might elements of Ward Cunningham’s Smallest Federated Wiki be of use here? Likely would need to be modified for use strictly on datasets, but… maybe that’s a whole other project.

Anyway, I’d go with the leanest, lowest-common-denominator format: .csv

  • Prabhu K

I would sincerely recommend RDF as the format. It is your old key-value-attribute model on steroids. You can query it with SPARQL (similar to SQL). You can pull other pieces of data along with your existing data, and the NPI becomes the URI that everything else revolves around.

    Here is a quick link that shows you the scale – http://www.w3.org/wiki/LargeTripleStores

The standard is open, so if you have a clean ontology you can keep linking the data instead of giving it out as one download in CSV or some other format.
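
    For illustration, a minimal sketch with the Python rdflib library, using the NPI as the URI; the namespace and predicates here are invented:

      from rdflib import Graph, Literal, Namespace, URIRef  # pip install rdflib

      # Invented namespace and predicates -- just to show NPI-as-URI.
      DG = Namespace("http://example.org/docgraph/")

      g = Graph()
      doctor = URIRef("http://example.org/npi/1234567893")
      g.add((doctor, DG["name"], Literal("Example Provider")))
      g.add((doctor, DG["medicalSchool"], Literal("Example Medical School")))

      # Query it back with SPARQL, much as you would with SQL.
      results = g.query("""
          SELECT ?name WHERE {
              <http://example.org/npi/1234567893> <http://example.org/docgraph/name> ?name .
          }
      """)
      for row in results:
          print(row[0])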

  • http://twitter.com/_earthcitizen Greg S

    (DocGraph MedStartr contributor here)

    File format preferred = CSV

It’s the most universal, and everyone working with XML/JSON can do their own conversions after the fact.

    How to handle fields: I would start with NPI and name, obviously, then merge all address formats into the following:
    - address_1 field
    - address_2 field
    - zip
    - zip+4 (optional?)
    - state

    Then contact info such as email, phone, fax, etc.

    If some states provide fax_number and some do not, just leave the field blank where they do not. I would find the common-denominator fields that all files share and start there, then add fields that some or most states have, followed by fields that only a handful of states (or only one state) have. I see no reason to leave out a field like fax_number just because some states have it and some do not; I would always keep the field and leave it blank where the data does not exist.
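
    A rough sketch of that header layout in Python, with illustrative column names; optional fields stay in the header and are simply left blank when a state does not supply them:

      import csv

      # Illustrative column names following the layout above: common fields
      # first, then optional fields that stay in the header and are left
      # blank when a state does not supply them (restval="" does that).
      FIELDNAMES = [
          "npi", "name",
          "address_1", "address_2", "zip", "zip_plus_4", "state",
          "email", "phone", "fax",
      ]

      with open("merged_providers.csv", "w", newline="") as f:
          writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval="")
          writer.writeheader()
          # A state that supplies no fax or email still produces a full row.
          writer.writerow({"npi": "1234567893", "name": "Example Provider", "state": "TX"})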