ENTRIES TAGGED "data project"
Which data formats should the DocGraph project support?
The DocGraph project has an interesting issue that I think will become a common one as the open data movement continues. For those that have not been keeping up, DocGraph was announced at Strata RX, described carefully on this blog, and will be featured again at Strata 2013. For those that do not care to click links, DocGraph is a crowdfunded open data set, which merges open data sources on doctors and hospitals.
As I recently described on the DocGraph mailing list, work is underway to acquire the data sets that we set out to merge. The issue deals with file formats.
The core identifier for doctors, hospitals and other healthcare entities is the National Provider Identifier (NPI). This is something like a Social Security number for doctors and hospitals. In fact it was created in part so that doctors would not need to use their Social Security numbers or other identifiers in order to participate in healthcare financial transactions (i.e., to be paid by insurance companies for their services). The NPI is the “one number to rule them” in healthcare and we want to map data from other sources accurately to that ID.
Each state releases zero, one, or several data files that can be purchased and that also contain doctor data. But these file downloads are in “random file format X.” Of course we are not yet done with our full survey of the files and their formats, but I can assure you that they are mostly CSV files and a troubling number of PDF files. It is our job to take these files and merge them against the NPI, in order to provide a cohesive picture for data scientists.
But the data available from each state varies greatly. Sometimes they will have addresses, sometimes not. Sometimes they will have fax numbers, sometimes not. Sometimes they will include medical school information, sometimes not. Sometimes they will simply include the name of the medical school, sometimes they will use a code. Sometimes when they use codes they will make up their own …
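To make the merge problem concrete, here is a minimal sketch of folding one state's CSV file into a set of records keyed by NPI. Everything specific in it is an assumption for illustration: the common field list, the column names in the mapping, and the function names are hypothetical, not the actual DocGraph schema or code. It does not attempt to translate state-specific medical school codes.

```python
import csv
import io

# Hypothetical common schema; the post does not specify the real one.
COMMON_FIELDS = ["npi", "name", "address", "fax", "medical_school"]


def normalize_row(row, column_map):
    """Rename one state's columns to the common schema.

    Fields the state does not provide (or leaves blank) stay None.
    """
    out = {field: None for field in COMMON_FIELDS}
    for source_col, common_field in column_map.items():
        value = (row.get(source_col) or "").strip()
        if value:
            out[common_field] = value
    return out


def merge_by_npi(base, state_csv_text, column_map):
    """Fill missing fields in `base` (a dict keyed by NPI) from one state file.

    Existing values in `base` are never overwritten. Rows whose NPI is
    missing or unknown are returned so they can be reviewed by hand.
    """
    unmatched = []
    for row in csv.DictReader(io.StringIO(state_csv_text)):
        record = normalize_row(row, column_map)
        npi = record["npi"]
        if npi not in base:
            unmatched.append(record)
            continue
        for field, value in record.items():
            if value is not None and base[npi].get(field) is None:
                base[npi][field] = value
    return unmatched
```

Each state file would get its own `column_map`, for example `{"NPI": "npi", "Fax Number": "fax"}` for a file that only carries fax numbers. Keeping the per-state differences in a small mapping, rather than in the merge logic, is one way to cope with the variation described above.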
I am not complaining here. We knew what we were getting ourselves into when we took on the DocGraph project. The community at large has paid us well to do this work! But now we have a question: what data formats should we support?
Ilya Grigorik's GitHub project shows what happens when questions, data, and tools converge.
1. Ask the question, “I wonder what happens if I do this?” and then follow it all the way through.
2. Start a project on a whim and open it up so anyone can participate.
By day, Grigorik is a developer advocate on Google’s Make the Web Fast team (he’s a perfect candidate for a future Velocity interview). On the side, he likes to track open source projects on GitHub. As he explained during our chat, this can be a time-intensive hobby:
“I follow about 3,000 open source projects, and I try to keep up with what’s going on, what are people contributing to, what are the new interesting sub-branches of work being done … The problem I ran into about six months ago was that, frankly, it was just too much to keep up with. The GitHub timeline was actually overflowing. In order to keep up, I would have to go in every four hours and scan through everything, and then repeat it. That doesn’t give you much time for sleep.” [Discussed 15 seconds into the interview.]
Grigorik built a system, including a newsletter, that lets him stay in the loop efficiently. He worked with GitHub to archive public GitHub activity, and he then made that data available in raw form and through Google BigQuery (the data is updated hourly).
This is a fun project, no doubt, but it’s also a big deal. Here’s why: When you shorten the distance between questions and answers, you empower people to ask more questions. It’s the liberation of curiosity, and that’s exactly what happened here.