ENTRIES TAGGED "databases"
Matching the missing to the dead involves reconciling two national databases.
Javier Reveron went missing from Ohio in 2004. His wallet turned up in New York City, but he was nowhere to be found. By the time his parents arrived to search for him and hand out fliers, his remains had already been buried in an unmarked indigent grave. In New York, where coroner’s resources are precious, remains wait a few months to be claimed before they’re buried by convicts in a potter’s field on uninhabited Hart Island, just off the Bronx in Long Island Sound.
The story, reported by the New York Times last week, has as happy an ending as it could given that beginning. In 2010 Reveron’s parents added him to a national database of missing persons. A month later police in New York matched him to an unidentified body and his remains were disinterred, cremated and given burial ceremonies in Ohio.
Reveron’s ordeal suggests an intriguing, and impactful, machine-learning problem. The Department of Justice maintains separate national, public databases for missing people, unidentified people and unclaimed people. Many records are full of rich data that is almost never a perfect match to data in other databases — hair color entered by a police department might differ from how it’s remembered by a missing person’s family; weights fluctuate; scars appear. Photos are provided for many missing people and some unidentified people, and matching them is difficult. Free-text fields in many entries describe the circumstances under which missing people lived and died; a predilection for hitchhiking could be linked to a death by the side of a road.
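To make the matching problem concrete, here is a minimal sketch of soft record comparison, written as a hand-weighted baseline rather than a trained model. Everything in it is an assumption made for illustration: the `Record` fields, the weights, the 40-pound tolerance and the rule that a body cannot be found before the person was last seen. None of it reflects the actual DOJ schemas, and photo matching is left out entirely.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    """Hypothetical, simplified record; the real database schemas will differ."""
    record_id: str
    sex: str
    hair_color: str
    weight_lb: float
    year: int    # year last seen (missing) or year found (unidentified)
    notes: str   # free-text description of the circumstances


def text_similarity(a: str, b: str) -> float:
    """Rough 0..1 string similarity that tolerates small wording differences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(missing: Record, unidentified: Record) -> float:
    """Combine soft comparisons into a 0..1 score. Weights are illustrative guesses."""
    # Assumed hard rule: remains cannot be found before the person was last seen.
    if unidentified.year < missing.year:
        return 0.0
    sex = 1.0 if missing.sex == unidentified.sex else 0.0
    hair = text_similarity(missing.hair_color, unidentified.hair_color)
    # Weights fluctuate, so give partial credit that fades out over a 40 lb gap.
    gap = abs(missing.weight_lb - unidentified.weight_lb)
    weight = max(0.0, 1.0 - gap / 40.0)
    notes = text_similarity(missing.notes, unidentified.notes)
    return 0.3 * sex + 0.2 * hair + 0.3 * weight + 0.2 * notes


def rank_candidates(missing, unidentified_records):
    """Return (score, record_id) pairs, best candidate first."""
    scored = [(match_score(missing, u), u.record_id) for u in unidentified_records]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Invented records, for demonstration only.
    missing = Record("M-1", "male", "dark brown", 160, 2004, "last seen hitchhiking")
    candidates = [
        Record("U-1", "male", "brown", 150, 2004, "found by the side of a road"),
        Record("U-2", "female", "blond", 120, 2006, "found in a river"),
        Record("U-3", "male", "brown", 155, 2001, "found by the side of a road"),
    ]
    for score, record_id in rank_candidates(missing, candidates):
        print(record_id, round(score, 2))
```

Plain string similarity will not connect "hitchhiking" to "side of a road"; that kind of link is where learned text features, and the training set of resolved cases, would have to come in.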
I’ve called the Department of Justice (DOJ) to ask about the extent to which they’ve worked with computer scientists to match missing and unidentified people, and will update when I hear back. One thing that’s not immediately apparent is the public availability of the necessary training set — cases that have been successfully matched and removed from the lists. The DOJ apparently doesn’t comment on resolved cases, which could make getting this data difficult. But perhaps there’s room for a coalition to request the anonymized data and manage it to the DOJ’s satisfaction while distributing it to capable data scientists.
How to think about choosing a database.
A relational database is no longer the default choice. Mike Loukides charts the rise of the NoSQL movement and explains how to choose the right database for your application.
Questions surround the Aaron Swartz case and Microsoft wants to help scholars with big data.
Aaron Swartz faces felony charges for downloading "big data" (more than 4 million academic journal articles) from the MIT library, Microsoft's new data tool is aimed at scholars, and David Eaves looks at open data efforts in Canada.
CouchDB proves a good fit for a project with technical limits.
A new project in Zambia is trying to integrate supervisors, clinics, and community healthcare workers into a system that can improve patient service and provide more data. In this interview, Cory Zue explains how CouchDB is playing a role.
Data acquisition for a site like CrunchBase may not carry the costs some assume.
The data acquisition process should be increasingly automatic, and so increasingly cheap. I'm hoping for a world where information producers are paid for extracting value from that data.
On March 11 Boston will join several other cities that have hosted conferences on the movement broadly known as NoSQL. Cassandra, CouchDB, HBase, HypergraphDB, Hypertable, Memcached, MongoDB, Neo4j, Riak, SimpleDB, Voldemort, and probably other projects as well will be represented at the one-day affair. The interviews I had with various project leaders for this article turned up a recurring usage pattern for NoSQL. What connects the users is that they carry out web-related data crunching, searching, and other Web 2.0-related work. I think these companies use NoSQL tools because they’re the companies that understand leading-edge technologies and are willing to take risks in those areas. As the field gets better known, usage will spread.
Access to local information is great, but context is even better.
There’s plenty of enthusiasm for local / hyperlocal projects, but the sweepstakes has yet to be won. So many of these local efforts rely on traditional information delivery through news articles or databases. That material has use, no doubt. Yet few projects take the extra step and put that data into context.