Unlocking data from paper forms is the problem that optical character recognition (OCR) software is supposed to solve. Two issues persist, however. First, the hardware and software involved are expensive, creating challenges for cash-strapped nonprofits and government. Second, all of the information on a given document is scanned into a system, including sensitive details like Social Security numbers and other personally identifiable information. This is a particularly difficult issue with respect to health care or bringing open government to courts: privacy by obscurity will no longer apply.
The process of converting paper forms into structured data still hasn’t been significantly disrupted by the rapid growth of the Internet, distributed computing and mobile devices. Fields ranging from research science to medicine, law, education, consumer finance and government all need better, cheaper bridges from the analog to the digital sphere.
“I was looking at the information systems that were available to these low-resource organizations,” Chen said in a recent phone interview. “I saw that they’re very much bound in paper. There’s actually a lot of efforts to modernize the infrastructure and put in mobile phones. Now that there’s mobile connectivity, you can run a health clinic on solar panels and long distance Wi-Fi. At the end of the day, however, business processes are still on paper because they had to be essentially fail-proof. Technology fails all the time. From that perspective, paper is going to stick around for a very long time. If we’re really going to tackle the challenge of the availability of data, we shouldn’t necessarily be trying to change the technology infrastructure first — bringing mobile phones and iPads to where there’s paper — but really to start with solving the paper problem.”
When Chen saw that data entry was a chokepoint for digitizing health indicators, he started developing a better, cheaper way to ingest data from forms.
Captricity’s approach to that paper problem is fascinating, as I saw in an online demo of the technology. They’ve found a way to use the Internet to quickly and cheaply digitize handwritten forms into structured data.
“Our target user is an office admin who’s really good at, let’s say, Excel and doing mail merges in Word,” said Chen. “We make sure that person or a school teacher or an existing database administrator can just go and do it themselves.”
The process looks relatively simple for the end user, at least on the surface. Scan a form and then outline fields on it to create a template. When you subsequently scan a batch of forms, the system breaks up the designated fields into images that crowdsourced workers can identify online. The fields from each form are then output as structured data, exportable as a CSV file or into Google documents. Each digitized document represents a row. The provenance of the data is preserved, with the original image connected to a given cell. Most jobs take 10 to 20 minutes. (The demo I saw took 11 minutes or so.)
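The template-to-spreadsheet flow described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Field` class, the pixel-box layout, and the `crop_jobs` and `to_csv` helpers are hypothetical names, not Captricity's actual data model or API. Provenance is kept here at the row level through a `source_image` column, whereas the real service links the original image to each cell.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Field:
    """One outlined region of the form template (hypothetical structure)."""
    name: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_jobs(template: List[Field], page_ids: List[str]) -> List[dict]:
    """Break every scanned page into one small transcription job per field."""
    return [
        {"page": page, "field": field.name, "box": field.box}
        for page in page_ids
        for field in template
    ]


def to_csv(template: List[Field], page_ids: List[str],
           values: Dict[Tuple[str, str], str]) -> str:
    """Write one row per scanned document, keeping a pointer to its image."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["source_image"] + [f.name for f in template])
    for page in page_ids:
        # A field with no transcription yet is left as an empty cell.
        writer.writerow([page] + [values.get((page, f.name), "") for f in template])
    return out.getvalue()


template = [Field("name", (40, 100, 400, 140)), Field("age", (40, 160, 120, 200))]
pages = ["scan_001.png", "scan_002.png"]
jobs = crop_jobs(template, pages)  # four jobs: two fields on each of two pages
transcribed = {("scan_001.png", "name"): "Ann", ("scan_001.png", "age"): "30",
               ("scan_002.png", "name"): "Bob"}
print(to_csv(template, pages, transcribed))
```

Keeping the field list in one template means the same outline can be reused for every later batch of the same form.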
“Under the covers, the approach is to take a page and cut it up into little pieces,” said Chen. “We reorganize each little piece and then we send it out to workers on the Internet. They don’t see the context of anything else. They give us a set of answers that we make sure are correct by essentially doing it in triplicate. Then we take a small set of these triple-verified values and build essentially a machine-learning vision engine that predicts the value for that box.”
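The triplicate check Chen describes can be pictured as a simple majority vote over independent transcriptions of the same cropped field. The sketch below is an assumption-laden illustration, not Captricity's actual rule: the `consensus` function, the whitespace and case normalization, and the two-of-three threshold are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus(answers: Sequence[str], min_agree: int = 2) -> Optional[str]:
    """Return the value at least `min_agree` workers agree on, else None.

    Hypothetical rule: answers are compared after trimming whitespace and
    lowercasing; a production system may normalize very differently.
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= min_agree else None


# Three workers each transcribe the same field image without seeing the page.
print(consensus(["Smith", "smith ", "Smyth"]))  # two of three agree
print(consensus(["12", "17", "72"]))            # no agreement, returns None
```

Values that pass a check like this could then serve as labeled examples for the machine-learning engine Chen mentions, while fields with no agreement would need another round of review.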
Captricity has effectively positioned itself as a “digitization-as-a-service” provider. Instead of buying equipment, organizations can pay as they go.
“It’s a place on the Internet, like a tap, that can turn on digitization,” Chen said. “You pay for exactly how much you use. You don’t have to spend $300,000 and buy a very high-speed scanner and self-service on premise to get what you need to get done. What we have runs on Amazon AWS and uses its Elastic Computing Cloud to crowdsource labor from Mechanical Turk. We are in talks to use other more private and specialized crowds as well.”
The retail cost to use Captricity is about $0.20 per page, after the first 25 pages, which are free. Larger volume customers will pay a bit more, based on the type of data and the volume of data. By comparison, OCR-only solutions are in the ballpark of $0.01 to $0.03 per page, said Chen, but require an expensive software license and equipment.
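To make the pricing comparison concrete, here is a rough sketch of the arithmetic using the per-page figures quoted above. The function names are hypothetical, the fixed cost of an OCR setup (license plus equipment) is an input the article does not specify, and volume discounts or surcharges are ignored. Amounts are kept in whole cents to avoid floating-point rounding.

```python
FREE_PAGES = 25             # first 25 pages are free
RETAIL_CENTS_PER_PAGE = 20  # about $0.20 per page after that


def retail_cost_cents(pages: int) -> int:
    """Pay-as-you-go cost at the quoted retail rate."""
    return max(0, pages - FREE_PAGES) * RETAIL_CENTS_PER_PAGE


def ocr_cost_cents(pages: int, cents_per_page: int, fixed_cents: int) -> int:
    """OCR-only cost: an up-front fixed cost plus a small per-page rate."""
    return fixed_cents + pages * cents_per_page


def break_even_pages(cents_per_page: int, fixed_cents: int) -> int:
    """Smallest page count at which the OCR-only total is no higher than retail."""
    if not 0 <= cents_per_page < RETAIL_CENTS_PER_PAGE or fixed_cents < 0:
        raise ValueError("expects 0 <= per-page rate < retail rate and fixed cost >= 0")
    margin = RETAIL_CENTS_PER_PAGE - cents_per_page
    # Ceiling division on integers: solve fixed + p*c <= (p - 25) * 20 for p.
    return -(-(fixed_cents + FREE_PAGES * RETAIL_CENTS_PER_PAGE) // margin)


# Hypothetical fixed cost: the $300,000 scanner setup Chen mentions, at $0.03/page.
pages = break_even_pages(cents_per_page=3, fixed_cents=30_000_000)
print(pages, retail_cost_cents(pages), ocr_cost_cents(pages, 3, 30_000_000))
```

Under these assumptions, an organization would need to process a very large number of pages before the up-front investment pays for itself, which is the gap a pay-as-you-go service targets.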
With the recent launch of an iOS app, Captricity went mobile. It’s now possible to scan using an iPhone, iPad or iPod Touch, integrate with an existing template and then sync the information to a Dropbox account. An Android app is in development and will be launched early next year. They’ve also rolled out integration with Salesforce in the mobile app, along with Box.com and Constant Contact.
The startup, however, holds the potential to be something bigger than just a better mobile digitization provider: Captricity’s application programming interface (API) will help developers tap into that digitized data more easily.
“We have an API that has been in private beta for about a month-and-a-half now,” said Chen. “We extract away complexity and allow an application developer to just say, ‘We have forms. Let’s enter in the data. Go.’ And you’re up and running in a day.”
And now, developers can share that information. On Tuesday, Captricity announced a new open data platform that will publicly share digitized, structured datasets. The first dataset that they’ve published comes from the U.S. Census, as a demonstration of the concept.
This technology is of particular interest to health care, which is full of forms. That’s why Chen will join representatives from West Health and ElationEMR at this month’s StrataRx Conference to talk about the untapped potential of structured data on paper. The ability for users to only scan certain parts of forms — which would enable fields containing personally identifiable information to be left out — could be a key component of health data infrastructure. There will be special challenges, given HIPAA rules that govern patient data, but selective digitization might be a way to address them.
“We have fairly strict and cautious controls over this process, so it’s not automatic, but rather handheld, to make sure that only data intended to be public becomes public,” said Chen.
The idea of breaking up jobs for many people to work on online has been around for a while, of course, with crowdsourcing breaking big in the middle of the last decade. The innovation here is applying it to something that machines still can’t do as well as humans — reading handwriting — and then solving a problem that has been a real chokepoint for digitizing data. Even if this particular startup doesn’t end up taking off, they’ve approached a critical need in a way that has implications for multiple industries. For the data economy to grow, it will need more feedstock.
“This roughly falls under this flag of human-guided machine learning,” said Chen. “I think that with the advent of crowdsourcing, this is going to be a really powerful force in improving what machine learning algorithms can do and the types of problems it can solve. In six months — plus the time I spent on my PhD — we built a production system to do OCR that is better than any OCR system out there today, by probably an order of magnitude. I have the most respect for the computer vision researchers that came up with those OCR algorithms. It’s just that we solved a different problem than they did. This is the approach that we’re taking to solve the other problems of this domain as well.”