ENTRIES TAGGED "machine learning"

A startup takes on “the paper problem” with crowdsourcing and machine learning

With a new mobile app and API, Captricity wants to build a better bridge between analog and digital.

Unlocking data from paper forms is the problem that optical character recognition (OCR) software is supposed to solve. Two issues persist, however. First, the hardware and software involved are expensive, creating challenges for cash-strapped nonprofits and government. Second, all of the information on a given document is scanned into a system, including sensitive details like Social Security numbers and other personally identifiable information. This is a particularly difficult issue with respect to health care or bringing open government to courts: privacy by obscurity will no longer apply.

The process of converting paper forms into structured data still hasn’t been significantly disrupted by rapid growth of the Internet, distributed computing and mobile devices. Fields that range from research science to medicine to law to education to consumer finance to government all need better, cheaper bridges from the analog to the digital sphere.

Enter Captricity. The startup, which was co-founded by Jeff J. Lin and Kuang Chen, has its roots in the fieldwork on rural health Chen did as part of his PhD program.

“I was looking at the information systems that were available to these low-resource organizations,” Chen said in a recent phone interview. “I saw that they’re very much bound in paper. There’s actually a lot of efforts to modernize the infrastructure and put in mobile phones. Now that there’s mobile connectivity, you can run a health clinic on solar panels and long distance Wi-Fi. At the end of the day, however, business processes are still on paper because they had to be essentially fail-proof. Technology fails all the time. From that perspective, paper is going to stick around for a very long time. If we’re really going to tackle the challenge of the availability of data, we shouldn’t necessarily be trying to change the technology infrastructure first — bringing mobile phones and iPads to where there’s paper — but really to start with solving the paper problem.”

When Chen saw that data entry was a chokepoint for digitizing health indicators, he started working on developing a better, cheaper way to ingest data on forms.

Seven reasons why I like Spark

Spark is becoming a key part of a big data toolkit.

A large portion of this week’s Amp Camp at UC Berkeley is devoted to an introduction to Spark, an open source, in-memory cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data toolkit. Here’s why:

Hadoop integration: Spark can work with files stored in HDFS, an important feature given the amount of investment in the Hadoop ecosystem. Getting Spark to work with MapR is straightforward.

The Spark interactive shell: Spark is written in Scala, and has its own version of the Scala interpreter. I find this extremely convenient for testing short snippets of code.

The Spark Analytic Suite:


(Figure courtesy of Matei Zaharia)

Spark comes with tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Shark tends to run faster than Hive. If you want to run Spark in the cloud, there is a set of EC2 scripts available.

A grisly job for data scientists

Matching the missing to the dead involves reconciling two national databases.

Javier Reveron went missing from Ohio in 2004. His wallet turned up in New York City, but he was nowhere to be found. By the time his parents arrived to search for him and hand out fliers, his remains had already been buried in an unmarked indigent grave. In New York, where coroner’s resources are precious, remains wait a few months to be claimed before they’re buried by convicts in a potter’s field on uninhabited Hart Island, just off the Bronx in Long Island Sound.

The story, reported by the New York Times last week, has as happy an ending as it could given that beginning. In 2010 Reveron’s parents added him to a national database of missing persons. A month later police in New York matched him to an unidentified body and his remains were disinterred, cremated and given burial ceremonies in Ohio.

Reveron’s ordeal suggests an intriguing, and impactful, machine-learning problem. The Department of Justice maintains separate national, public databases for missing people, unidentified people and unclaimed people. Many records are full of rich data that is almost never a perfect match to data in other databases — hair color entered by a police department might differ from how it’s remembered by a missing person’s family; weights fluctuate; scars appear. Photos are provided for many missing people and some unidentified people, and matching them is difficult. Free-text fields in many entries describe the circumstances under which missing people lived and died; a predilection for hitchhiking could be linked to a death by the side of a road.
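To make the matching problem concrete, here is a minimal sketch of how imperfect records might be scored against each other. It is an illustration only, not anything the DOJ is known to use: the `Record` fields, the 30-pound weight tolerance and the equal weighting of fields are all assumptions, and a real system would learn those weights from resolved cases rather than hard-code them.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    # Hypothetical fields; the real databases' schemas are not described here.
    hair_color: str
    weight_lb: float
    notes: str


def text_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weight_similarity(a: float, b: float, tolerance: float = 30.0) -> float:
    """Score 1.0 for equal weights, falling linearly to 0.0 at `tolerance` lb apart."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)


def match_score(missing: Record, unidentified: Record) -> float:
    """Average the per-field similarities into one 0..1 score (equal weights assumed)."""
    parts = [
        text_similarity(missing.hair_color, unidentified.hair_color),
        weight_similarity(missing.weight_lb, unidentified.weight_lb),
        text_similarity(missing.notes, unidentified.notes),
    ]
    return sum(parts) / len(parts)


def rank_candidates(missing: Record, candidates: list) -> list:
    """Return candidates ordered from most to least similar to the missing record."""
    return sorted(candidates, key=lambda c: match_score(missing, c), reverse=True)
```

Ranking every unidentified record against a missing-person report this way would only produce a shortlist for a human investigator to review. The learning problem is in replacing the hand-picked tolerance and weights with ones fitted to known matches.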

I’ve called the Department of Justice (DOJ) to ask about the extent to which they’ve worked with computer scientists to match missing and unidentified people, and will update when I hear back. One thing that’s not immediately apparent is the public availability of the necessary training set — cases that have been successfully matched and removed from the lists. The DOJ apparently doesn’t comment on resolved cases, which could make getting this data difficult. But perhaps there’s room for a coalition to request the anonymized data and manage it to the DOJ’s satisfaction while distributing it to capable data scientists.

Photo: Missing Person: Ai Weiwei by Daquella manera, on Flickr

Strata Week: Data prospecting with Kaggle

Kaggle now accepting data before a contest, HP's Autonomy purchase comes into focus, Cloudera's new Hadoop distribution.

In this week's data news, Kaggle launches Prospect, HP unveils its big data plans, and Cloudera releases CDH4 (the latest version of its Hadoop distribution).

What it takes to build great machine learning products

Rich machine learning products come from skilled and knowledgeable teams.

Specific insights into a problem and careful model design separate a machine learning system that doesn't work from one that people will actually use.

Strata Week: Machine learning vs domain expertise

Debating the data skills of machines and experts, a key data move for Microsoft, and Google Analytics gets social.

This week's data news includes another look at the Strata Conference's debate about machine learning versus subject matter expertise, Raghu Ramakrishnan moves from Yahoo to Microsoft, and more social data comes to Google Analytics.

The search for a minimum viable record

Open Library's George Oates on the pursuit of concise categorization.

George Oates, the lead from the Open Library, discusses the complexities of biographic data and the possibility for a minimum viable record.

The quiet rise of machine learning

Alasdair Allan on how machine learning is taking over the mainstream.

From Goodreads to Google to Orbitz, machine learning is slowly becoming part of everyday life. Alasdair Allan discusses current uses and how machine learning factors into his own robotic telescope network.

Need faster machine learning? Take a set-oriented approach

How a days-long data process was completed in minutes.

We recently faced the type of big data challenge we expect to become increasingly common: scaling up the performance of a machine learning classifier for a large set of unstructured data. In this post, we explain how a set-oriented approach led to huge performance gains.

Crowdsourcing specific microtasks

Since the first-ever Mechanical Turk meetup a year ago, there has been an explosion in crowdsourcing services and a well-attended conference in San Francisco. I remain enthusiastic about crowdsourcing, but the number of companies has me worried about quality of work. Fortunately specialization is already occurring, so for particular tasks there are companies out there ready to provide high-quality service….
