ENTRIES TAGGED "crowdsource"
Companies gain access to more than algorithms: they get models that incorporate ideas generated by teams of data scientists
Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons he started CrowdFlower was that, as a data scientist, he grew frustrated with having to create training sets for many of the problems he faced. More recently, companies have been experimenting with active learning (humans take care of uncertain cases, models handle the routine ones). Along those lines, Adam Marcus described in detail how Locu uses crowdsourcing services to perform structured extraction (converting semi-structured and unstructured data into structured data).
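The division of labor behind active learning can be sketched in a few lines. This is a hypothetical illustration, not any company's actual pipeline: `classify` is a stand-in for a trained model that returns a label and a confidence score, and the 0.8 cutoff is an assumed threshold you would tune per project.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune per project

def classify(item):
    """Stand-in model: returns (label, confidence).

    In practice this would be a trained classifier (e.g. something
    exposing predict_proba); here a precomputed score simulates it.
    """
    score = item.get("score", 0.5)
    label = "positive" if score >= 0.5 else "negative"
    confidence = max(score, 1 - score)
    return label, confidence

def route(items, threshold=CONFIDENCE_THRESHOLD):
    """Split items into auto-labeled results and a human review queue."""
    auto, for_humans = [], []
    for item in items:
        label, confidence = classify(item)
        if confidence >= threshold:
            auto.append((item["id"], label))   # model handles routine cases
        else:
            for_humans.append(item["id"])      # uncertain cases go to the crowd
    return auto, for_humans

items = [{"id": 1, "score": 0.95}, {"id": 2, "score": 0.55}, {"id": 3, "score": 0.1}]
auto, queue = route(items)
```

Here items 1 and 3 are confident enough to auto-label, while item 2 (confidence 0.55) is queued for human annotators; the human labels would then feed back into the next round of training.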
Another area where crowdsourcing is popping up is feature engineering and feature discovery. Experienced data scientists will attest that generating features is as important as (if not more important than) the choice of algorithm. Startup CrowdAnalytix uses public/open data sets to help companies enhance their analytic models. The company has access to several thousand data scientists spread across 50 countries and counts a major social network among its customers. Its current focus is on providing “enterprise risk quantification services to Fortune 1000 companies”.
CrowdAnalytix breaks projects into two phases: feature engineering and modeling. During the feature engineering phase, data scientists are presented with a problem (a dependent variable to predict) and are asked to propose features (predictors), along with brief explanations for why they might prove useful. A panel of judges evaluates features based on the accompanying evidence and explanations. Typically 100+ teams enter this phase of a project, and 30+ teams propose reasonable features.
As companies continue to use crowdsourcing, demand for people who know how to manage such projects remains steady
A little over four years ago, I attended the first Crowdsourcing meetup at the offices of CrowdFlower (then called Dolores Labs). The crowdsourcing community has grown explosively since that initial gathering, and there are now conference tracks and entire conferences devoted to this important industry. At the recent CrowdConf, I found a community of professionals who specialize in managing a wide array of crowdsourcing projects.
Data scientists were early users of crowdsourcing services. I am personally most familiar with a common use case – using crowdsourcing to create labeled data sets for training machine-learning models. But as straightforward as it sounds, using crowdsourcing to generate training sets can be tricky – fortunately, there are excellent papers and talks on the topic. At the most basic level, before embarking on a crowdsourcing project you should run through a simple checklist (among other things, make sure you have enough scale to justify engaging with a provider).
Beyond building training sets for machine learning, crowdsourcing is more recently being used to enhance the results of machine-learning models: in active learning, humans take care of uncertain cases while models handle the routine ones. The use of reCAPTCHA to digitize books is an example of this approach. On the flip side, analytics are being used to predict the outcome of crowd-based initiatives: researchers have developed models that predict the success of Kickstarter campaigns just four hours after their launch.
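The cited research isn't reproduced here, but the general shape of such an early-warning model can be sketched as a logistic function over a few early signals. Everything in this toy example (the signal names, the weights, the bias) is illustrative, not taken from the actual study, which would fit its parameters to historical campaign data.

```python
import math

def predict_success_probability(pledged_4h, goal, backers_4h,
                                w_funding=6.0, w_backers=0.05, bias=-2.0):
    """Toy logistic model over hypothetical signals observed four hours
    after launch; the weights are illustrative, not fitted to data."""
    funding_ratio = pledged_4h / goal          # fraction of goal raised so far
    z = bias + w_funding * funding_ratio + w_backers * backers_4h
    return 1 / (1 + math.exp(-z))              # squash score into (0, 1)

# A fast-starting campaign should score higher than a slow one.
p_strong = predict_success_probability(pledged_4h=500, goal=1000, backers_4h=40)
p_weak = predict_success_probability(pledged_4h=10, goal=1000, backers_4h=2)
```

The design choice worth noting is that only signals available within the first few hours appear as inputs, which is what makes prediction at the four-hour mark possible at all.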
With a new mobile app and API, Captricity wants to build a better bridge between analog and digital.
Unlocking data from paper forms is the problem that optical character recognition (OCR) software is supposed to solve. Two issues persist, however. First, the hardware and software involved are expensive, which creates challenges for cash-strapped nonprofits and governments. Second, all of the information on a given document is scanned into a system, including sensitive details like Social Security numbers and other personally identifiable information. This is a particularly difficult issue for health care and for bringing open government to the courts: privacy by obscurity will no longer apply.
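One way to mitigate the second issue is to mask known-sensitive patterns before scanned text leaves the system. A minimal sketch, assuming US-style Social Security numbers written as ddd-dd-dddd (a real pipeline would need broader patterns and field-level access controls):

```python
import re

# Illustrative pattern: SSNs in the common hyphenated form.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssns(text, replacement="XXX-XX-XXXX"):
    """Replace anything matching the SSN pattern with a placeholder."""
    return SSN_PATTERN.sub(replacement, text)

record = "Name: Jane Doe, SSN: 123-45-6789, Visit date: 2013-04-02"
masked = mask_ssns(record)
# → "Name: Jane Doe, SSN: XXX-XX-XXXX, Visit date: 2013-04-02"
```

The word boundaries (`\b`) keep the pattern from firing inside longer digit runs such as the date, but a production redactor would also have to handle unhyphenated SSNs and other classes of identifiers.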
The process of converting paper forms into structured data still hasn't been significantly disrupted by the rapid growth of the Internet, distributed computing, and mobile devices. Fields ranging from research science to medicine, law, education, consumer finance, and government all need better, cheaper bridges from the analog to the digital sphere.
“I was looking at the information systems that were available to these low-resource organizations,” Chen said in a recent phone interview. “I saw that they’re very much bound in paper. There’s actually a lot of efforts to modernize the infrastructure and put in mobile phones. Now that there’s mobile connectivity, you can run a health clinic on solar panels and long distance Wi-Fi. At the end of the day, however, business processes are still on paper because they had to be essentially fail-proof. Technology fails all the time. From that perspective, paper is going to stick around for a very long time. If we’re really going to tackle the challenge of the availability of data, we shouldn’t necessarily be trying to change the technology infrastructure first — bringing mobile phones and iPads to where there’s paper — but really to start with solving the paper problem.”
When Chen saw that data entry was a chokepoint for digitizing health indicators, he started working on a better, cheaper way to ingest data from forms.