ENTRIES TAGGED "crowdsourcing"
Companies gain access not just to algorithms, but to models that incorporate ideas generated by teams of data scientists
Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons he started CrowdFlower was that, as a data scientist, he got frustrated with having to create training sets for many of the problems he faced. More recently, companies have been experimenting with active learning (humans take care of uncertain cases, models handle the routine ones). Along those lines, Adam Marcus described in detail how Locu uses crowdsourcing services to perform structured extraction (converting semi/unstructured data into structured data).
Another area where crowdsourcing is popping up is feature engineering and feature discovery. Experienced data scientists will attest that generating features is as important as (if not more important than) the choice of algorithm. Startup CrowdAnalytix uses public/open data sets to help companies enhance their analytic models. The company has access to several thousand data scientists spread across 50 countries and counts a major social network among its customers. Its current focus is on providing “enterprise risk quantification services to Fortune 1000 companies”.
CrowdAnalytix breaks projects into two phases: feature engineering and modeling. During the feature engineering phase, data scientists are presented with a problem (the dependent variable to predict) and are asked to propose features (predictors) along with brief explanations for why they might prove useful. A panel of judges evaluates features based on the accompanying evidence and explanations. Typically 100+ teams enter this phase of the project, and 30+ teams propose reasonable features.
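The judging step described above amounts to scoring many competing feature proposals and keeping the best. A minimal sketch of that idea follows; the feature names, scores, and the averaging rule are all hypothetical illustrations, not CrowdAnalytix's actual process:

```python
from statistics import mean

def rank_feature_proposals(proposals):
    # Rank proposed features by their average judge score, highest first.
    # `proposals` maps a feature name to the list of scores it received
    # from the judging panel.
    return sorted(proposals, key=lambda f: mean(proposals[f]), reverse=True)

# Hypothetical feature proposals and panel scores, for illustration only.
scores = {
    "days_since_last_purchase": [8, 9, 7],
    "zip_code_median_income": [6, 7, 5],
    "account_age_in_months": [9, 8, 9],
}
ranking = rank_feature_proposals(scores)
```

In practice a panel would weigh the written evidence and explanations, not just numeric scores, but a ranking like this captures the winnowing from 100+ entering teams down to the 30+ whose features survive.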
As companies continue to use crowdsourcing, demand for people who know how to manage projects remains steady
A little over four years ago, I attended the first crowdsourcing meetup at the offices of CrowdFlower (then called Dolores Labs). The crowdsourcing community has grown explosively since that initial gathering, and there are now conference tracks and entire conferences devoted to this important industry. At the recent CrowdConf, I found a community of professionals who specialize in managing a wide array of crowdsourcing projects.
Data scientists were early users of crowdsourcing services. I personally am most familiar with a common use case: the use of crowdsourcing to create labeled data sets for training machine-learning models. But as straightforward as it sounds, using crowdsourcing to generate training sets can be tricky; fortunately, there are excellent papers and talks on this topic. At the most basic level, before embarking on a crowdsourcing project you should go through a simple checklist (among other things, make sure you have enough scale to justify engaging with a provider).
Beyond building training sets for machine learning, crowdsourcing is more recently being used to enhance the results of machine-learning models: in active learning, humans take care of uncertain cases while models handle the routine ones. The use of reCAPTCHA to digitize books is an example of this approach. On the flip side, analytics are being used to predict the outcome of crowd-based initiatives: researchers have developed models to predict the success of Kickstarter campaigns just four hours after launch.
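The active-learning division of labor described above boils down to a routing rule: items the model is unsure about go to human workers, and the rest are handled automatically. A minimal sketch, where the confidence function and threshold are hypothetical placeholders for a real model's scores:

```python
def route_items(items, model_confidence, threshold=0.75):
    # Split work between humans and the model: items scoring below the
    # confidence threshold go to human workers, the rest stay automated.
    to_humans, to_model = [], []
    for item in items:
        if model_confidence(item) < threshold:
            to_humans.append(item)
        else:
            to_model.append(item)
    return to_humans, to_model

# Toy confidence function for illustration: pretend longer strings are
# "easier" for the model. A real system would use the model's own
# predicted-probability or margin scores.
conf = lambda s: min(1.0, len(s) / 10)
humans, model = route_items(["ok", "ambiguous text", "hmm"], conf)
```

In a full active-learning loop, the human labels collected this way would be fed back to retrain the model, gradually shrinking the uncertain region.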
A doctor looks to software communities as inspiration for her own research
(The following article sprang from a collaboration between Andy Oram and Brigitte Piniewski to cover open source concepts in an upcoming book on health care. This book, titled “Wireless Health: Remaking of Medicine by Pervasive Technologies,” is edited by Professor Mehran Mehregany of Case Western Reserve University and has an expected release date of February 2013. It is designed to provide the reader with the fundamental and practical knowledge necessary for an overall grasp of the field of wireless health. The approach is an integrated, multidisciplinary treatment of the subject by a team of leading topic experts. The selection here is part of a larger chapter by Brigitte Piniewski about personalized medicine and public health.)
Medical research and open source software have much to learn from each other. As software transforms the practice and delivery of medicine, the communities and development methods that have grown up around software, particularly free and open source software, also provide models that doctors and researchers can apply to their own work. Some of the principles that software communities can offer for spreading health throughout the population include these:
Like a living species, software evolves as code is updated and functionality is improved.
Software of low utility is dropped as users select better tools and drive forward functionality to meet new use cases.
Open source culture demonstrates how a transparent approach to sharing software practices enables problem areas to be identified and corrected accurately, cost-effectively, and at the pace of change.
Can open data dominate biological science the way open source has dominated software?
To move from a hothouse environment of experimentation to the mainstream of one of the world's most lucrative and tradition-bound industries, Sage Bionetworks must aim for its nucleus: rewards and incentives. This piece offers comparisons to open source software and a summary of tasks for Sage Congress.
The Vioxx problem is just one instance of the wider malaise afflicting the drug industry. Managers from major pharma companies expressed confidence that they could expand public or "pre-competitive" research in the direction Sage Congress proposed. The sector left to engage is the one that's central to all this work: the public.
Report from a movement that believes in open source and open data in science
Through two days of demos, keynotes, panels, and breakout sessions, Sage Congress brought its vision to a high-level cohort of 230 attendees from universities, pharmaceutical companies, government health agencies, and others who can make change in the field.
The second in a series looking at the major themes of this year's TOC conference.
Several overriding themes permeated this year’s Tools of Change for Publishing conference. In this second installment of a series looking at five of the major themes, we take a look at data in publishing: how publishers can benefit, practical applications, and innovative ways it can be used.
Panagiotis Ipeirotis on the vagaries of semantic analysis and Mechanical Turk's quirks.
In a recent interview, NYU Professor Panagiotis Ipeirotis explained why a "good" online review is often perceived negatively. He also discussed Mechanical Turk's growing pains.
Fold.it users make a scientific breakthrough, Twitter open sources a real-time processing tool, and Google faces a Senate hearing.
In this week's data news: Fold.it gamers help with HIV research, Twitter eyes data analytics, and Google testifies before the Senate.
MapReduce crunches a million-song dataset, GPS and accident reconstruction, and WWI crowdsourcing.
This week's data stories include a guide to using MapReduce to process the Million Song Dataset, a story about how GPS data can help reconstruct lost memories (and accidents), and evidence that emergency crowdsourcing goes back further than many realize.