U.S. opens data, Wong tapped for U.S. chief privacy officer, FBI might read your email sans warrant, and big data spells trouble for anonymity.
U.S. government data to be machine-readable, Nicole Wong may fill new White House chief privacy officer role
The U.S. government took major steps this week to open up government data to the public. U.S. President Obama signed an executive order requiring government data to be made available in machine-readable formats, and the Office of Management and Budget and the Office of Science and Technology Policy released an Open Data Policy memo (PDF) to address the order’s implementation.
The press release announcing the actions notes the benefit the U.S. economy historically has experienced with the release of government data — GPS data, for instance, sparked a flurry of innovation that ultimately contributed “tens of billions of dollars in annual value to the American economy,” according to the release. President Obama noted in a statement that he hopes a similar result will come from this open data order: “Starting today, we’re making even more government data available online, which will help launch even more new startups. And we’re making it easier for people to find the data and use it, so that entrepreneurs can build products and services we haven’t even imagined yet.”
Would you let people know about your dandruff problem if it might mean a cure for Lupus?
Two weeks ago, I had the privilege to attend the 2013 Genomes, Environments and Traits conference in Boston, as a participant in Harvard Medical School’s Personal Genome Project. Several hundred of us attended the conference, eager to learn what new breakthroughs might be in the works using the data and samples we have contributed, and to network with the researchers and each other.
The Personal Genome Project (PGP) is a very different type of beast from the traditional research study model, in several ways. To begin with, it is an Open Consent study, which means that all the data that participants donate is available for research by anyone without further consent by the subject. In other words, now that I have consented to participate in the PGP, anyone can download my genome sequence, look at my phenotypic traits (my physical characteristics and medical history), or even order some of my blood from a cell line that has been established at the Coriell biobank, and they do not need to gain specific consent from me to do so. By contrast, in most research studies, data and samples can only be collected for one specific study, and no other purposes. This is all in an effort to protect the privacy of the participants, as was famously violated in the establishment of the HeLa cell line.
The other big difference is that in most studies, the participants rarely receive any information back from the researchers. For example, if the researcher does a brain MRI to gather data about the structure of a part of your brain, and sees a huge tumor, they are under no obligation to inform you about it, or even to give you a copy of the scan. This is because researchers are not certified as clinical laboratories, and thus are not authorized to report medical findings. This makes sense, to a certain extent, with traditional medical tests, as the research version may not be calibrated to detect the same things, and the researcher is not qualified to interpret the results for medical purposes.
I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!
That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.
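A minimal sketch of the two-headed arrow, using simulated data. The variables a and b and the process that generates them are invented for illustration: the simulation makes A drive B, yet the correlation coefficient comes out the same whichever variable is listed first, so the number alone cannot say which way the arrow points.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)

# Hypothetical data: in this simulation A causes B (B = 2*A + noise).
a = [rng.gauss(0, 1) for _ in range(1000)]
b = [2 * x + rng.gauss(0, 1) for x in a]

r_ab = pearson(a, b)  # "how well does A track B?"
r_ba = pearson(b, a)  # "how well does B track A?"

# The two values agree: correlation carries no direction.
print(r_ab, r_ba)
```

More data would tighten the estimate of the coefficient, but it would not break the symmetry.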
The BBC pulled data from the International Rescue Corps to create an interactive guide to emergency response efforts in a building collapse.
In the wake of recent building collapses, the BBC addressed the question of what goes into the rescue efforts by creating an interactive guide outlining how rescuers approach a collapsed building.
Our tools should make common cases easy and safe, but that's not the reality today.
Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.
Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …
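To make the worry about k-nearest neighbors concrete, here is a small sketch of one well-known pitfall, sensitivity to feature units. Choosing this pitfall is an assumption made for illustration, and the data, labels, and helper names are hypothetical. The same query gets a different label depending on whether income is measured in dollars or in thousands of dollars.

```python
import math


def knn_predict(train, query, k=1):
    """Majority label among the k training points closest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)


def to_thousands(point):
    """Rescale income from dollars to thousands of dollars; age is unchanged."""
    income, age = point
    return (income / 1000, age)


# Hypothetical training points: (income in dollars, age in years) -> label.
train_dollars = [((50_000, 25), "A"), ((50_500, 60), "B")]
query_dollars = (50_400, 27)

train_thousands = [(to_thousands(p), label) for p, label in train_dollars]
query_thousands = to_thousands(query_dollars)

# In dollars, the income gap dominates the distance and the query lands on "B".
pred_dollars = knn_predict(train_dollars, query_dollars)

# In thousands, the age gap dominates and the same query lands on "A".
pred_thousands = knn_predict(train_thousands, query_thousands)

print(pred_dollars, pred_thousands)
```

Nothing in the algorithm flags the change; the tool runs cleanly in both cases and returns a confident answer.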
Fit2Cure taps the public's visual skills to match compounds to targets
In the inspiring tradition of Foldit, the game for determining protein shapes, Fit2Cure crowdsources the problem of finding drugs that can cure the many under-researched diseases of developing countries. Fit2Cure appeals to the player’s visual–even physical–sense of the world, and requires much less background knowledge than Foldit.
There are about 7,000 rare diseases, fewer than 5% of which have cures. The number of people currently engaged in making drug discoveries is by no means adequate to study all these diseases. A recent gift to Harvard shows the importance that medical researchers attach to filling the gap. As an alternative approach, abstracting the drug discovery process into a game could empower thousands, if not millions, of people to contribute to this process and make discoveries in diseases that get little attention from scientists or pharmaceutical companies.
The biological concept behind Fit2Cure is that medicines have specific shapes that fit into the proteins of the victim’s biological structures like jig-saw puzzle pieces (but more rounded). Many cures require finding a drug that has the same jig-saw shape and can fit into the target protein molecule, thus preventing it from functioning normally.
In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.
Our starting point was remote: Q told me about Tyler Brulé’s travel writing for the Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on its club sandwiches, it’s probably cutting corners in other places.
The simplest and quickest way to mine your data is to deploy efficient algorithms designed to answer key questions at scale.
For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks, peak volume is a function of network topology/bandwidth and the throughput of the individual nodes.
Scaling up machine-learning: Find efficient algorithms
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and caching. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while there are algorithms that are embarrassingly parallel (e.g., naive Bayes), many others are harder to decouple. But as I highlighted in a recent post, efficient tools that run on single servers can tackle large data sets. In the machine-learning context, recent examples of efficient algorithms that scale to large data sets can be found in the products of startup SkyTree.
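As a rough illustration of the sampling approach, the sketch below draws a fixed-size uniform sample from a stream in one pass and constant memory, using the classic reservoir sampling technique (Algorithm R). It is a generic example and not the method of any product named above; the function name and parameters are my own.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from an iterable of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# One pass over a million values, holding only five in memory at a time.
sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
print(sample)
```

A model fitted on such a sample can then run on a single machine, which is the point of trying a cheaper algorithm before reaching for a cluster.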
Jon Bruner's industrial Internet report; IBM, Belkin, and the Internet of Things; cars as software platforms; and coding is the job of the future.
Soon, everything will be an Internet platform
Ben Schiller at Fast Company took a look this week at a recent report by Jon Bruner on the industrial Internet. “According to Jon Bruner [the industrial Internet] is ‘machines becoming nodes on pervasive networks that use open protocols,’” writes Schiller. “And, to many others, it is as a big a deal as the Internet itself: essentially completing a job that’s only half-finished with web sites, email, Twitter, and so on.”
Schiller pulls some highlights from Bruner’s report, especially noting how the industrial Internet will affect various industries, such as energy, health care, and transport.
How the field of genetics is using data within research and to evaluate researchers
Editor’s note: Earlier this week, Part 1 of this article described Sage Bionetworks, a recent Congress they held, and their way of promoting data sharing through a challenge.
Data sharing is not an unfamiliar practice in genetics. Plenty of cell lines and other data stores are publicly available from such places as the TCGA data set from the National Cancer Institute, Gene Expression Omnibus (GEO), and ArrayExpress (all of which can be accessed through Synapse). So to some extent the current revolution in sharing lies not in the data itself but in critical related areas.
First, many of the data sets are weakened by metadata problems. A Sage programmer told me that the famous TCGA set is enormous but poorly curated. For instance, different data sets in TCGA may refer to the same drug by different names, generic versus brand name. Provenance–a clear description of how the data was collected and prepared for use–is also weak in TCGA.
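The naming problem is easy to picture in code. The sketch below is an assumption-laden illustration, not a description of how TCGA or Synapse handle it: the synonym table, record layout, and function name are all hypothetical. It maps brand names to generic names before grouping, so records from different data sets that mean the same drug end up together.

```python
from collections import defaultdict

# Illustrative synonym table (brand name -> generic name); not taken from TCGA.
BRAND_TO_GENERIC = {
    "gleevec": "imatinib",
    "herceptin": "trastuzumab",
    "tarceva": "erlotinib",
}


def canonical_drug(name):
    """Normalize case and whitespace, then map a brand name to its generic name."""
    key = name.strip().lower()
    return BRAND_TO_GENERIC.get(key, key)


# Hypothetical records from two data sets that name the same drug differently.
records = [
    {"dataset": "set_1", "sample": "s1", "drug": "Gleevec"},
    {"dataset": "set_2", "sample": "s2", "drug": "imatinib"},
    {"dataset": "set_2", "sample": "s3", "drug": "Herceptin "},
]

# Group samples by canonical drug name so the two data sets can be aggregated.
by_drug = defaultdict(list)
for rec in records:
    by_drug[canonical_drug(rec["drug"])].append(rec["sample"])

print(dict(by_drug))
```

Without the normalization step, "Gleevec" and "imatinib" would be counted as two different treatments, which is exactly the kind of curation gap described above.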
In contrast, GEO records tend to contain good provenance information (see an example), but only as free-form text, which presents the same barriers to searching and aggregation as free-form text in medical records. Synapse is developing a structured format for presenting provenance based on the W3C’s PROV standard. One researcher told me this was the most promising contribution of Synapse toward the shared use of genetic information.
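To show why structure matters for searching, here is a small sketch that records provenance as triples using three relation names from the PROV data model (wasGeneratedBy, used, wasAssociatedWith). It is not Synapse's actual format, which the article does not describe; the entity and activity names and the helper function are hypothetical.

```python
# PROV-style statements as (subject, relation, object) triples.
# All entity, activity, and agent names below are made up for illustration.
statements = [
    ("expression_matrix_v2", "wasGeneratedBy", "normalization_run"),
    ("normalization_run", "used", "raw_array_files"),
    ("normalization_run", "wasAssociatedWith", "lab_pipeline_operator"),
    ("summary_table", "wasGeneratedBy", "aggregation_run"),
    ("aggregation_run", "used", "expression_matrix_v2"),
]


def upstream_entities(entity, statements):
    """Entities that `entity` was derived from, following wasGeneratedBy then used."""
    generated_by = {s: o for s, r, o in statements if r == "wasGeneratedBy"}
    used = {}
    for s, r, o in statements:
        if r == "used":
            used.setdefault(s, []).append(o)

    found, stack = [], [entity]
    while stack:
        current = stack.pop()
        activity = generated_by.get(current)
        for source in used.get(activity, []):
            if source not in found:
                found.append(source)
                stack.append(source)
    return found


lineage = upstream_entities("summary_table", statements)
print(lineage)
```

The same question asked of a free-form text description would require someone to read it; with structured statements it becomes a lookup that can run across thousands of records.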