The glowing reports we read of biotech advances almost cause one’s brain to ache. They leave us thinking that medical researchers must command the latest in all technological tools. But the engines of genetic and pharmaceutical innovation are stuttering for lack of one key fuel: data. When it comes to data, researchers are left with the equivalent of trying to build skyscrapers with lathes and screwdrivers.
Sage Congress, held this past week in San Francisco, investigated the multiple facets of data in these fields: gene sequences, models for finding pathways, patient behavior and symptoms (known as phenotypic data), and code to process all these inputs. A survey of efforts by the organizers, Sage Bionetworks, and other innovations in genetic data handling can show how genetics resembles and differs from other disciplines.
An intense lesson in code sharing
At last year’s Congress, Sage announced a challenge, together with the DREAM project, intended to galvanize researchers in genetics while showing off the growing capabilities of Sage’s Synapse platform. Synapse ties together a number of data sets in genetics and provides tools for researchers to upload new data and to search other researchers’ data sets. The challenge highlighted the industry’s need for better data sharing, and some ways to get there.
The Sage Bionetworks/DREAM Breast Cancer Prognosis Challenge was cleverly designed to demonstrate both Synapse’s capabilities and the value of sharing. The goal was to find a better way to predict the chances of survival among victims of breast cancer. This is done through computational models that search for patterns in genetic material.
To participate, competing teams had to upload models to Synapse, where they were immediately evaluated against a set of test data and ranked by their success in predicting outcomes. Each team could go online at any time to see who was ahead and examine the code used by the front-runners. Thus, teams could benefit from their competitors’ work. The value of Synapse as a cloud service was also manifest. The process is reminiscent of the collaboration among teams to solve the Netflix prediction challenge.
Although this ability to steal freely from competing teams would seem to be a disincentive to participation, more than 1400 models were submitted. The winning model was chosen by testing the front-runners against another data set, assembled by a different research team in a different time and place. It seems to work better than existing models, although it will still have to be tested in practice.
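The evaluate-and-rank loop described above can be sketched in a few lines of Python. This is only an illustration, not Synapse’s code: the function names, the team names, and the tiny data set are all hypothetical, and the scoring metric (a simple concordance measure that ignores censored patients) is an assumption, since nothing here states which metric the challenge used.

```python
from itertools import combinations


def concordance(risk_scores, survival_times):
    """Fraction of comparable patient pairs that the predicted risk orders correctly.

    Assumed metric for illustration; censoring is ignored to keep the sketch short.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(survival_times)), 2):
        if survival_times[i] == survival_times[j]:
            continue  # tied outcomes carry no ordering information
        comparable += 1
        # Identify which patient in the pair had the shorter survival.
        shorter, longer = (i, j) if survival_times[i] < survival_times[j] else (j, i)
        if risk_scores[shorter] > risk_scores[longer]:
            concordant += 1.0  # higher risk matched the shorter survival
        elif risk_scores[shorter] == risk_scores[longer]:
            concordant += 0.5  # tied predictions count as half credit
    return concordant / comparable if comparable else 0.0


def leaderboard(submissions, survival_times):
    """Score every submission on the shared test data and rank best to worst."""
    scored = [
        (team, concordance(scores, survival_times))
        for team, scores in submissions.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical test data: survival in months for four patients.
test_survival = [12, 30, 45, 60]

# Hypothetical submissions: each team's predicted risk score per patient.
submissions = {
    "team_a": [0.9, 0.6, 0.4, 0.1],
    "team_b": [0.2, 0.8, 0.5, 0.3],
}

ranking = leaderboard(submissions, test_survival)
for team, score in ranking:
    print(f"{team}: {score:.2f}")
```

Because every submission is scored against the same test data, any team can see at a glance where it stands, which is what made studying the front-runners’ code worthwhile.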
The winner’s prize was a gold coin in the currency recognized by researchers: publication in the prestigious journal Science Translational Medicine, which agreed in advance to recognize the competition as proof of the value of the work (although the article also went through traditional peer review). Supplementary materials were also posted online to fulfill the Sage mission of promoting reproducibility as well as reuse in new experiments.
Synapse as a platform
Synapse is a cloud-based service, but is open source so that any organization can store its own data on servers of its choice and provide Synapse-like access. This is important because genetic data sets tend to be huge, and therefore hard to copy. On its own cloud servers Synapse stores metadata, such as data annotations and provenance information, on data objects that can be located anywhere. This allows organizations to store data on their own servers, while still using the Synapse services. Of course, because Synapse is open source, an organization could also choose to create its own instance, but this would eliminate some of the cross-fertilization across people and projects that has made the code-hosting site GitHub so successful.
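To make the split between central metadata and remotely stored data concrete, here is a minimal Python sketch. It does not use or reflect Synapse’s actual API: the `DataRecord` and `MetadataCatalog` classes, their fields, and the example URLs are all hypothetical, chosen only to show a catalog that holds annotations and provenance while the data itself stays wherever its owner keeps it.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    """Metadata about one data object; the data itself is not stored here."""
    record_id: str
    location: str  # where the owner hosts the data (hypothetical URL)
    annotations: dict = field(default_factory=dict)
    derived_from: list = field(default_factory=list)  # provenance: parent record ids


class MetadataCatalog:
    """Central index of annotations and provenance for data stored elsewhere."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.record_id] = record

    def search(self, **criteria):
        """Return records whose annotations match every given key/value pair."""
        return [
            r for r in self._records.values()
            if all(r.annotations.get(k) == v for k, v in criteria.items())
        ]

    def lineage(self, record_id):
        """Follow provenance links back to the records this one was derived from."""
        seen = []
        stack = list(self._records[record_id].derived_from)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                stack.extend(self._records[parent].derived_from)
        return seen


catalog = MetadataCatalog()
catalog.register(DataRecord(
    record_id="raw",
    location="https://data.example.org/expression/raw.tsv",
    annotations={"disease": "breast cancer", "stage": "raw"},
))
catalog.register(DataRecord(
    record_id="normalized",
    location="https://lab.example.edu/expression/normalized.tsv",
    annotations={"disease": "breast cancer", "stage": "normalized"},
    derived_from=["raw"],
))

matches = [r.record_id for r in catalog.search(disease="breast cancer")]
ancestors = catalog.lineage("normalized")
```

In this sketch a search touches only the small metadata records, so the huge data files never have to be copied to the central service.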
Sage rents space on Amazon Web Services, so it looks for AWS solutions, such as DynamoDB for its non-relational storage area, to fashion each element of Synapse’s solution. More detail about Synapse’s purpose and goals can be found in my report from last year’s Congress.
A follow-up to this posting will summarize and compare some ways that the field of genetics is sharing data, and how it is being used both within research and to measure the researchers’ own value.