MIT workshop kicks off Obama campaign on privacy
Thrust into controversy by Edward Snowden’s first revelations last year, President Obama belatedly welcomed a “conversation” about privacy. As cynical as you may feel about US spying, that conversation with the federal government has now begun. In particular, the first of three public workshops took place Monday at MIT.
Given the locale, a focus on the technical aspects of privacy was appropriate for this discussion. Speakers extolled the value of data (often invoking the “big data” buzzword), delineated the trade-offs between accumulating useful data and preserving privacy, and introduced technologies that could analyze encrypted data without revealing facts about individuals. Two more workshops will be held in other cities, one focusing on ethics and the other on law.
By David Andrzejewski of SumoLogic
A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions; two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data which is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.
More data, more problems
This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the decoupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data”. In particular, two common motifs among the talks at Strata were the difficulties around:
- mechanics: the technical details of data collection, storage, and analysis
- semantics: extracting understandable and actionable information from the data deluge
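The semantics problem is easy to underestimate: a raw log line is just a string until something turns it into named, typed fields you can query and aggregate. As a minimal sketch (the log format, field names, and values here are hypothetical, not from any particular system):

```python
# Turning one machine-generated log line into structured, actionable fields.
import re

LOG_PATTERN = re.compile(
    r'(?P<timestamp>\S+) (?P<host>\S+) (?P<level>[A-Z]+) '
    r'latency_ms=(?P<latency_ms>\d+)'
)

def parse_line(line):
    """Extract structured fields from a log line, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields['latency_ms'] = int(fields['latency_ms'])  # give the field a real type
    return fields

record = parse_line('2014-03-10T12:00:01Z web-07 WARN latency_ms=842')
print(record['host'], record['latency_ms'])  # → web-07 842
```

Multiply this by thousands of distinct, evolving log formats and the scale implied by Moore’s Law, and both motifs above come into focus.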
How do we motivate sustained behavior change when the external motivation disappears—like it's supposed to?
If you’ve ever tried to count calories, go on a diet, start a new exercise program, change your sleep patterns, spend less time sitting, or make any other type of positive health change, then you know how difficult it is to form new habits. New habits usually require a bit of willpower to get going, and we all know that that’s a scarce resource. (Or at least, a limited one.)
Change is hard. But the real challenge comes after you’ve got a new routine going—because now you’ve got to keep it going, even though your original motivations to change may no longer apply. Why keep dieting when you no longer need to lose weight? We’ve all had the idea at some point that we really should reward ourselves for that five-pound weight loss with a cupcake, right?
This phenomenon of disappearing external motivation can turn long-term behavior change into a Herculean task. And when it comes to taking medication, such a major obstacle to adherence can become very dangerous.
Taking pills (or injections, or whatever) every day is a pain, sometimes literally. Most of us only do it when the consequences of not taking our medication are more painful. But if the medication is doing its job, then the symptoms that motivated adherence in the first place will vanish. The patient will start feeling fine—maybe even fine enough that their biggest problem is now the medication’s side effects—and the motivation to stick to the burdensome routine will wane.
If you’re on long-term medication for something like diabetes, liver or kidney disease, or mental health disorders, then a slip in adherence—let alone quitting cold turkey, as many people unilaterally decide to do—can spell disaster. The challenge of sustained behavior change is real, and it is significant. And many doctors are still scratching their heads over what to do about it.
While I was in California for the O’Reilly Strata Conference last month, I organized a meetup of healthcare professionals and data geeks to talk about some issues of common interest. This question of adherence came up there, too, and one of the attendees threw out a half-joking suggestion that perhaps long-term medication should come with placebos at random intervals—the idea being that if symptoms occasionally returned, subtly and without warning, then patients would have intermittent “reminders” of why their medication was so vital.
Leaving aside the significant health and ethics concerns about such an approach, the attendee wondered, would that kind of negative reinforcement even work? Our guest speaker, Julia Hu, explained that negative reinforcements of that kind can work well for behavior change in the short to medium term, but are not all that effective over the long term. Adherence is harder than that.
You know who’s already very familiar with this challenge? Alcoholics and other addicts. The need to abstain is readily apparent when you’re at the bottom of a well of your own making. But after you’ve climbed out of that abyss and been sober for three, five, ten years, it can be very tempting to “have just one.” Successful education programs have taught us all that, for someone with an addiction, there’s no such thing as “just one,” and that avoiding that temptation requires long-term support.
I think the issue here may be one of comfort. It is discomfort that motivates us to take our medication (or make some other kind of health behavior change), and when that discomfort goes away, our motivation can go away too. We need to be reminded of that discomfort, relieved of the illusion that our current comfort is a given.
I believe in the power of technology to help us with many healthcare challenges. But on the subject of behavior change, the apps I see are all aimed at increasing our comfort: awarding us points, creating mechanisms to help us brag on social media, or even providing financial incentives. Call it “gamification” if you like: there are dozens upon dozens of apps aimed at the kinds of behavior change I mentioned at the top of this post, and most of them are downloaded, eagerly adopted, and then left by the wayside within two weeks.
We need to do better. We need to design better. We need to dig creatively into new ways to rise to the challenge of sustaining behavior change. Maybe it’s time to start experimenting with apps that make us less comfortable.
Change is hard. Adherence is harder. But the most worthwhile challenges always are.
In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself. The same is true for people deciding if machine learning is a good choice for their business goals or project. Will the payoff be worth the effort? What machine learning approach is most likely to produce valuable results for your particular situation? What size team with what expertise is necessary to be able to develop, deploy, and maintain your machine learning system?
Given the complex and previously esoteric nature of machine learning as a field – the sometimes daunting array of learning algorithms and the math needed to understand and employ them – many people feel the topic is one best left only to the few.
It's easier to "discover" features with tools that have broad coverage of the data science workflow
Interface languages: Python, R, SQL (and Scala)
This is a great time to be a data scientist or data engineer who relies on Python or R. For starters, there are developer tools that simplify setup and package installation, and that provide user interfaces designed to boost productivity (RStudio, Continuum, Enthought, Sense).
Increasingly, Python and R users can write the same code and run it against many different execution engines. Over time the interface languages will remain constant but the execution engines will evolve or even be replaced. Specifically, there are now many tools that target Python and R users interested in implementations of algorithms that scale to large data sets (e.g., GraphLab, wise.io, Adatao, H2O, Skytree, Revolution R). Interfaces for popular engines like Hadoop and Apache Spark are also available – PySpark users can access algorithms in MLlib, and SparkR users can use existing R packages.
In addition, many of these new frameworks go out of their way to ease the transition for Python and R users. wise.io notes that its “… bindings follow the Scikit-Learn conventions”, and as I noted in a recent post, with SFrames and Notebooks GraphLab, Inc. built components that are easy for Python users to learn.
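The Scikit-Learn conventions these frameworks mimic are simple but powerful: every estimator exposes the same `fit()`/`predict()` interface, so swapping one model for another leaves the surrounding code unchanged. A toy sketch with two standard scikit-learn estimators (the data here is illustrative only):

```python
# The scikit-learn estimator convention: identical fit/predict interfaces.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [1.0], [2.0], [3.0]]   # toy feature matrix
y = [0, 0, 1, 1]                   # toy labels

for model in (LogisticRegression(), DecisionTreeClassifier()):
    model.fit(X, y)                # same call for every estimator
    print(model.predict([[2.5]]))  # same call for every estimator
```

A library that follows these conventions lets Python users reuse their existing habits (and even existing pipeline code) while the execution engine underneath changes.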
Other industries can show health care the way
This article was written with Ellen M. Martin.
Most healthcare clinicians don’t often think about donating or sharing data. Yet, after hearing Stephen Friend of Sage Bionetworks talk about involving citizens and patients in the field of genetic research at StrataRx 2012, I was curious to learn more.
McKinsey points to $300 billion in potential savings from using open data in healthcare, while a recent IBM Institute for Business Value study showed the need for corporate data collaboration.
Also, during my own research for Big Data in Healthcare: Hype and Hope, the resounding request from all the participants I interviewed was to “find more data streams to analyze.”
The popular graph analytics framework extends its coverage of the data science workflow
GraphLab’s SFrame, an interesting and somewhat under-the-radar tool, was unveiled at Strata Santa Clara. It is a disk-based, flat-table representation that extends GraphLab to tabular data. With the addition of SFrame, users can leverage GraphLab’s many algorithms on data stored as either graphs or tables. More importantly, SFrame increases GraphLab’s coverage of the data science workflow: it allows users with terabyte-sized datasets to clean their data and create new features directly within GraphLab (SFrame performance can scale linearly with the number of available cores).
The beta version of SFrame can read data from local disk, HDFS, S3 or a URL, and save to a human-readable .csv or a more efficient native format. Once an SFrame is created and saved to disk no reprocessing of the data is needed. Below is Python code that illustrates how to read a .csv file into SFrame, create a new data feature and save it to disk on S3:
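A minimal sketch of that workflow, assuming the GraphLab Create package is installed (the file name and column names below are hypothetical, for illustration only):

```python
# Read a CSV into an SFrame, derive a new feature, and save to S3.
import graphlab as gl

# Read a CSV file (local disk, HDFS, S3, or a URL also work)
sf = gl.SFrame.read_csv('flights.csv')

# Create a new feature from existing columns
sf['speed_mph'] = sf['distance_miles'] / (sf['air_time_minutes'] / 60.0)

# Save to S3 in SFrame's efficient native format; once saved,
# no reprocessing is needed to load it again
sf.save('s3://my-bucket/flights.sframe')
```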
Hardcore Data Science speakers provided many practical suggestions and tips
One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and make the video of the talks available as a bundle. In the meantime here are some notes and highlights from a day packed with great talks.
We’ve come to think of analytics as being composed primarily of data and algorithms. Once data has been collected, “wrangled”, and stored, algorithms are unleashed to unlock its value. Longtime machine-learning researcher Alice Zheng of GraphLab reminded attendees that data structures are critical to scaling machine-learning algorithms. Unfortunately, there is a disconnect between machine-learning research and implementation (so much so that some recent advances in large-scale ML are “rediscoveries” of known data structures):
While there are many data structures that arise in computer science, Alice devoted her talk to two data structures that are widely used in machine learning: