Why you can’t really anonymize your data

It's time to accept and work within the limits of data anonymization.

One of the joys of the last few years has been the flood of real-world datasets being released by all sorts of organizations. These usually involve some record of individuals’ activities, so to assuage privacy fears, the distributors will claim that any personally identifiable information (PII) has been stripped. The idea is that this makes it impossible to match any record with the person it’s recording.

Something that my friend Arvind Narayanan has taught me, both with theoretical papers and repeated practical demonstrations, is that this anonymization process is an illusion. Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records. Arvind first demonstrated this when he and fellow researcher Vitaly Shmatikov took the “anonymous” dataset released as part of the first Netflix prize and showed how they could correlate the movie rentals it listed with public IMDB reviews. That let them identify some named individuals, and then gave access to their complete rental histories. More recently, he and his collaborators used the same approach to win a Kaggle contest, matching the topology of an anonymized social graph from Flickr against a publicly crawled version of the same network. They were able to take two partial social graphs and, like piecing together a jigsaw puzzle, figure out which fragments matched and represented the same users in both.
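The core of a linkage attack like this can be sketched in a few lines: score each “anonymous” record against every public profile by how many items they share, and claim a match when the overlap is strong enough. This is an illustrative toy, not Narayanan’s actual algorithm, and every name and title in it is invented:

```python
# Toy linkage attack: match an "anonymized" rental history against
# public profiles by overlapping items. All data here is made up.

def similarity(anon_items, public_items):
    """Jaccard similarity between two sets of rated items."""
    union = len(anon_items | public_items)
    return len(anon_items & public_items) / union if union else 0.0

def best_match(anon_record, public_profiles, threshold=0.5):
    """Return the public identity whose items best match the
    anonymous record, if the match is confident enough."""
    score, name = max((similarity(anon_record, items), name)
                      for name, items in public_profiles.items())
    return name if score >= threshold else None

# An "anonymous" rental history and a few public IMDB-style profiles.
anon = {"Brazil", "Memento", "Primer", "Solaris"}
public = {
    "alice": {"Brazil", "Memento", "Primer", "Gattaca"},
    "bob":   {"Titanic", "Avatar"},
}
print(best_match(anon, public))  # -> alice
```

The real papers handle noisy, approximate matches (ratings that differ slightly, dates that are off by days), but the principle is the same: a handful of unusual items is enough to act as a fingerprint.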

All the known examples of this type of identification are from the research world — no commercial or malicious uses have yet come to light — but they prove that anonymization is not an absolute protection. In fact, it creates a false sense of security. Any dataset that has enough information on people to be interesting to researchers also has enough information to be de-anonymized. This is important because I want to see our tools applied to problems that really matter in areas like health and crime. This means releasing detailed datasets on those areas to researchers, and those are bound to contain data more sensitive than movie rentals or photo logs. If just one of those sets is de-anonymized and causes a user backlash, we’ll lose access to all of them.

So, what should we do? Accepting that anonymization is not a complete solution doesn’t mean giving up, it just means we have to be smarter about our data releases. Below I outline four suggestions.


Keep the anonymization

Just because it’s not totally reliable, don’t stop stripping out PII. It’s a good first step, and makes the reconstruction process much harder for any attacker.
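As a minimal sketch of that first step, stripping is just dropping the obviously identifying fields before release. The field names below are hypothetical:

```python
# Minimal PII-stripping sketch. Field names are hypothetical, and
# note that what survives (zip, rentals) can still act as a
# quasi-identifier -- stripping raises the attacker's cost, it
# doesn't eliminate the risk.
PII_FIELDS = {"name", "email", "phone", "ssn", "address"}

def strip_pii(record):
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

record = {"name": "Jane Doe", "email": "jane@example.com",
          "zip": "02139", "rentals": ["Brazil", "Memento"]}
print(strip_pii(record))
# -> {'zip': '02139', 'rentals': ['Brazil', 'Memento']}
```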

Acknowledge there’s a risk of de-anonymization

Don’t make false promises to users about how anonymous their data is. Make the case to them that you’re minimizing the risk and possible harm of any data leaks, sell them on the benefits (either for themselves or the wider world) and get their permission to go ahead. This is a painful slog, but the more organizations that take this approach, the easier it will be. A great model is Reddit, which asked its users to opt in to sharing their data, and got a great response.

Limit the detail

Look at the records you’re getting ready to open up to the world, and imagine that they can be linked back to named people. Are there parts of it that are more sensitive than others, and maybe less important to the sort of applications you have in mind? Can you aggregate multiple people together into cohorts that represent the average behavior of small groups?
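One concrete way to do the aggregation: release only per-cohort averages, and suppress any cohort smaller than some floor k, the same intuition behind k-anonymity. A minimal sketch with invented field names and data:

```python
# Cohort aggregation sketch: replace individual records with group
# averages, and suppress cohorts smaller than k so no release
# describes just one or two people. Field names and data invented.
from collections import defaultdict

def cohort_averages(records, key, value, k=5):
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[value])
    return {g: sum(vals) / len(vals)
            for g, vals in groups.items() if len(vals) >= k}

records = ([{"age_band": "30-39", "spend": 100 + i} for i in range(6)]
           + [{"age_band": "90-99", "spend": 500}])  # cohort of one
print(cohort_averages(records, "age_band", "spend", k=5))
# -> {'30-39': 102.5}; the singleton 90-99 cohort is suppressed
```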

Learn from the experts

There are many decades of experience in dealing with highly sensitive and personal data in sociology and economics departments across the globe. They’ve developed techniques that could prove useful to the emerging community of data scientists, such as subtle distortions of the information to prevent identification of individuals, or even the sort of locked-down clean-room conditions that are required to access detailed IRS data.
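One of those distortion techniques can be sketched briefly: add calibrated random noise to released counts so that no single individual’s presence or absence is revealed. This is the Laplace mechanism from the differential privacy literature, shown in toy form as one example of the idea, not a claim about what any particular agency actually uses:

```python
# Toy Laplace mechanism: perturb a released count with noise whose
# scale is sensitivity/epsilon, so any one person's record could
# plausibly be absent. Parameter values here are illustrative.
import random

def noisy_count(true_count, epsilon=0.5, sensitivity=1):
    scale = sensitivity / epsilon
    # A Laplace sample is the difference of two exponential samples.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

released = noisy_count(1423)  # close to 1423, but deniably so
```

Smaller epsilon means more noise and stronger privacy; the art is picking a value that still leaves the statistics useful.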

There’s so much good that can be accomplished using open datasets, it would be a tragedy if we let this slip through our fingers with preventable errors. With a bit of care up front, and an acknowledgement of the challenges we face, I really believe we can deliver concrete benefits without destroying people’s privacy.


  • D. Kellus Pruitt

    Actually, I think dental records could be safely anonymized. In the coding world, dental histories are Navajo.

    In addition, nobody cares about dental records like they do medical records.

  • Illogicbuster

    “researcher took the “anonymous” dataset released as part of the first Netflix prize, and demonstrated how he could correlate the movie rentals listed with public IMDB reviews.”

The example proves the opposite. Unless one is up on the net spewing data all over using one’s true ID, it is difficult if not impossible to correlate…

    Logic is your friend. Use it.

  • Luca


the problem is not de-anonymizing data, but how much it costs to de-anonymize it.

Finding a way to identify info should be related to the efficiency and the cost of obtaining that data. Of course we should give a value to personal information… but this is tough work…

  • HJ Boitel

Anonymization, when possible, permits preservation of privacy and enhancement of personal physical and financial security. It may also grant somewhat of a monopoly on the use of that data to whatever second party collects and preserves it.

Anonymization has three primary areas of disadvantage. 1) The person whose data is anonymized finds that it is less easily available to him (and his service providers) than it would otherwise be; 2) Some people use anonymity to harm the reputation and quality of life of others. We see this frequently occurring with the great increase in anonymity opportunity provided by the internet. 3) The sciences are deprived of the opportunity to correlate data about the same individual, thus severely limiting the ability to draw conclusions as to causation, predictability and remedy. It is this third aspect to which the following comments apply.

    The science of medicine permits ready demonstration of the foregoing disadvantages. For the most part, medical research studies involve a relatively small number of patients. Even as to those patients, numerous genetic and environmental factors are not collected and correlated. The result is a regular cycle of sharp change as to what is or is not good for you, and prolonged “wars” against this or that medical problem, usually without a great deal of success.

Whether or not, and to what extent, one uses a microwave or a cell telephone or takes a vacation or drinks coffee, to name a very few things, may have a significant effect upon health, for better or worse. How are we to collect and correlate such data if it is nonexistent or anonymized?

    We are gradually putting microprocessors in all kinds of things and they will be collecting and transmitting data concerning their use and state of repair. Imagine how much data that will produce that may be relevant to human health.

What we need is a system that preserves appropriate privacy, while still enabling collection and correlation. The only practical way to do this is to provide everyone with an anonymous ID, in addition to their public ID. There would then be two collection systems – one that would collect information associated with the public ID, and the other would collect that same information, plus a lot more, associated with the anonymous ID. The anonymous, but trackable, information would be pumped into supercomputers that spend their time drawing out the presence and absence of correlations.

Granted, any such system would require a secure, trustworthy second party, so that the public ID will not be correlated to the anonymous ID. All of life entails risks. Ignorance presents the greatest risk of all.

  • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

    You write: “There’s so much good that can be accomplished using open datasets,”

    That’s an interesting pre-supposition, but I’m personally not aware of examples of “good” that’s been accomplished with open datasets. Could you give me five examples, and how the outcomes were judged to be “good”?

I really want to see some debate on this, because some really bright computer science researchers are working very hard on very difficult things like differential privacy. IMHO there are more obvious “good” things like protecting the power grid, or programming paradigms / tools for ensuring correctness and maximum efficiency of massively parallel software, that they could be working on.

  • http://petewarden.typepad.com/ Pete Warden

    Great question Ed, you’ve nailed one of my assumptions. I believe that the increased number and coverage of open data sets will lead to a significant number of worthwhile innovations.

    Why do I think open data is more than just a shiny toy for geeks that distracts us from more worthwhile but less sexy work? My canonical example is web search. It’s somewhat of a historical accident that early websites allowed robots to crawl them, but without that openness we’d have no way to navigate the web.

The real question though is “What could we do if more data sets were open?”. It’s hard to point to success stories yet, because so few of the sets I’d like to see are released. I’d argue that the Heritage Health Prize is a promising candidate, if the contest model is able to deliver real-world medical benefits. I’m also hopeful about the Global Viral Forecasting Initiative’s work, and Cazoodle’s work mapping Afghanistan for the US Army using Wikipedia, Flickr and OpenStreetMap seems to be effective, though people may disagree about how ‘good’ the goal is.

    When I talk to people throughout the non-profit and political world, I hear about painful problems that seem like they would benefit if more data was available, from corruption to crime and medicine. The reason I included “Seeing Like a State” was precisely because I’m aware that data is my hammer, and it’s a cautionary tale that there’s a long history of technocrats making problems worse by applying simplistic solutions to complex situations.

    So, yes, this is a debate I’d like to hear too. Email me if you write any more on your own views, I’d love to dig deeper.

  • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

    Well … I actually signed up for the Heritage Health Prize briefly but have withdrawn. One of the reasons I withdrew is that I don’t see any promise in what the contest calls for – a complex predictive model for hospitalizations leading to “earlier interventions”.

    We know so much already – smoking, drunk driving, obesity, poverty, etc. lead to hospitalization and “early interventions” are a matter of public policy for the most part. Where I think the research dollars need to go is in early *detection* of cancer, better genetic / biochemical modeling of the human mind / body system — and lowering the *costs* of technology associated with enhancing health.

  • http://petewarden.typepad.com/ Pete Warden

    The factors you mention are well-known but intractable because they require changes in individuals’ behaviors. I see the promise of the Heritage prize coming from spotting unexpected connections thanks to new analysis techniques, hopefully things that doctors can take effective action on. Persuading someone to stop smoking is an unsolved problem, but if you can spot that older people recovering from gall bladder surgery have a temporarily high risk of strokes, getting them to take the right drugs is comparatively easy.

    I don’t see this as taking away from traditional research at all. If we can improve the way drug trials are handled, by mining a massive sample of medical records on-demand to quickly understand how effective different combinations of treatments and drugs are, the saved money can be plowed into the sort of basic science you mention.

    Will it work like that? There’s no guarantees, but since the only way to know is to try it, it seems worth the resources we’re investing to find out.

  • If you can't get along I'm going to have to separate you two

    Illogicbuster is out of line.

    The “opposite” (of what?) is not proven at all. Posting a review on IMDB and renting movies is not “spewing all over the web.”

Stating that a process is “difficult, if not impossible” means nothing. “Difficult” gets done every day. The fact remains that the ease with which data are collected, submitted, and then mined makes the correlation of data from disparate sets not only possible, but likely, and limited only by someone’s initiative, access to datasets, and the inspiration to look at apples and oranges, realize that they both have seeds, and wonder what else they have in common.

    And the patronizing closing line (“Logic is your friend. Use it.”) is rude, even on a t-shirt or bumper-sticker, much less in polite discourse. There’s no need to sink to the level of reality tv sniping.

    A little grace goes a long way.