Big data isn’t just about multi-terabyte datasets hidden inside eventually-consistent distributed databases in the cloud, or enterprise-scale data warehousing, or even the emerging market in data. It’s also about the hidden data you carry with you all the time: the slowly growing datasets on your movements, contacts and social interactions.
Until recently, most people’s understanding of what can actually be done with the data collected about us by our own cell phones was theoretical. There were few real-world examples. But over the last couple of years, this has changed dramatically. Courting hubris perhaps, but I must admit it’s possible some of that was my fault, though I haven’t been alone.
The data you carry with you
You probably think you know how much data you carry around with you on your cell phone. You’ll certainly be aware of it if you’ve ever lost your phone, or had it stolen, or it’s just plain stopped working. But there is a large amount of data in the background that isn’t surfaced in the user interface.
We know about what I generally call primary data: calendars, address books, photographs, SMS messages and browser bookmarks. These are usually user generated, and we’d be pretty unhappy if we lost them. There is also the secondary data that the phone generates about us: call history, voice mail, usage information and records of our current and past locations. Most of what I’d call secondary data is surfaced to us in our phone’s user interface. We generally can’t change this sort of information without resetting the phone to a factory fresh condition; it’s generated by the device for us, it’s not something we generate ourselves.
But there is also what I refer to as tertiary data. This is data that, similar to the examples I mentioned above, is generated about us, rather than by us. Mostly, this data consists of cache files — data that is entirely necessary to you using the device, or significantly improves your user experience, but you don’t necessarily know is there. At least until some hole is found in the operating system to expose that data layer to you. That’s happened before, after all.
An obvious example is tucked in your photographs. Every picture you take is geotagged and date stamped, and if you publish your pictures to a photo-sharing site without stripping that information, you’re leaking data. Back in 2007, when geotagged photographs of newly arrived helicopters at a U.S. Army base in Iraq were published to the Internet, they allowed insurgents to determine the exact location of the helicopters inside the compound and conduct a mortar attack. Four of the AH-64 Apaches on the flight line were destroyed in the attack.
When you opened the Path application on your phone, it automatically uploaded your address book to Path’s servers so it could find “friends” that you might want to connect to. It did this without asking for explicit permission, or even implicit permission for that matter. Path has since apologized and updated its application so that it now asks permission before pushing your address book to its servers.
This was not data theft, but data leakage. You asked the application to accomplish something and didn’t really ask yourself how it was doing it. While there are technical solutions that don’t involve uploading your address book, the laziest solution is probably what you should have expected. I can almost hear the developers, “… we’ll just upload the address book for now and switch to hashing later on when we have time.”
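To make the “switch to hashing” idea concrete, here is a minimal sketch of what it could look like. It is not Path’s actual implementation: the function names (`normalize_email`, `hash_contact`, `find_friends`) are hypothetical, and SHA-256 is an assumed choice of hash. The idea is that only digests of your contacts would need to leave the phone, and the service compares them against digests of its registered users.

```python
import hashlib


def normalize_email(email: str) -> str:
    """Lowercase and trim so the same address always hashes identically."""
    return email.strip().lower()


def hash_contact(email: str) -> str:
    """Return the hex SHA-256 digest of a normalized email address."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()


def find_friends(address_book: list[str], registered_hashes: set[str]) -> list[str]:
    """Return the contacts whose hashes the (hypothetical) service already knows.

    In a real deployment only the hashes would be sent to the server; the
    matching is done locally here to keep the example self-contained.
    """
    return [email for email in address_book if hash_contact(email) in registered_hashes]


# Hypothetical data: the service holds only hashes of its registered users.
registered = {hash_contact("alice@example.com")}
contacts = ["Alice@Example.com ", "bob@example.com"]
matches = find_friends(contacts, registered)
print(matches)
```

Hashing is not a complete fix. Email addresses and phone numbers are drawn from a small enough space that plain hashes can be guessed by brute force, so this reduces the leak rather than eliminating it.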
There has been a lot of comment that somehow the whole Path thing was unexpected. Realistically, that’s not the case. It’s not an isolated circumstance, either. To the best of my knowledge Hipster and other apps also tapped your address book behind the scenes without asking permission. Interestingly, there are other, less obvious, culprits. Applications that make use of Chillingo’s “Crystal” game service, like Angry Birds, will in some circumstances also upload your address book. While there is a button to push, it is, at least for me, misleadingly labelled and doesn’t suggest what’s going to happen next.
Data leakage like this is not really a solvable problem at the user level, at least not in real-time. Having multiple permission boxes pop up at regular intervals is a bad design choice; users stop reading them, they lose importance and become ineffective. Just try using Microsoft Windows and you’ll understand exactly what I mean. Modal interrupts should be reserved for vital time-critical issues. They’re already used far too prolifically in iOS. Run the Mail application with multiple mail accounts configured when you’re not connected to the network and that’ll become instantly obvious. You’ll be bombarded by error messages.
I did have a thought that you might be able to deploy a customized web proxy directly onto your mobile device and have all web requests directed through it. The proxy would sift through the outgoing network connections in a (semi-)Bayesian manner looking for data that you don’t want transmitted and stop the application cold before it sends it to the remote server. Basically, it’s acting as a reverse spam filter, or a smart firewall, depending on how you want to think about it.
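As a rough sketch of the filtering step, the check below looks at an outgoing request body and blocks it if it contains too many values you have marked as private. To be clear about the assumptions: this uses simple substring matching with a threshold, not the (semi-)Bayesian classifier I have in mind, and the names `count_leaks`, `should_block` and the `threshold` parameter are all hypothetical.

```python
def count_leaks(payload: str, sensitive_values: set[str]) -> int:
    """Count how many distinct sensitive values appear in an outgoing payload."""
    lowered = payload.lower()
    return sum(1 for value in sensitive_values if value.lower() in lowered)


def should_block(payload: str, sensitive_values: set[str], threshold: int = 2) -> bool:
    """Block the request when it carries at least `threshold` sensitive values.

    The threshold is an assumed tuning knob: a single match may be a
    legitimate message to one contact, several at once looks like a bulk upload.
    """
    return count_leaks(payload, sensitive_values) >= threshold


# Hypothetical configuration: the data you do not want leaving the phone.
address_book = {"alice@example.com", "bob@example.com", "+44 1234 567890"}

upload = "contacts=alice@example.com,bob@example.com"
harmless = "q=weather+forecast"

blocked = should_block(upload, address_book)
allowed = not should_block(harmless, address_book)
print(blocked, allowed)
```

A real proxy would also have to cope with encrypted connections and with data that has been encoded or compressed before sending, which plain matching like this would miss.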
I think that something like this could well be far more effective at stopping data leakage than the current solution, which Google has used on Android. Permissions pages shown when you initially install an application are all very well, but most people don’t read them, and when you’re installing an application you’re not really thinking about why it might need certain permissions. Like modal dialogs on the iPhone, you subconsciously start to ignore them, to your peril. However, you can be very clear about what data you don’t want leaving your phone. That means one configuration page for the proxy, rather than another permissions page every time you install an application.
Location, location, location
Of course, I can’t really talk about data leakage without mentioning the kerfuffle surrounding location and data privacy that happened just about this time last year. Unsurprisingly, the file in question still exists, despite some of the press stories; the existence of the file was never the problem. A cache of that nature is fairly necessary if you want to have reliable and timely location services on your phone. However, the file is now actually just that, a cache, and it is regularly swept clean by the operating system. It’s also not included in your usually unencrypted backups to your laptop, which was perhaps more of a problem than the fact it wasn’t being cleared out in the first place.
A visualization of iPhone location data.
What Apple was doing was taking a piece of tertiary data, generated about you by the device, and then exposing it on a platform (laptop or desktop) where accessing that data was easier. There are a lot of people who know how to navigate a file system on a computer, but a lot fewer who would know how to get the same data directly from the phone itself. It was a classic case of data leakage: data moved from a secure(ish) environment on the phone to a less secure one on the computer.
Back in the days of floppy disks, the lines of ownership were pretty clear. If you had the disk, the data was yours. If someone else had it, it was theirs. Things these days are much blurrier. That tertiary data — data that’s generated about us but not by us — doesn’t just build up on your mobile devices of course. Other people are building datasets about our patterns of movement, buying decisions, credit worthiness and other things. The ability to compile these sorts of datasets left the realm of major governments with the invention of the computer.
We’re all aware of this, and there’s even a provocative buzzword to describe it: data exhaust. It’s the data we leave behind us, rather than carry with us.
In the U.S., data from grocery store loyalty schemes has been used by security services to search for terrorist suspects. Turns out the number of toilet rolls you buy can be quite telling.
Which does make me think, instead of being afraid of the data exhaust, perhaps we should embrace it. In the U.K., the biggest retailer is the supermarket Tesco. Like many, I spend a good fraction of my income there, and like almost everyone I know, I have a Tesco Clubcard. This is a loyalty card that has a record of (almost) every purchase I make, from toilet rolls to roast chicken.
I’d actually pay good money for a copy of my own Clubcard data, so long as it was actually in a machine-readable format, not on paper. Although for Tesco, the data is only really interesting in aggregate; it’s the fact that they have millions of Clubcard records that makes the dataset useful to the company. To me, a history of my purchases would be useful data.
Of course, people have already started selling our data exhaust back to us. Think about your credit report, for instance.
Keep your friends close, and your enemies closer
It’s not just your own data exhaust that you have to worry about. There was an interesting paper recently by Adam Sadilek of the Department of Computer Science at the University of Rochester. It talked about how geotagged tweets could be used to locate individuals, even if they themselves didn’t geotag their tweets — it was enough that their friends did so.
Geotagged messages on Twitter during a typical weekday afternoon in New York City.
The paper found that only a couple of weeks’ worth of location data on an individual, combined with location data from their two most-sharing friends, was enough to place that person within a 100-meter radius with 77% accuracy. That rises to nearly 85% when you combine information from nine friends.
Even someone who has never shared their location at all can be pinpointed with 47% accuracy from information available from two friends. That goes up to 57% when you include nine friends.
There is a great debate going on right now, which is really only starting to surface into the mainstream press, about how we share data. Despite social networks becoming mainstream, the recent privacy debacles in the mobile space say a lot about how users perceive information privacy. I think Sadilek’s paper presents even more compelling evidence.
For instance, I’m finding Google’s new Instant Upload feature, where photos taken on my phone are automatically uploaded to Google+ behind the scenes, a lot spookier and more worrying than I thought I would. It’s especially interesting that I’m feeling that way, as I’m using Apple’s Photo Stream without thinking or worrying about it that much.
I’m trying to figure out whether it’s because the privacy trade-off — in Apple’s case sharing my photos between all my devices, and in Google’s case making my photos more-or-less instantly available for sharing in Google+ — is more obviously in my favor with Photo Stream, or it’s for other reasons.
The interesting thing here is that Photo Stream and Instant Upload are, at least behind the scenes, effectively identical. Both are cloud services and your photos are stored in a data center somewhere. The master copies of your photos have essentially been moved to the cloud, rather than residing on your device.
However, because of the context these two services operate in, I have no problems with one, and I’m finding the other an uncomfortable fit. I think there’s a big lesson there for people dealing with personal information. When you’re sharing someone’s information, even with their informed consent, the context shapes how they think about the implications of that sharing.
So, all of this got me thinking. There are large personal datasets about me, and you, and everyone, being built up by large companies. But we’re also building up datasets about ourselves, in our own control. What happens if we mash them together? Can we actually do something productive?
I’m currently running an interesting experiment with my credit card and my iPhone. I’m scraping my bank’s website to grab transaction data in near real-time onto one of my servers. Each transaction comes with a postcode. This is like a U.S. zip code, but it normally specifies a much smaller neighborhood, perhaps down to a single street or smaller in a major urban area.
Watching my credit card transactions in real-time.
On my iPhone, I’m running an application that continually monitors my location using the Significant Location Change service, so my phone knows my location to better than 1km (perhaps much better in a crowded city) more or less all of the time.
Every time a new transaction occurs, I forward it via push notification from the back-end server to my iPhone. Now, my iPhone knows both the location where the transaction took place and where I actually was at the time. If those locations don’t match, then this indicates there might have been a fraudulent transaction and it flags it for me with a notification.
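The comparison at the heart of this can be sketched in a few lines. This is an illustration rather than the code I’m actually running: it assumes the transaction’s postcode has already been converted to a latitude and longitude, the function names `haversine_km` and `is_suspicious` are hypothetical, and the 5 km cut-off is an arbitrary value chosen to sit comfortably above the roughly 1 km accuracy of the phone’s location.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_suspicious(phone_fix: tuple[float, float],
                  transaction_fix: tuple[float, float],
                  max_km: float = 5.0) -> bool:
    """Flag a transaction that happened further than `max_km` from the phone."""
    return haversine_km(*phone_fix, *transaction_fix) > max_km


# Hypothetical coordinates: the phone is in Exeter, one purchase is around
# the corner and another is in central London.
phone = (50.7236, -3.5275)
nearby_shop = (50.7260, -3.5300)
london_shop = (51.5074, -0.1278)

print(is_suspicious(phone, nearby_shop))
print(is_suspicious(phone, london_shop))
```

The threshold is the main design choice here. Set it too tight and the coarse location fix produces false alarms; set it too loose and a fraudulent purchase in the next town slips through.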
The interesting thing here is that I’m using data that my credit card company doesn’t have, and hopefully will never have: my actual physical location when the transaction took place. They couldn’t possibly provide this service to me because they simply don’t have the data I have.
Of course, there are false positives. Online transactions in particular stand out. Most of these are tagged with a postcode of the headquarters of the company I’m dealing with. However, my next development step will be to give my back-end server code access to my inbox and allow it to scrape for online transaction receipts. This should reduce the false-positive rate down to something vanishingly small, and I should be able to deal with those left over with some sort of machine learning. After all, there’s a human-readable string attached to each transaction that details the retailer and sometimes other useful information.
A thought to ponder
A thought to ponder in the dead of night: In the near future, the absence of data is going to be increasingly unusual. If you think the data exhaust you leave behind yourself is wide and varied, then just you wait, because we’re at the banging-the-rocks-together stage right now.
If your data exhaust becomes assumed, what happens if you turn your phone off for an hour or two one night? What if you’re accused of a murder during that time period, and you can’t prove where you were? Perhaps in the future that’s going to be sufficiently unusual that it’s automatically suspicious. Innocent until proven guilty may underlie our current legal system, but that’s because our current legal system was codified in a very different era, one that was data poor rather than data rich. Perhaps in the future, the absence of data will imply guilt.
I discussed hidden data in this interview at Strata CA 2012.