First, technology is good. Then it gets bad. Then it gets stable.
This has been going on for a long time, likely since the invention of fire, knives, or the printed word. But I want to focus specifically on computing technology. The human race is busy colonizing a second online world and sticking prosthetic brains (today, we call them smartphones) in front of our eyes and ears. And the stacks of technology on which they rely are vulnerable.
When we first created automatic phone switches, hackers quickly learned how to blow a Cap’n Crunch whistle to get free calls from pay phones. When consumers got modems, attackers soon figured out how to rapidly redial to get more than their fair share of time on a BBS, or to program scripts that could brute-force their way into others’ accounts. Eventually, we got better passwords and we fixed the pay phones and switches.
We moved up the networking stack, above the physical and link layers. We tasted TCP/IP, and found it good. Millions of us installed Trumpet Winsock on consumer machines. We were idealists rushing onto the wild open web and proclaiming it a new utopia. Then, because of the way the TCP handshake worked, hackers figured out how to DDoS people with things like SYN attacks. Escalation, and router hardening, ensued.
We built HTTP, and SQL, and more. At first, they were open, innocent, and helped us make huge advances in programming. Then attackers found ways to exploit their weaknesses with cross-site scripting and buffer overruns. They hacked armies of machines to do their bidding, flooding target networks and taking sites offline. Technologies like MP3s gave us an explosion in music, new business models, and abundant crowd-sourced audiobooks — even as they leveled a music industry with fresh forms of piracy for which we hadn’t even invented laws.
Here’s a more specific example of unintended consequences. Paul Mockapetris is one of the creators of today’s Internet. He created DNS and implemented SMTP, fundamental technologies on which all of us rely. But he’s also single-handedly responsible for all the spam in the world.
That might be a bit of an overstatement, though I tease him about it from time to time. But there’s a nugget of truth to it: DNS was a simplified version of more robust directories like those in X.25. Paul didn’t need all that overhead, because he was just trying to solve the problem of remembering all those Internet addresses by hand, not trying to create a hardened, authenticated, resilient address protocol. He also implemented SMTP, the Simple Mail Transfer Protocol. It was a whittled-down version of MTP (hence the “S”), and it didn’t include end-to-end authentication.
These two things, SMTP and DNS, make spam possible. If either of them had some kind of end-to-end authentication, it would be far harder for spammers to send unwanted messages and get away with it. Today, they’re so entrenched that attempts to revise email protocols in order to add authentication have consistently failed. We’re willing to live with the glut of spam that clogs our servers because of the tremendous value of email.
We owe much of the Internet’s growth to simplicity and openness. Because of how Paul built DNS and SMTP, there’s no need to go through a complex bureaucracy to start something, or to jump through technical hoops to send an email to someone you met at a bar. We can invite a friend to a new application without strictures and approvals. The Internet has flourished precisely because it was built on a foundation of loose consensus and working code. It’s also flourished in spite of it.
These protocols, from the lowly physical connections and links of Ethernet and PPP all the way up through TCP sessions and HTTP transactions, are arranged in a stack: independent layers of a delicious networking cake. Because the Internet is divided into discrete layers, innovation can happen at one layer (copper to fiber; Token Ring to Ethernet; UDP to TCP; Flash to DHTML; and so on) independent of the other layers. We didn’t need to rewrite the Internet to build YouTube.
Paul, and the other framers of the web, didn’t know we’d use it to navigate, or stream music — but they left it open so we could. But where the implications of BBS hacking or phone phreaking were limited to a small number of homebrew hackers, the implications for the web were far bigger, because by now, everyone relied on it.
Anyway, on to big data.
Geeks often talk about “layer 8.” When an IT operator sighs resignedly that it’s a layer 8 problem, she means it’s a human’s fault. It’s where humanity’s rubber meets technology’s road. And big data is interesting precisely because it’s the layer 8 protocol. It’s got great power, demands great responsibility, and portends great risk unless we do it right. And just like the layers beneath it, it’s going to get good, then bad, then stable.
Other layers of the protocol stack have come under assault by spammers, hackers, and activists. There’s no reason to think layer 8 won’t as well. And just as hackers find a clever exploit to intercept and spike an SSL session, or trick an app server into running arbitrary code, so they’ll find an exploit for big data.
The term “data warfare” might seem a bit hyperbolic, so I’ll try to provide a few concrete examples. I’m hoping for plenty more in the Strata Online Conference we’re running next week, which has a stellar lineup of people who have spent time thinking about how to do naughty things with information at scale.
Analytics applications rely on tags embedded in URLs to understand the nature of the traffic they receive. URLs contain parameters that identify the campaign, the medium, and other information on the source of visitors. For example, the URL http://www.solveforinteresting.com?utm_campaign=datawar tells Google Analytics to assign visits from that link to the campaign “datawar.” There’s seldom any verification of this information; with many analytics packages it’s included in plain text. Let’s say, as a joke, you decide you’d like your name to be the most prolific traffic source on a friend’s blog. All you need is a few willing participants, and you can simply visit the blog from many browsers and machines using your name as the campaign tag. You’ll be the top campaign traffic source.
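To make the tagging trick concrete, here’s a minimal sketch of how a naive analytics tally might work. It is not how Google Analytics is actually implemented; the function names and the “alice” tag are hypothetical, and the only thing taken from the example above is the URL and its utm_campaign parameter.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs


def campaign_of(url: str) -> str:
    """Return the utm_campaign tag from a URL, or "(none)" if absent."""
    query = parse_qs(urlparse(url).query)
    return query.get("utm_campaign", ["(none)"])[0]


def top_campaigns(hits: list[str]) -> list[tuple[str, int]]:
    """Count visits per campaign, trusting whatever tag each URL carries."""
    return Counter(campaign_of(u) for u in hits).most_common()


base = "http://www.solveforinteresting.com"
genuine = [f"{base}?utm_campaign=datawar"] * 3
forged = [f"{base}?utm_campaign=alice"] * 5  # hypothetical prankster tag

print(top_campaigns(genuine + forged))
# [('alice', 5), ('datawar', 3)]
```

Nothing in the tally checks who generated a hit, so five forged visits are enough to put the made-up campaign on top.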
This seems innocent enough, until you realize that you can take a similar approach to misleading your competitor. You can make them think a less-effective campaign is outperforming a successful one. You can trick them into thinking Twitter is a better medium than Google+, when in fact the reverse is true, which causes them to pay for customer acquisition in less-effective channels.
The reality isn’t this simple: smart businesses track campaigns by outcomes such as purchases rather than by raw visitors. But the point is clear: open-ended data schemes like tagging work because they’re extensible and simple, but that also makes them vulnerable. The practice of “googlebombing” is a good example. Linking a word or definition to a particular target (such as sending searches for “miserable failure” to a biography on the White House website) simply exploits the openness of Google’s underlying algorithms.
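The point about tracking outcomes rather than raw visitors can be sketched in a few lines. The visit log below is invented for illustration, as are the campaign names and the summarize helper; the idea is only that injected traffic inflates visit counts without producing purchases.

```python
from collections import defaultdict

# Hypothetical visit log: (campaign tag, whether the visit ended in a purchase).
visits = (
    [("newsletter", True)] * 4
    + [("newsletter", False)] * 6
    + [("forged", False)] * 50  # injected traffic that never buys
)


def summarize(log):
    """Return {campaign: (visit_count, purchase_count)}."""
    totals = defaultdict(lambda: [0, 0])
    for campaign, purchased in log:
        totals[campaign][0] += 1
        totals[campaign][1] += int(purchased)
    return {name: tuple(counts) for name, counts in totals.items()}


stats = summarize(visits)
by_visits = max(stats, key=lambda c: stats[c][0])
by_purchases = max(stats, key=lambda c: stats[c][1])
print(by_visits, by_purchases)
# forged newsletter
```

Ranking by purchases ignores the forged campaign entirely, which is why outcome-based tracking is harder (though not impossible) to game.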
But even if you think you have a reliable data source, you may be wrong. Consider that a few years ago, only 324 Athenians reported having swimming pools on their tax returns. This seemed low to some civil servants in Greece, so they decided to check. Using Google Maps, they counted 16,974 of them, despite efforts by citizens to camouflage their pools under green tarpaulins.
Whether the data is injected, or simply collected unreliably, data’s first weakness is its source. Collection is seldom authenticated. There’s a reason prosecutors insist on chain of evidence, but big data and analytics, like DNS and SMTP, are usually built for simplicity and ubiquity rather than for resiliency or auditability.
Mistraining the algorithms
Most of us get attacks almost daily, in the form of spam and phishing. Most of these attacks are blocked by heuristics and algorithms.
Spammers are in a constant arms race with these algorithms. Each message that’s flagged as spam is an input into anti-spam algorithms — so if a word like “Viagra” appears in a message you consider spam, then the algorithm is slightly more likely to consider that word “spammy” in future.
If you run a blog, you probably see plenty of comment spam filled with nonsense words. These are attempts to mistrain the machine-learning algorithms that block spammy content by teaching them innocuous words, undermining their effectiveness. You’re actually watching a fight between spammers and blockers, played out comment by comment, on millions of websites around the world.
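Here’s a toy sketch of how that mistraining might play out. It assumes a deliberately simple filter that scores each word by its share of spam sightings; real anti-spam systems are far more sophisticated, and the WordFilter class, the messages, and the scoring rule are all my own illustration rather than any actual product’s algorithm.

```python
from collections import Counter


class WordFilter:
    """Toy spam scorer: a word's score is its smoothed share of spam sightings."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text: str, is_spam: bool) -> None:
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())

    def word_score(self, word: str) -> float:
        s, h = self.spam[word], self.ham[word]
        return (s + 1) / (s + h + 2)  # add-one smoothing; unseen words score 0.5

    def score(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.5  # no evidence either way
        return sum(self.word_score(w) for w in words) / len(words)


f = WordFilter()
f.train("cheap viagra now", is_spam=True)
f.train("lunch meeting tomorrow", is_spam=False)

legit = "lunch meeting tomorrow"
before = f.score(legit)

# Hypothetical poisoning: spam padded with everyday words gets flagged as spam.
for _ in range(5):
    f.train("cheap viagra lunch meeting tomorrow", is_spam=True)

after = f.score(legit)
print(f"{before:.2f} -> {after:.2f}")
# 0.33 -> 0.75
```

After a handful of padded spam messages, a perfectly ordinary note about lunch looks spammy to the filter, which is one way that teaching it innocuous words undermines its usefulness.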
Making other attacks more effective
Anti-spam heuristics happen behind the scenes, and they work pretty well. Despite this, some spam does get through. But when it does, we seldom click on it, because it’s easy to spot. It’s poorly worded; it comes from an unfamiliar source; it doesn’t render properly in our mail client.
What if that weren’t the case?
A motivated attacker can target an individual. If they’re willing to invest time researching their target, they can gain trust or impersonate a friend. The discovery of several nation-state-level viruses aimed at governments and rich targets shows a concerted, hand-crafted phishing attack can work. In the hands of an attacker, tools like Facebook’s Graph Search or Peekyou are a treasure trove of facts that can be used to craft a targeted attack.
The reason spam is still easy to spot is that it’s traditionally been hard to automate this work. People don’t dig through your trash unless you’re under investigation.
Today, however, consumers have access to “big data” tools that spy agencies could only dream of a few short years ago, which means attackers do, too. The effectiveness of phishing, identity theft, and other information crimes will soar once bad actors learn how to harness these tools.
But digging through virtual trash and data exhaust is what machines do best. Big data lets personal attacks work at scale. If smart data scientists with decent grammar tried to maximize spam effectiveness, we’d lose quickly. To them, phishing is just another optimization problem.
Trolling to polarize
Data warfare doesn’t have to be as obvious as injecting falsehoods, mistraining machine learning algorithms, or leveraging abundant personal data to trick people. It can be far more subtle. Let’s say, for example, you wanted to polarize a political discussion such as gun control in order to reduce the reasoned discourse of compromise and justify your hard-line stance. All you need to do is get angry.
A recent study showed that the tone of comments on a blog post had a tangible impact on how readers responded to the post. When comments used reasonable language, readers’ views were more moderate. But when those comments were aggressive, readers hardened their positions. Those who agreed with the post did so more strongly, and those who disagreed objected more fiercely. The chance for compromise vanished.
Similar approaches can sway sentiment analysis tools that try to gauge public opinion on social platforms. Once reported, these sentiments often form actual opinion, because humans like to side with the majority. Data becomes reality. There are plenty of other examples of “adversarial” data escalation. Consider the programmer who created 800,000 books and injected them into Amazon’s catalog. Thanks to the frictionless nature of ebooks and the ease of generating them, he’s saturated their catalog (hat tip to Edd for this one).
The year of data warfare
Data warfare is real. In some cases, such as spam, it’s been around for decades. In other cases, like tampering with a competitor’s data, it’s been possible, but too expensive, until cloud computing and new algorithms made it cheap and easy. And in many new instances, it’s possible precisely because of our growing dependence on information to lead our daily lives.
Just as the inexorable cycle of good, bad, and stable has happened at every layer, so it will happen with big data. But unlike attacks on lower levels of the stack, this time it won’t just be spam in an inbox. It’ll be both our online and offline lives. Attackers can corrupt information, blind an algorithm, or inject falsehoods, changing outcomes in subtle, insidious ways that undermine a competitor or flip an election. Attacks on data become attacks on people.
If I have to pick a few hot topics for 2013, data warfare is one of them. I’m looking forward to next week’s online event, because I’m convinced that this arms race will affect all of us in the coming years, and it’ll be a long time before the armistice or détente.