By Robert Munro
Plain text is the world’s largest source of digital information. As the amount of unstructured text grows, so does the percentage of text that is not in English. The majority of the world’s data is now unstructured text outside of English. So unless you’re an exceptional polyglot, you can’t understand most of what’s out there, even if you want to.
Language technologies underlie many of our daily activities. Search engines, spam filtering, and news personalization (including your social media feeds) all employ smart, adaptive knowledge of how we communicate. We can automate many of these tasks well, but there are places where we fall short. For example, the world’s most spoken language, Mandarin Chinese, is typically written without spaces. “解放大道” can mean “Liberation Avenue” or “Solution Enlarged Road” depending on where you interpret the gaps. It’s a kind of ambiguity that we only need to worry about in English when we’re registering domain names and inventing hashtags (something the folk at “Who Represents” didn’t worry about enough). For Chinese, automated systems still don’t get it right: the best systems make an error every 20 words or so. We face similar problems for about a quarter of the world’s data. We can’t even reliably tell you what the words are, let alone extract complex information at scale.
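To make the segmentation ambiguity concrete, here is a minimal sketch that lists every way to split “解放大道” into words from a small dictionary. The lexicon and its English glosses are a hypothetical toy built only to reproduce the two readings above, and the function name `segmentations` is likewise an assumption for illustration. Real segmenters use large lexicons and statistical models, not exhaustive enumeration.

```python
from typing import Dict, List

# Hypothetical toy lexicon (an assumption for illustration only).
# The glosses follow the two readings given in the article.
TOY_LEXICON: Dict[str, str] = {
    "解放": "liberation",
    "大道": "avenue",
    "解": "solution",
    "放大": "enlarged",
    "道": "road",
}


def segmentations(text: str, lexicon: Dict[str, str]) -> List[List[str]]:
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment whatever follows it.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


if __name__ == "__main__":
    for words in segmentations("解放大道", TOY_LEXICON):
        glosses = " ".join(TOY_LEXICON[w] for w in words)
        print(" | ".join(words), "->", glosses)
```

With this toy lexicon the string has exactly two valid segmentations, and nothing in the characters themselves says which one the writer meant. Choosing between them takes knowledge of context, which is where automated systems still slip.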
We’ll talk more about the state of the art in language technologies at Strata 2013. For this article, we decided to answer a more basic question: “how are people actually communicating right now?”
The infographic accompanying this article shows the breakdown of which languages people are using for face-to-face communication, relative to phone-based communication and internet-enabled communication. By word count, almost 7% of the world’s communications are now mediated by digital technologies:
- Every three months, the world’s text messages exceed the word count of every book ever published.
- Text is cheap: every utterance since the start of humanity (about 50 exabytes, assuming 110 billion people have averaged 16,000 words a day for 20 years each) would take up less than 1% of the world’s current digital storage capacity.
- The Twitter ‘firehose’, outside the processing capacity of most organizations, would be about the size of the dot above the “i” in “English”.
- There are more than 6,000 other languages: only the top 1% are shown.
- Not one language from the Americas or Australia made the cut.
- Also omitted: email spam, which would be larger than every block except spoken Mandarin (官話).
- Short messages (SMS and instant messaging) account for nearly 2% of the world’s communications. This makes short message communication the most popular and linguistically diverse form of written communication that has ever existed.
- If the Facebook ‘like’ were considered a one-word language, it would be in the top 5% most widely spoken languages (although still outside the top 200).
- Your browser probably won’t show Sundanese script (ᮘᮞ ᮞᮥᮔ᮪ᮓ), even though the world’s Sundanese speakers outnumber the populations of New York, London, Tokyo and Moscow combined.
- You misread that last point as “Sudanese”, which is a variety of Arabic (العربية), and were surprised at the difference: we have a blind spot when it comes to knowing about the existence of languages.
- Is a picture worth 1,000 words? If so, shared pictures would double the size of the “social networks” block.
- Across all the world’s communications, 5 in every 10,000 words are directed at machines, not people: mainly search engines.
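The “text is cheap” estimate in the list above can be checked with a few lines of arithmetic. The sketch below uses the article’s inputs (110 billion people, 16,000 words a day, 20 years each). The storage cost of 4 bytes per word is an assumption, not a figure from the article: it is chosen only because it lands close to the quoted 50 exabytes. The 365-day year and the decimal exabyte (10^18 bytes) are also assumptions.

```python
# Inputs taken from the article.
PEOPLE_EVER = 110_000_000_000   # 110 billion people
WORDS_PER_DAY = 16_000          # average words spoken per person per day
YEARS_SPEAKING = 20             # years of speech per person

# Assumptions, NOT from the article.
DAYS_PER_YEAR = 365
BYTES_PER_WORD = 4              # assumed average storage cost of one word
EXABYTE = 10**18                # decimal exabyte


def total_words_spoken() -> int:
    """Total words ever spoken under the article's assumptions."""
    return PEOPLE_EVER * WORDS_PER_DAY * DAYS_PER_YEAR * YEARS_SPEAKING


def storage_exabytes(bytes_per_word: int = BYTES_PER_WORD) -> float:
    """Storage needed for every utterance, in exabytes."""
    return total_words_spoken() * bytes_per_word / EXABYTE


if __name__ == "__main__":
    print(f"words spoken: {total_words_spoken():.3e}")
    print(f"storage at {BYTES_PER_WORD} bytes/word: {storage_exabytes():.1f} EB")
```

The result scales linearly with the bytes-per-word assumption, so doubling it to 8 bytes per word roughly doubles the estimate to about 100 exabytes.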
Perhaps the most surprising outcome for many people will be the relatively small footprint of internet publications (www). Between all the news sites, blogs, and other sites, we simply aren’t adding that much content when counting by words produced. The persistence of web pages means that they are consumed more often, and there is a bias towards more dominant languages like English, especially in areas like technology and scientific publications. But when we use digital technologies to communicate, most of us are privately interacting via SMS, email and instant messaging, and are more likely to communicate in our first languages as a result.
As the connected world gradually takes in more of the actual world, we can expect the diversity of technology-enabled communication to more closely align with face-to-face communication. Understanding human communication at scale will be central to the next generation of people-centric technologies.