How the world communicates in 2013

Sneak Peek at Upcoming Session at Strata Santa Clara 2013

By Robert Munro

Plain text is the world’s largest source of digital information. As the amount of unstructured text grows, so does the percentage of text that is not in English. The majority of the world’s data is now unstructured text outside of English. So unless you’re an exceptional polyglot, you can’t understand most of what’s out there, even if you want to.

Language technologies underlie many of our daily activities. Search engines, spam filtering, and news personalization (including your social media feeds) all employ smart, adaptive knowledge of how we communicate. We can automate many of these tasks well, but there are places where we fall short. For example, the world’s most spoken language, Mandarin Chinese, is typically written without spaces. “解放大道” can mean “Liberation Avenue” or “Solution Enlarged Road” depending on where you interpret the gaps. It’s a kind of ambiguity that we only need to worry about in English when we’re registering domain names and inventing hashtags (something the folk at “Who Represents” didn’t worry about enough). For Chinese, automated systems still don’t get it right: the best systems make an error every 20 words or so. We face similar problems for about a quarter of the world’s data. We can’t even reliably tell you what the words are, let alone extract complex information at scale.
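
The segmentation ambiguity described above can be made concrete with a small sketch. The Python below enumerates every way to split “解放大道” into words from a five-entry toy lexicon. The lexicon, its glosses, and the function name `segmentations` are assumptions made for illustration only; real segmenters typically score candidate splits with statistical models rather than listing them all.

```python
# Toy lexicon (hypothetical, for illustration only); glosses follow the article.
LEXICON = {
    "解放": "liberation",
    "大道": "avenue",
    "解": "solution",
    "放大": "enlarge",
    "道": "road",
}


def segmentations(text, lexicon=LEXICON):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    # Try each prefix as a candidate first word, then segment the remainder.
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


if __name__ == "__main__":
    for words in segmentations("解放大道"):
        print(" | ".join(f"{w} ({LEXICON[w]})" for w in words))
```

With this toy lexicon the sketch finds two readings: 解 | 放大 | 道 ("solution, enlarge, road") and 解放 | 大道 ("liberation, avenue"). Nothing in the characters themselves says which one the writer meant.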

We’ll talk more about the state of the art in language technologies at Strata 2013. For this article, we decided to answer a more basic question: “How are people actually communicating right now?”

[Infographic: languages]

The infographic accompanying this article shows the breakdown of what languages people are using for face-to-face communication, relative to phone-based communication and internet-enabled communication. By word count, almost 7% of the world’s communications are now mediated by digital technologies:

  • Every three months, the world’s text messages exceed the word count of every book ever published.
  • Text is cheap: every utterance since the start of humanity would take up less than 1% of the world’s current digital storage capacity (about 50 exabytes, assuming 110B people have averaged 16,000 words a day for 20 years each).
  • The Twitter ‘firehose’, outside the processing capacity of most organizations, would be about the size of the dot above the “i” in “English”.
  • There are more than 6,000 other languages: only the top 1% are shown.
  • Not one language from the Americas or Australia made the cut.
  • Email spam, also omitted, would be larger than every block except spoken Mandarin (官話).
  • Short messages (SMS and instant messaging) account for nearly 2% of the world’s communications. This makes short message communication the most popular and linguistically diverse form of written communication that has ever existed.
  • If the Facebook ‘like’ was considered a one-word language, it would be in the top 5% most widely spoken languages (although still outside the top 200).
  • Your browser probably won’t show Sundanese script (ᮘᮞ ᮞᮥᮔ᮪ᮓ), even though the world’s Sundanese speakers outnumber the populations of New York, London, Tokyo and Moscow combined.
  • You misread that last point as “Sudanese”, which is a variety of Arabic (العربية), and were surprised at the difference: we have a blind spot when it comes to knowing about the existence of languages.
  • Is a picture worth 1,000 words? If so, shared pictures would double the size of the “social networks” block.
  • Across all the world’s communications, 5 in every 10,000 words are directed at machines, not people: mainly search engines.
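
The storage estimate in the list above can be checked with a few lines of arithmetic. The sketch below uses the article's own inputs (110 billion people, 16,000 words a day, 20 years each). The figure of 4 bytes per word is our assumption, chosen because it reproduces the quoted "about 50 exabytes"; the article does not state the value it used. Years are taken as 365 days and an exabyte as 10^18 bytes.

```python
# Inputs stated in the article.
PEOPLE_EVER = 110_000_000_000   # 110B people
WORDS_PER_DAY = 16_000
YEARS_SPEAKING = 20

# Assumptions not stated in the article.
DAYS_PER_YEAR = 365
BYTES_PER_WORD = 4              # assumed average size of a stored word
BYTES_PER_EXABYTE = 10**18      # decimal exabyte

total_words = PEOPLE_EVER * WORDS_PER_DAY * DAYS_PER_YEAR * YEARS_SPEAKING
total_bytes = total_words * BYTES_PER_WORD
exabytes = total_bytes / BYTES_PER_EXABYTE

# If this is "less than 1%" of storage capacity, capacity must exceed 100x it.
implied_min_capacity_eb = exabytes * 100

if __name__ == "__main__":
    print(f"words ever spoken: {total_words:.3e}")
    print(f"storage needed:    {exabytes:.1f} EB")
    print(f"implied capacity:  more than {implied_min_capacity_eb:.0f} EB")
```

Under these assumptions the total is roughly 1.3 × 10^19 words, or about 51 exabytes, which matches the figure quoted in the list.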

Perhaps the most surprising outcome for many people will be the relatively small footprint of internet publications (www). Across all the news sites, blogs, and other sites, we simply aren’t adding that much content when counting by words produced. The persistence of web pages means that they are consumed more often, and there is a bias towards more dominant languages like English, especially in areas like technology and scientific publications. But when we use digital technologies to communicate, most of us are interacting privately via SMS, email and instant messaging, and are more likely to communicate in our first languages as a result.

As the connected world gradually takes in more of the actual world, we can expect the diversity of technology-enabled communication to more closely align with face-to-face communication. Understanding human communication at scale will be central to the next generation of people-centric technologies.

Sources: Ethnologue, Data Center Knowledge, Gigaom, Nationalencyklopedin, Pingdom, Population Reference Bureau.
  • DOC

    The article is interesting. And although I know the infographic is from 2013, could you please correct the word for Hindi? It is not Hindustani (misspelled as हन्दुस्तानी). The right word is Hindi: हिंदी or हिन्दी. Incidentally, the code block at U+0900 for Devanagari caters to a large number of languages, including 10-11 official languages of India. The same is the case for the code block at U+0600, which caters to languages using Arabic script. As you know, Unicode is interested in script and not language.

  • http://www.robertmunro.com/ Robert Munro

    Hi Doc

    Author of the article here. You are right that Hindustani is not in common use. It’s still occasionally used by linguists to cover both Hindi and Urdu. The same is true for our spelling of Mandarin Chinese right next to it: in this article, we used the (older) formal name rather than the more common one.

    If you look at the image on our company page, we use the more common names for these and a few other languages:
    http://idibon.com/idibon-at-strata/

    Thanks for pointing this out!