ENTRIES TAGGED "language"
Sneak Peek at Upcoming Session at Strata Santa Clara 2013
By Robert Munro
Plain text is the world’s largest source of digital information. As the amount of unstructured text grows, so does the percentage of text that is not in English. The majority of the world’s data is now unstructured text in languages other than English. So unless you’re an exceptional polyglot, you can’t understand most of what’s out there, even if you want to.
Language technologies underlie many of our daily activities. Search engines, spam filtering, and news personalization (including your social media feeds) all employ smart, adaptive knowledge of how we communicate. We can automate many of these tasks well, but there are places where we fall short. For example, the world’s most spoken language, Mandarin Chinese, is typically written without spaces. “解放大道” can mean “Liberation Avenue” or “Solution Enlarged Road” depending on where you place the word boundaries. It’s a kind of ambiguity that we only need to worry about in English when we’re registering domain names and inventing hashtags (something the folk at “Who Represents” didn’t worry about enough). For Chinese, automated systems still don’t get it right: even the best ones make an error roughly every 20 words. We face similar problems for about a quarter of the world’s data. We can’t even reliably tell you what the words are, let alone extract complex information at scale.
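To make the segmentation ambiguity concrete, here is a minimal Python sketch that lists every way to split an unspaced string into known words. The five-entry lexicon and its English glosses are assumptions invented for this illustration, not a real dictionary, and exhaustive enumeration is not how production segmenters work; it only shows why “解放大道” has two valid readings.

```python
# Hypothetical toy lexicon: entries and glosses are illustrative only.
LEXICON = {
    "解放": "liberation",
    "大道": "avenue",
    "解": "solution",
    "放大": "enlarged",
    "道": "road",
}


def segmentations(text, lexicon=LEXICON):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word, then segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


for words in segmentations("解放大道"):
    glosses = " ".join(LEXICON[w] for w in words)
    print(" | ".join(words), "->", glosses)
```

Both splits are made entirely of valid words, so a dictionary alone cannot pick between them; choosing the intended one takes context. The same function applied to an unspaced English domain name shows the equivalent problem.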
A look at the historical accuracy of "Mad Men's" dialogue.
"Mad Men" is praised for its precise attention to historical visuals, but how does its dialogue stack up against text from the 1960s? Ben Schmidt's new visualization explores that question.
A look at the historical accuracy of "Downton Abbey's" language.
Ben Schmidt ran the script of the "Downton Abbey" season two finale through Google Ngrams to see how the show's language matches up with history.
Big data as a discipline or a conference topic is still in its formative years.
Big data is a massive opportunity, but the language used to describe it ("goldrush," "data deluge," "firehose," etc.) reveals we're still searching for its identity.