I spent a few hours at the Mobile Voice conference and left with an appreciation of Google’s impact on the speech industry. Google’s speech offerings loomed over the few sessions I attended. Some of that was probably due to Michael Cohen’s keynote1 describing Google’s philosophy and approach, but clearly Google has the attention of all the speech vendors. Tim’s recent blog post on the emerging Internet Operating System captured the growing importance of networked applications that rely on massive amounts of data, and it was interesting to observe in person its impact on an industry. (Google’s speech and language technologies were among the examples Tim cited.)
Google thinks of seamless voice-driven interfaces as having two key features: (1) ubiquitous availability, so users can access speech interfaces from any app and on any device, and (2) high performance, so speech technologies lead to frictionless user interactions. To deliver ubiquitous, high-performance speech interfaces, Michael Cohen emphasized, Google leans on its big data systems, which underpin how the company develops all of its services.
Having speech technologies in the cloud lets Google quickly iterate and push enhanced speech engines on a regular basis. More importantly, its speech engines learn and get trained using real data from its many interconnected services. Speech engines typically rely on both language and acoustic models. Language models are statistical models of word sequences and patterns. Cohen pointed out that their language models use data collected from web searches, giving them access to an ever-growing corpus that few can match (230 billion words collected, refined to a vocabulary of the million most common words). Cohen disclosed that some of the more recent acoustic models they’re evaluating are built using unsupervised machine-learning algorithms. (These are speech algorithms trained on recorded speech that hasn’t been transcribed by hand.) While he coyly avoided explaining how an accurate system can be built from unsupervised techniques, it’s likely they use data from their 411 service (something Tim predicted 3 years ago). [Update (4/27): Readers point to YouTube and Google’s Voice Search on smartphones as likely sources of data for tuning speech engines.]
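To make the language-model idea concrete, here is a toy sketch of a bigram model estimated from relative frequencies. This is purely illustrative (all names are mine, and Google's production models are vastly larger and more sophisticated), but it shows what "a statistical model of word sequences" means in miniature:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count word-pair frequencies from a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def next_word_probability(counts, prev, word):
    """Estimate P(word | prev) as a relative frequency."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# A tiny corpus standing in for query logs.
corpus = ["call me tomorrow", "call me later", "call home"]
model = train_bigram_model(corpus)
print(next_word_probability(model, "call", "me"))  # 2 of the 3 continuations of "call"
```

A recognizer uses such probabilities to prefer plausible word sequences over acoustically similar but unlikely ones; the point Cohen made is that web-scale query logs give Google far better frequency estimates than smaller corpora can.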
Of course, having access to relevant real-world, user-generated data is pointless if one can’t operate at large scale. Fortunately, Google pioneered many of the recently popular big data management and parallel computing technologies, so it is probably the company best equipped to exploit large-scale data. Big data technologies are essential pieces of infrastructure that Google engineers tap into: their speech algorithms wade through massive amounts of data on a regular basis, resulting in a virtuous cycle of refinements.
There are situations where embedded speech engines make sense (e.g., speech-enabled navigation systems should still work in the “middle of nowhere”). But Google’s access to relevant data and its big data skills make its general-purpose, cloud-based2 speech engines formidable. Hybrid systems that use cloud services when available, and otherwise fall back to embedded speech engines, were mentioned frequently at the conference. This is great news for players like Nuance that have both embedded and cloud engines. But as network connections become more reliable and ubiquitous, Google’s cloud-based (and big-data-driven) speech engines are going to get harder to beat. In recent years many speech companies have amassed lots of data, but in Google they face a competitor that leverages web-scale data.
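The hybrid pattern discussed at the conference is simple to sketch: try the cloud engine first and fall back to the embedded one when the network fails. The classes and method names below are hypothetical stand-ins, not any vendor's actual API:

```python
# Hypothetical sketch of a hybrid recognizer: prefer the cloud engine,
# fall back to an embedded engine when the network is unavailable.

class CloudEngine:
    """Stand-in for a network-backed recognizer."""
    def __init__(self, online=True):
        self.online = online

    def recognize(self, audio):
        if not self.online:
            raise ConnectionError("no network")
        return f"cloud:{audio}"

class EmbeddedEngine:
    """Stand-in for an on-device recognizer (smaller models, always available)."""
    def recognize(self, audio):
        return f"embedded:{audio}"

def hybrid_recognize(audio, cloud, embedded):
    """Use the cloud engine when reachable, otherwise the embedded one."""
    try:
        return cloud.recognize(audio)
    except ConnectionError:
        return embedded.recognize(audio)

print(hybrid_recognize("hello", CloudEngine(online=False), EmbeddedEngine()))
```

The design trade-off is the one the conference speakers highlighted: the embedded path guarantees availability, while the cloud path, when reachable, brings the larger, continuously retrained models.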
Microsoft, with its search engine and call-center data, speech products, and research group, is also a major player. It just isn’t clear whether it uses data from its interconnected services to benefit its speech products as efficiently as Google does.
(1) Michael Cohen was one of the founders of Nuance and is currently the Manager of Speech Technology at Google.
(2) Some of Google’s speech engines are easily accessible (at least on Android) through simple APIs.