Machine learning (ML) is all the rage, riding tight on the coattails of the “big data” wave. Like most technology hype, the enthusiasm far exceeds the realization of actual products. Arguably, not since Google’s tremendous innovations in the late ’90s/early 2000s has algorithmic technology led to a product that has permeated the popular culture. That’s not to say there haven’t been great ML wins since, but none have as been as impactful or had computational algorithms at their core. Netflix may use recommendation technology, but Netflix is still Netflix without it. There would be no Google if Page, Brin, et al., hadn’t exploited the graph structure of the web and anchor text to improve search.
So why is this? It’s not for lack of trying. How many startups have aimed to bring natural language processing (NLP) technology to the masses, only to fade into oblivion after people actually try their products? The challenge in building great products with ML lies not in just understanding basic ML theory, but in understanding the domain and problem sufficiently to operationalize intuitions into model design. Interesting problems don’t have simple off-the-shelf ML solutions. Progress in important ML application areas, like NLP, come from insights specific to these problems, rather than generic ML machinery. Often, specific insights into a problem and careful model design make the difference between a system that doesn’t work at all and one that people will actually use.
The goal of this essay is not to discourage people from building amazing products with ML at their cores, but to be clear about where I think the difficulty lies.
Progress in machine learning
Machine learning has come a long way over the last decade. Before I started grad school, training a large-margin classifier (e.g., SVM) was done via John Platt’s batch SMO algorithm. In that case, training time scaled poorly with the amount of training data. Writing the algorithm itself required understanding quadratic programming and was riddled with heuristics for selecting active constraints and black-art parameter tuning. Now, we know how to train a nearly performance-equivalent large-margin classifier in linear time using a (relatively) simple online algorithm (PDF). Similar strides have been made in (probabilistic) graphical models: Markov-chain Monte Carlo (MCMC) and variational methods have facilitated inference for arbitrarily complex graphical models . Anecdotally, take at look at papers over the last eight years in the proceedings of the Association for Computational Linguistics (ACL), the premiere natural language processing publication. A top paper from 2011 has orders of magnitude more technical ML sophistication than one from 2003.
On the education front, we’ve come a long way as well. As an undergrad at Stanford in the early-to-mid 2000s, I took Andrew Ng’s ML course and Daphne Koller’s probabilistic graphical model course. Both of these classes were among the best I took at Stanford and were only available to about 100 students a year. Koller’s course in particular was not only the best course I took at Stanford, but the one that taught me the most about teaching. Now, anyone can take these courses online.
As an applied ML person — specifically, natural language processing — much of this progress has made aspects of research significantly easier. However, the core decisions I make are not which abstract ML algorithm, loss-function, or objective to use, but what features and structure are relevant to solving my problem. This skill only comes with practice. So, while it’s great that a much wider audience will have an understanding of basic ML, it’s not the most difficult part of building intelligent systems.
Interesting problems are never off the shelf
The interesting problems that you’d actually want to solve are far messier than the abstractions used to describe standard ML problems. Take machine translation (MT), for example. Naively, MT looks like a statistical classification problem: You get an input foreign sentence and have to predict a target English sentence. Unfortunately, because the space of possible English is combinatorially large, you can’t treat
MT as a black-box classification problem. Instead, like most interesting ML applications, MT problems have a lot of structure and part of the job of a good researcher is decomposing the problem into smaller pieces that can be learned or encoded deterministically. My claim is that progress in complex problems like MT comes mostly from how we decompose and structure the solution space, rather than ML techniques used to learn within this space.
Machine translation has improved by leaps and bounds throughout the last decade. I think this progress has largely, but not entirely, come from keen insights into the specific problem, rather than generic ML improvements. Modern statistical MT originates from an amazing paper, “The mathematics of statistical machine translation” (PDF), which introduced the noisy-channel architecture on which future MT systems would be based. At a very simplistic level, this is how the model works : For each foreign word, there are potential English translations (including the null word for foreign words that have no English equivalent). Think of this as a probabilistic dictionary. These candidate translation words are then re-ordered to create a plausible English translation. There are many intricacies being glossed over: how to efficiently consider candidate English sentences and their permutations, what model is used to learn the systematic ways in which reordering occurs between languages, and the details about how to score the plausibility of the English candidate (the language model).
The core improvement in MT came from changing this model. So, rather than learning translation probabilities of individual words, to instead learn models of how to translate foreign phrases to English phrases. For instance, the German word “abends” translates roughly to the English prepositional phrase “in the evening.” Before phrase-based translation (PDF), a word-based model would only get to translate to a single English word, making it unlikely to arrive at the correct English translation . Phrase-based translation generally results in more accurate translations with fluid, idiomatic English output. Of course, adding phrased-based emissions introduces several additional complexities, including how to how to estimate phrase-emissions given that we never observe phrase segmentation; no one tells us that "in the evening" is a phrase that should match up to some foreign phrase. What’s surprising here is that there aren’t general ML improvements that are making this difference, but problem-specific model design. People can and have implemented more sophisticated ML techniques for various pieces of an MT system. And these do yield improvements, but typically far smaller than good problem-specific research insights.
Franz Och, one of the authors of the original Phrase-based papers, went on to Google and became the principle person behind the search company’s translation efforts. While the intellectual underpinnings of Google’s system go back to Och’s days as a research scientist at the Information Sciences Institute (and earlier as a graduate student), much of the gains beyond the insights underlying phrase-based translation (and minimum-error rate training, another of Och’s innovations) came from a massive software engineering effort to scale these ideas to the web. That effort itself yielded impressive research into large-scale language models and other areas of NLP. It’s important to note that Och, in addition to being a world-class researcher, is also, by all accounts, an incredibly impressive hacker and builder. It’s this rare combination of skill that can bring ideas all the way from a research project to where Google Translate is today.
Defining the problem
But I think there’s an even bigger barrier beyond ingenious model design and engineering skills. In the case of machine translation and speech recognition, the problem being solved is straightforward to understand and well-specified. Many of the NLP technologies that I think will revolutionize consumer products over the next decade are much vaguer. How, exactly, can we take the excellent research in structured topic models, discourse processing, or sentiment analysis and make a mass-appeal consumer product?
Consider summarization. We all know that in some way, we’ll want products that summarize and structure content. However, for computational and research reasons, you need to restrict the scope of this problem to something for which you can build a model, an algorithm, and ultimately evaluate. For instance, in the summarization literature, the problem of multi-document summarization is typically formulated as selecting a subset of sentences from the document collection and ordering them. Is this the right problem to be solving? Is the best way to summarize a piece of text a handful of full-length sentences? Even if a summarization is accurate, does the Franken-sentence structure yield summaries that feel inorganic to users?
Or, consider sentiment analysis. Do people really just want a coarse-grained thumbs-up or thumbs-down on a product or event? Or do they want a richer picture of sentiments toward individual aspects of an item (e.g., loved the food, hated the decor)? Do people care about determining sentiment attitudes of individual reviewers/utterances, or producing an accurate assessment of aggregate sentiment?
Typically, these decisions are made by a product person and are passed off to researchers and engineers to implement. The problem with this approach is that ML-core products are intimately constrained by what is technically and algorithmically feasible. In my experience, having a technical understanding of the range of related ML problems can inspire product ideas that might not occur to someone without this understanding. To draw a loose analogy, it’s like architecture. So much of the construction of a bridge is constrained by material resources and physics that it doesn’t make sense to have people without that technical background design a bridge.
The goal of all this is to say that if you want to build a rich ML product, you need to have a rich product/design/research/engineering team. All the way from the nitty gritty of how ML theory works to building systems to domain knowledge to higher-level product thinking to technical interaction and graphic design; preferably people who are world-class in one of these areas but also good in several. Small talented teams with all of these skills are better equipped to navigate the joint uncertainty with respect to product vision as well as model design. Large companies that have research and product people in entirely different buildings are ill-equipped to tackle these kinds of problems. The ML products of the future will come from startups with small founding teams that have this full context and can all fit in the proverbial garage.
: Although MCMC is a much older statistical technique, its broad use in large-scale machine learning applications is relatively recent.
: The model is generative, so what’s being described here is from the point-of-view of inference; the model’s generative story works in reverse.
: IBM model 3 introduced the concept of fertility to allow a given word to generate multiple independent target translation words. While this could generate the required translation, the probability of the model doing so is relatively low.