We recently faced the type of big data challenge we expect to become increasingly common: scaling up the performance of a machine learning classifier for a large set of unstructured data.
Machine learning algorithms can help make sense of data by classifying, clustering and summarizing items in a data set. In general, performance has limited the opportunities to apply machine learning to understanding big or messy data sets. Analysts need to factor in time for speeding up off-the-shelf algorithms or even whether a machine learning pass would complete in a timely manner. While using smaller random samples can help mitigate performance issues, some data sets lend themselves to improved results when applied to more data.
Here we share our experience implementing a set-oriented approach to machine learning that led to huge performance increases (more detail is available in a related post at O’Reilly Answers). Applying a set-oriented approach can help you expand the opportunities to gain the benefits of machine learning on larger, unstructured and complex data sets.
We are working with the US Department of Health and Human Services (HHS) on a project to look for trends in demand for jobs related to Electronic Medical Records (EMR) and Health Information Technology (HIT). The twist, and the reason we decided to build a classifier, is that we wanted to separate jobs for those using EMR systems from those building, implementing, running and selling EMR systems. While many jobs easily fit in one of the two buckets, plenty of job descriptions had duties and company descriptions that made classifying the jobs difficult even for humans with domain expertise.
Identifying the approximately 400,000 jobs with EMR and related references was achieved with high accuracy using a regular expression rule-base. All the job description data is stored on a multi-node Greenplum Massively Parallel Processing (MPP) database cluster, running a Postgres engine. Having an MPP database has been critical for analyzing the large, 1.4-billion-record data set we work with — we can generally run investigative queries against the full data set in minutes.
After some discussion, we decided a Naive Bayes classifier seemed appropriate for the task. While there are some Python open source naive bayes classifiers available, such as NLTK and Orange, I decided to use the algorithm in Toby Segaran’s “Programming Collective Intelligence” so I could tweak the code and play with different feature arrangements. Toby does a great job of tying the code to the principles behind the naive bayes algorithm, and I thought that would help with modding and tuning the classifier for our purposes.
We had a tricky data set with categories that could be only subtly different. We wanted the classifier to be fast enough to iterate through the data many times so we could spend enough time training and tuning the algorithm to optimize classifier accuracy. Starting with a training set of 1,800 categorized jobs (phew…) and a random sample of 1,850 jobs, we set to work trying and reviewing different sets of feature combinations.
That’s when it hit me, job descriptions have lots of words that are often not carefully entered. That creates a large set of words and probabilities to work with, slowing down the algorithm. And, the algorithm was written to explain how Naive Bayes works, not for maximum efficiency. With training and classifying the sample data running for more than six hours, we needed to do something to speed up the process to handle all 400,000 records we wanted to classify.
I contacted Daisy Zhe Wang, an EECS doctoral student at UC Berkeley and a consultant at Bayes Informatics, because of her focus on scaling in-database natural language and machine learning algorithms.
Daisy, together with Bayes Informatics founder Milenko Petrovic, developed a set-oriented approach to implementing the Naive Bayes algorithm that treats the data derived from the training set (features (words) and counts) as a single entity, and converting the Naive Bayes algorithm to Python User Defined Functions (UDFs) that, since Greenplum is a distributed database platform, let us parallelize the classifier process.
The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes — more than a four-orders-of-magnitude records processed per minute (26,000-fold) improvement. A process that would have run for days, in its initial implementation, now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.
My simple understanding of their algorithm is that training set results are treated like a model and stored as a single row/column in the database. They’re parsed into a permanent Python data structure once, while each job description is parsed into another temporary data structure. The Python UDFs compare the words in the temporary data structure to the words in the model. The result is one database read for each job description and a single write once the probabilities are compared and the classification assignment made. That’s quite a contrast from reading and writing each word in the training set and the unassigned job.
Why does the set-oriented approach to machine learning matter? Performance and scale issues have long been a problem when trying to fully apply machine learning to large or unruly unstructured data sets. Set-oriented machine learning provides a straightforward way to bypass performance roadblocks, making machine learning a viable option for categorizing, clustering or summarizing large data sets or data sets with big chunks of data (e.g., descriptions or items with large numbers of features).
With any data set, speeding up machine learning processes allows quicker iterations through the data. That creates room to run experiments that improve accuracy and more time to focus on and interpret results to gain insights about the data. Quicker processing reduces the risk of applying machine learning to new topics by reducing the time investment to determine if results are worthwhile.
In summary, the performance boost provided by set-oriented machine learning makes for:
- Handling larger and more diverse data sets
- Applying machine learning to a larger set of problems
- Faster turnarounds
- Less risk
- Better focus on a problem
- Improved accuracy, greater understanding and more usable results