By Clint Burford, Machine Learning Software Engineer at Evernote
All the way back in December 2012 we launched Food 2.0 and its “My Cookbook” feature with the ability to find and display recipes that had previously been added via the user’s Evernote account. In our follow-up technical blog post we talked about some of the technical challenges involved in building the classifier and integrating it: sourcing representative training data in 11 languages, choosing features, choosing a model, and designing a generic flow for classification in the Evernote platform. I recommend checking out that post if you haven’t read it. In this post I’m going to talk about a second phase of development that took place in the early part of this year and yielded an impressive new feature and some noticeable overall accuracy improvements.
Recipe Image Classification
When we designed the first version of the classifier we deliberately focused on the core use case: automatically identifying recipes that had been clipped from the web or manually typed into Evernote. Both of these types of recipes tend to be made up of text and a few images. The images can be ignored and word-based features can easily be extracted from the encoded text contents. But clipping and typing aren’t the only ways to get recipes into Evernote. The Evernote mobile clients are great for taking photographs of recipes from printed or hand-written books. Similarly, lots of folks love to hook up their Evernote desktop clients to their scanners and archive recipes scanned from magazines and books.
Our primary goal for the second version of the recipe classifier was to support these use cases. Fortunately we had a head start, because the Evernote platform already has a state-of-the-art text recognition system for extracting printed and hand-written text from pictures and scans. Supporting recipe image classification would seem to be as simple as piping the output of the text recognition system into the recipe classifier.
Unfortunately things are not quite that simple because there is a great deal of uncertainty inherent in text recognition. Photos and scans in Evernote can contain text that is out of focus, cropped, or just plain convoluted and messy. Happily, Evernote has a great way of handling this uncertainty. For every possible word it finds in an image, it generates an internal list of interpretations, each with a numeric rating representing the strength of the system’s confidence about that interpretation.
To get the recipe classifier working with the recognition system, we needed a set of rules for incorporating confidence-weighted recognition candidates into the feature model. It took some time to decide what these should be, but in the end it was relatively simple. That said, this wasn’t the end of the story. Our evaluation (on a diverse multi-language corpus of hand-written and printed recipe photos taken for us by our very accommodating oDesk contractors) showed that performance was solid in many of our supported languages, but left room for improvement in others.
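To make the idea concrete, here is a small sketch of one plausible rule set. The real output format of Evernote’s recognition system is internal, so the shape of the input (a list of candidate interpretations per word slot, each with a confidence score) and the function name are assumptions for illustration only:

```python
# Hypothetical sketch: fold per-word OCR candidate lists into a weighted
# bag-of-words. Each word slot is assumed to be a list of
# (candidate_text, confidence) pairs; this is NOT the actual internal format.
from collections import defaultdict

def weighted_bag_of_words(recognition_output, min_confidence=0.3):
    """Turn per-slot candidate lists into fractional word counts.

    Each surviving candidate contributes its share of the slot's total
    confidence, so strong interpretations dominate while weaker
    alternatives still add a little signal.
    """
    features = defaultdict(float)
    for candidates in recognition_output:
        kept = [(w, c) for w, c in candidates if c >= min_confidence]
        total = sum(c for _, c in kept)
        if total == 0:
            continue  # no candidate was confident enough for this slot
        for word, conf in kept:
            features[word.lower()] += conf / total
    return dict(features)

ocr = [
    [("flour", 0.9), ("floor", 0.4)],  # two plausible readings of one word
    [("sugar", 0.95)],                 # an unambiguous word
]
feats = weighted_bag_of_words(ocr)
print(feats)  # "flour" outweighs "floor"; "sugar" gets full weight 1.0
```

One nice property of normalizing per slot is that an unambiguous word and a hotly contested one each contribute the same total mass to the feature vector.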
Support Vector Machines
There are three semi-orthogonal dimensions to classifier design: feature engineering, model selection, and training data acquisition. Engineering a good classifier is about knowing which of these three will give the biggest return for the least effort. Having just spent some time on feature engineering we decided that moving to a more advanced model would give us a relatively easy win and set us up for the future.
Our original choice of Naive Bayes was about focusing on getting our first implementation off the ground without over-complicating things. Support Vector Machines (SVMs) are considered the best option for a lot of different machine learning problem types these days, and they happen to be particularly well-suited to document classification. One of the reasons SVMs are good for document classification is that they do a good job of handling large numbers of semi-redundant and irrelevant features. By adding the confidence-weighted word features output by the text recognition system we had added a significant number of these, possibly confusing the Naive Bayes model in the process, and there was reason to hope that SVMs would do better.
We did our prototyping for SVMs using the excellent scikit-learn and built our final implementation using libSVM, a popular open-source library. Having access to a trustworthy existing implementation of a complex algorithm like SVMs took away a source of great potential pain!
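For a flavor of how little code a scikit-learn SVM prototype requires, here is a toy sketch. The six documents and their labels are invented for illustration; the real classifier trained on a far larger multilingual corpus:

```python
# Minimal scikit-learn sketch of an SVM document classifier, in the spirit
# of our prototyping stage. The tiny corpus below is made up for this post.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "preheat the oven and mix flour sugar and butter",    # recipes (label 1)
    "simmer the sauce then bake for thirty minutes",
    "whisk two eggs with a cup of milk and flour",
    "quarterly meeting agenda and project status notes",  # non-recipes (0)
    "flight itinerary and hotel booking confirmation",
    "todo list call the bank and renew car insurance",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features weighted by tf-idf, fed to a linear-kernel SVM.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

new_doc = ["mix the flour and sugar then bake in the oven"]
prediction = clf.predict(vectorizer.transform(new_doc))
print(prediction[0])
```

A linear kernel is the usual choice for text: with tens of thousands of word features the data is often already linearly separable, and training stays fast.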
Just as we hoped, SVMs gave us a big performance boost when classifying recipe images. Fortunately we had a bit more time to spare and an opportunity to try for more gains while further enhancing our toolkit in anticipation of tackling more complex classification problems in future.
It was time to look back at our training data again. The obvious move was to go for more volume by having our oDesk workers collate more web data for us, but this didn’t feel right. Beyond a certain point, one training item is much like another, and the law of diminishing returns means that the reward-to-effort ratio is low. What we wanted was not more of the same training data, but a collection of high-value training data that would really “teach the classifier a lesson” by focusing on its mistakes.

It turns out that machine learning researchers have devised a great way of doing this, one that hinges on the fact that most classifiers actually know when they are most likely to be wrong. They know this because their decision functions output numeric scores that can be interpreted as measures of confidence. Uncertainty sampling is one of the simplest techniques from a family known as Active Learning, where “active” refers to the fact that the machine learning algorithm itself guides the data acquisition process. Here is the algorithm:
1. Create an initial classifier
2. While budget remains for labeling:
   (i) Apply the classifier to each unlabeled example in the corpus
   (ii) Find the b examples for which the classifier is least confident
   (iii) Have the b examples manually labeled
   (iv) Train a new classifier on all labeled examples
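The loop above can be sketched in a few lines of Python. The word-overlap “classifier” and the `oracle` function here are hypothetical stand-ins for a real model and the human labelers:

```python
# Illustrative sketch of the uncertainty-sampling loop. The toy word-overlap
# scorer stands in for a real classifier; "oracle" stands in for the human
# labelers. All names and data here are invented for this sketch.

def train(labeled):
    """'Train' by collecting the vocabulary seen in each class."""
    vocab = {0: set(), 1: set()}
    for words, label in labeled:
        vocab[label] |= set(words)
    return vocab

def confidence(model, words):
    """Score in [-1, 1]: sign picks the class; magnitude is confidence."""
    w = set(words)
    pos, neg = len(w & model[1]), len(w & model[0])
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def uncertainty_sampling(labeled, unlabeled, oracle, budget, b=2):
    while budget > 0 and unlabeled:
        model = train(labeled)
        # rank unlabeled examples by how unsure the model is (|score| near 0)
        unlabeled.sort(key=lambda words: abs(confidence(model, words)))
        batch, unlabeled = unlabeled[:b], unlabeled[b:]
        labeled += [(words, oracle(words)) for words in batch]
        budget -= len(batch)
    return train(labeled)

labeled = [(["flour", "sugar", "bake"], 1), (["meeting", "agenda"], 0)]
unlabeled = [["flour", "bake", "oven"], ["flour", "agenda"],
             ["quarterly", "report"]]
oracle = lambda words: 1 if "flour" in words or "bake" in words else 0
model = uncertainty_sampling(labeled, unlabeled, oracle, budget=2, b=2)
```

Note that the confident example (`["flour", "bake", "oven"]`) is skipped: the labeling budget goes to the ambiguous and unknown examples, which is the whole point.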
For an original academic reference for this approach, check out this paper.
Uncertainty sampling works to maximize the return for labeling effort by zeroing in on the examples in the corpus that are the “hardest” to classify. Setting a relatively small value for b allows the classifier to be trained on successive sets of training data that are tailored to address the classifier’s particular weaknesses as it improves over time.
The obvious choice for a large corpus that is likely to contain lots of potentially confusable recipes and non-recipes is the web. Fortunately, the job of spidering the entire web and making it conveniently available for work like this has already been done for us. The Common Crawl Corpus is a freely available, 6 billion page, 100 TB archive of the web from 2012. For the record, we found that running our classifier over a relatively small randomly selected portion of the corpus gave us more than enough hard-to-classify examples, obviating the need for spinning up massive numbers of EC2 instances to do our processing.
We were delighted with the performance improvements we got from our move to SVMs and our adoption of uncertainty sampling for sourcing maximum-value unlabeled data from the web. We’re also excited about the future prospects for applying some of this technology to helping make people smarter with Evernote. Your feedback about this area is most welcome.