With the recent addition of Chinese support, Evernote Recognition System (ENRS) indexes handwritten notes in 24 languages. Each time we add another language, we have to overcome new challenges specific to the particular alphabet and style of writing.
Yet another batch of challenges came once we approached the CJK group of languages: Chinese, Japanese and Korean. These languages require support for two orders of magnitude more symbols, each vastly more complex. Writing does not require spaces between words. Fast scribbling shifts to cursive, making interpretation rely heavily on context.
Before going into the specifics of CJK support, let’s first look at the challenges that need to be addressed for Latin script recognition. For our engine, the first step in parsing a handwritten page is finding lines of text. This can already be a non-trivial task. Let’s take a look at an example:
Lines could be curved. Letters from different lines cross each other and the distance between lines varies randomly. The line segmentation algorithm has to follow the lines, untangling the accidental connections as it goes.
The next challenge comes once the lines are extracted: how to split them into words. This task is mostly simple for printed text, where the gaps between letters are clearly smaller than the gaps between words. With handwriting, in many cases it is not possible to tell by distance alone whether a gap is a break between symbols or a break between words:
What helps here is understanding what is written: once the words are understood, you can tell where each begins and ends. But this requires the ability to recognize the line as a whole, rather than reading word after word the way most regular OCR engines operate. Seen this way, even for European languages the task of recognizing handwriting turns out to be not that different from the challenges of processing CJK texts. To illustrate, here is an example of Korean handwriting:
Each line’s flow needs to be traced similarly, with possible overlaps untangled. After a line is singled out, there is no way to even attempt a space-based word segmentation. As with European handwriting, the solution would be to do recognition and segmentation in a single algorithm, using the understanding of recognized words to decide where the word boundaries are to be found.
Now, let’s look at the steps of symbol interpretation. The engine first estimates where individual characters could begin. These candidate points are the smaller whitespaces between strokes and the specific connecting elements characteristic of cursive handwriting. We have to ‘oversegment’ here, placing extra division points, because at this stage we have no clear idea whether a segmentation point correctly falls on a boundary between symbols or lands inside a symbol:
To assemble the actual symbols, we will try to combine these smaller parts into bigger blocks, estimating every combination. The next image illustrates an attempt to recognize the combination of the first two blocks:
Of course, this means that we will have to recognize many more symbol variants than there are actual characters written. And for CJK languages, this in turn means that the recognition process becomes much slower than it is for Latin languages, as estimating different combinations is multiplied by so many more symbols to consider. The core of our symbol recognizer is a set of SVM (“Support Vector Machine”) decision engines, each solving the problem of recognizing its assigned symbol ‘against all the rest.’
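To make the growth in candidates concrete, here is a minimal sketch of the merging step described above. It is an illustration under assumptions, not the engine's actual code: the function name `candidate_blocks`, the `max_parts` limit, and the pixel positions are all hypothetical.

```python
def candidate_blocks(cut_points, max_parts=3):
    """Return (start, end) spans made by merging 1..max_parts adjacent pieces.

    cut_points are the oversegmentation positions along the line,
    including the line's left and right edges (hypothetical pixel values).
    """
    spans = []
    last = len(cut_points) - 1
    for i in range(last):
        # Merge piece i with up to max_parts - 1 of its right neighbours.
        for j in range(i + 1, min(i + max_parts, last) + 1):
            spans.append((cut_points[i], cut_points[j]))
    return spans


# Five cut points delimit four small pieces of ink.
cuts = [0, 12, 20, 35, 41]
blocks = candidate_blocks(cuts, max_parts=2)
print(len(blocks), blocks)
```

With four pieces and merges of at most two neighbours, seven candidate blocks have to be sent to the symbol recognizer, even though the line may contain only two or three real characters.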
English needs about 50 of these one-against-the-rest engines (all Latin letters plus symbols), but supporting the most common Chinese symbols requires 3,750 of them! That would have been 75 times slower, had we not devised a way to run only a fraction of these decision engines each time.
Our solution is to first employ a set of simpler and faster SVMs that pre-select a group of similarly written symbols for a given input. This approach usually narrows the candidates down to only five to six percent of the whole character set, speeding up the overall recognition process about 20 times.
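The two-stage idea can be sketched as follows. This is a toy illustration, not our production code: plain linear scorers stand in for the SVMs, and the group names, weights, and two-number feature vector are invented for the example.

```python
def dot(weights, features):
    return sum(w * f for w, f in zip(weights, features))


def classify_cascade(features, coarse, fine):
    """Pick a symbol in two stages.

    coarse: {group_name: weights}            cheap pre-selection scorers
    fine:   {group_name: {symbol: weights}}  one-against-the-rest scorers
    Returns the winning symbol and how many fine scorers were run.
    """
    # Stage 1: choose the group of similarly written symbols.
    group = max(coarse, key=lambda g: dot(coarse[g], features))
    # Stage 2: run only that group's scorers instead of the full set.
    scores = {sym: dot(w, features) for sym, w in fine[group].items()}
    return max(scores, key=scores.get), len(scores)


# Hypothetical weights for a two-dimensional feature vector.
coarse = {"round": [1.0, 0.0], "angular": [0.0, 1.0]}
fine = {
    "round": {"o": [1.0, -1.0], "c": [0.5, 0.2]},
    "angular": {"k": [0.1, 1.0], "x": [-0.2, 0.8]},
}
symbol, evaluated = classify_cascade([0.9, 0.1], coarse, fine)
print(symbol, evaluated)

# Rough cost estimate from the figures above, ignoring the cost of stage 1.
TOTAL_ENGINES = 3750
for fraction in (0.05, 0.06):
    print(fraction, TOTAL_ENGINES * fraction, 1 / fraction)
```

Running five to six percent of 3,750 engines means roughly 190 to 225 evaluations per candidate, which is where a speedup in the range of 17 to 20 times comes from before the pre-selection cost is counted.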
To decide which of the multiple possible interpretations of the handwritten symbols should be selected for the final answer, we now need to refer to different language models: context that allows us to build the most sensible interpretation out of all the symbol variants generated by the SVMs. Interpretation starts with simply weighing the most common two-symbol combinations, then raises the context level to frequent three-symbol sequences, and on to dictionary words and known structured patterns like dates, phone numbers, and emails. Next come probable combinations of the proposed words and patterns set together. At no point before all the possibilities have been weighed in depth can you tell for sure what the best way to interpret the writing of that line is. Only by evaluating millions and millions of possible combinations together, similar to how “Deep Blue” analyzed myriad chess positions when playing against Kasparov, is it possible to arrive at the optimal interpretation.
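As a much-reduced illustration of this search, the sketch below combines shape scores with a two-symbol (bigram) model and uses dynamic programming to find the best path through the candidate blocks. It is an assumption-laden toy: the real system layers trigrams, dictionaries, and patterns on top, and the function name, scores, and the `unseen` penalty are all hypothetical.

```python
import math


def best_interpretation(n_pieces, candidates, bigram_logp, unseen=-5.0):
    """Find the highest-scoring symbol sequence covering pieces 0..n_pieces.

    candidates:  {(start, end): [(symbol, shape_log_score), ...]}
                 spans are piece indices with start < end
    bigram_logp: {(previous_symbol, symbol): log_probability}
                 "^" marks the start of the line
    """
    # best[(position, last_symbol)] = (total_score, [(symbol, start, end), ...])
    best = {(0, "^"): (0.0, [])}
    for pos in range(n_pieces):
        states = [(k, v) for k, v in best.items() if k[0] == pos]
        for (_, prev), (score, path) in states:
            for (start, end), options in candidates.items():
                if start != pos:
                    continue
                for sym, shape in options:
                    total = score + shape + bigram_logp.get((prev, sym), unseen)
                    key = (end, sym)
                    if key not in best or total > best[key][0]:
                        best[key] = (total, path + [(sym, start, end)])
    finals = [v for k, v in best.items() if k[0] == n_pieces]
    return max(finals, key=lambda v: v[0]) if finals else (-math.inf, [])


# Three ink pieces: the first two could be "c" + "l" or one merged "d".
candidates = {
    (0, 1): [("c", -0.2)],
    (1, 2): [("l", -0.3)],
    (0, 2): [("d", -0.4)],
    (2, 3): [("o", -0.1)],
}
bigram_logp = {
    ("^", "c"): -1.0, ("c", "l"): -1.5, ("l", "o"): -1.0,
    ("^", "d"): -1.0, ("d", "o"): -0.5,
}
score, path = best_interpretation(3, candidates, bigram_logp)
print("".join(sym for sym, _, _ in path), path)
```

In this toy case the individual shapes slightly favour "c" and "l", but the language context tips the balance toward "do". The winning path also carries the span of every symbol, so the segmentation comes out of the same search as the recognition.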
Once the best interpretation for the line is established, it finally defines the word segmentation. The green frames overlaid on the images below show the best segmentation into words the system could devise:
And, as you can see, the process turned out to be mostly the same for both European and CJK handwriting!