Evernote Indexing System

The Evernote Indexing System is designed to extend Evernote's search capabilities beyond text documents into media files. Its task is to comb through those files and bring any textual information they contain into the searchable domain. It currently processes images, PDFs, and digital ink documents, with provisions to extend the service to other media types. The produced index is delivered as an XML or PDF document containing recognized words, alternative spellings, associated confidence levels, and their location rectangles.
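As a rough illustration, an index entry for a single recognized word might look something like the sketch below. The element and attribute names here are hypothetical, chosen only to show the shape of the data: a location rectangle plus alternative readings with their confidence levels.

```xml
<recoIndex objWidth="1280" objHeight="960">
  <!-- one detected word: its location rectangle in the source image,
       plus alternative readings weighted by confidence (0-100) -->
  <item x="412" y="180" w="230" h="56">
    <t w="92">Coffee</t>
    <t w="37">Coffe</t>
  </item>
</recoIndex>
```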

The Indexing System is implemented as a farm of dedicated Debian64 servers, each running an AMP processor and multiple ENRS processes, usually one per CPU core. ENRS (EN Recognition Server) is a set of native libraries wrapped into a Java6 web server application. It currently houses two components, AIR and ANR: the first handles various image types and PDFs, and the second is dedicated to digital ink documents. AMP communicates with the servers through a simple HTTP REST API, which allows flexible system configuration while maintaining the high throughput essential for moving large media files.
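To make the division of labor concrete, here is a minimal sketch of what an AMP-side submission to an ENRS process could look like. The endpoint, port, and parameter names are assumptions for illustration only; the actual API is not public.

```python
import requests  # third-party HTTP client, used here for brevity

ENRS_URL = "http://enrs-host:8080/recognize"  # hypothetical endpoint

def submit_resource(path: str, mime_type: str, languages: list[str]) -> str:
    """POST one media file to an ENRS process and return the XML index."""
    with open(path, "rb") as f:
        resp = requests.post(
            ENRS_URL,
            params={"mime": mime_type, "lang": ",".join(languages)},
            data=f,            # stream the (possibly large) file body
            timeout=300,       # large PDFs can take a while
        )
    resp.raise_for_status()
    return resp.text           # the XML index described above

index_xml = submit_resource("photo.jpg", "image/jpeg", ["en"])
```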

AMP retrieves resources from the user store shards and returns the created indexes. These are included in the search index for the EN Web Service and passed to Evernote phone and desktop clients to enable in-media searches locally. To minimize the extra traffic imposed on shards already busy with user requests, AMPs broadcast queue information to each other, forming a single distributed media processor optimized for the current EN Service load and processing priorities. The Indexing System is resilient enough to remain operational even if only one component of each type remains functional (currently there are 37 AMP processors and over 500 ENRS server processes in operation, processing around 2 million media files a day).
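The coordination mechanism is not described in detail, but the idea can be sketched as follows: each AMP broadcasts its queue depth, and a busy shard is polled only by the least-loaded AMPs. Everything below is a hypothetical illustration of that idea, not the actual implementation.

```python
# Hypothetical sketch of AMP coordination via queue-depth broadcasts.
peer_depths: dict[str, int] = {}   # amp_id -> last broadcast queue depth

def on_broadcast(amp_id: str, depth: int) -> None:
    """Record a peer's advertised queue depth."""
    peer_depths[amp_id] = depth

def should_poll_shard(my_id: str, my_depth: int, max_pollers: int = 3) -> bool:
    """Pull new work from a shard only if this AMP is among the
    max_pollers least-loaded ones, sparing busy shards extra traffic."""
    depths = dict(peer_depths)
    depths[my_id] = my_depth
    ranked = sorted(depths, key=depths.get)
    return my_id in ranked[:max_pollers]

on_broadcast("amp-2", 40)
on_broadcast("amp-3", 5)
print(should_poll_shard("amp-1", 12))   # True: amp-1 is among the least loaded
```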

EN Indexing System diagram

Let’s have a closer look at the AIR part of the ENRS server. AIR’s recognition philosophy differs from that of ubiquitous OCR systems: its goal is to produce a comprehensive search index rather than printable text. Its focus is on finding as many words as possible in an image of any kind and quality, and it has the flexibility to produce alternative readings for incomplete, unclear, or blurred words.

To deal with real-world images, the AIR server does its processing in multiple ‘passes’, each focusing on a different set of assumptions. The image may be huge but contain just a few words. It may contain scattered words at different orientations. Fonts may be very small and quite large in the same area. Text may alternate between black-on-white and white-on-black. It could be a mix of different languages and alphabets. For Asian languages, horizontal and vertical lines may be present in the same area. Similar-intensity font colors may blend into the same gray levels under standard OCR processing. Printed text may include handwritten comments. Advertising art may be warped, slanted, or changing size on the go. And those are just a few of the problems that AIR servers currently face about two million times a day.

Inside the AIR server

Below is a diagram of the main building block of the AIR server: a single ‘pass’. Depending on the call parameters, it specializes in a different kind of processing (scale, orientation, etc.), but the general scheme stays the same. It starts with the preparation of a set of images specific to the pass: scaled, converted to grayscale, or binarized, depending on the pass. Then image graphics, tables, markup, and other non-text artifacts are removed as much as possible to let the system focus on actual words. After candidate words are detected, they are assembled into proposed text lines and blocks.
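As a structural sketch, a single pass boils down to the pipeline below. The stage names are illustrative placeholders and every body is stubbed out; only the shape of the flow comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class PassParams:
    scale: float = 1.0        # pass-specific scaling factor
    binarize: bool = True     # whether this pass works on a binarized image
    orientation: int = 0      # assumed text orientation, in degrees

def prepare_images(image, params):
    # stub: produce scaled / grayscale / binarized versions for this pass
    return image

def remove_non_text(image):
    # stub: strip graphics, tables, markup and other non-text artifacts
    return image

def detect_candidate_words(image, params):
    # stub: return candidate word regions
    return []

def assemble_lines_and_blocks(words):
    # stub: group candidate words into proposed text lines and blocks
    return {"blocks": [], "words": words}

def run_pass(image, params: PassParams):
    """Skeleton of one AIR 'pass' as described above."""
    prepared = prepare_images(image, params)
    cleaned = remove_non_text(prepared)
    words = detect_candidate_words(cleaned, params)
    return assemble_lines_and_blocks(words)
```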

Each line of each block then passes through a number of recognition engines, including engines developed internally and engines licensed from other vendors. Employing multiple recognition engines is important not only because they specialize in different types of text and languages, but also because it enables a ‘voting’ mechanism: analyzing the word alternatives produced by diverse engines for the same word suppresses false recognitions and gives more confidence to consensus variants. These confident answers become the pillars on which the final stage of the ‘pass’ bases its text line reconstruction, re-deciding the structure of text lines and word segmentation and purging most of the less confident variants to reduce false positives in search.
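The voting idea can be shown with a toy example. Real scoring is certainly far more involved; this sketch only demonstrates how consensus among engines boosts a variant while lone low-confidence readings sink toward the purge threshold.

```python
from collections import defaultdict

def vote(engine_results: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Combine per-engine alternatives for one word image.
    Each dict maps a proposed word to that engine's confidence."""
    total = defaultdict(float)
    votes = defaultdict(int)
    for alternatives in engine_results:
        for word, conf in alternatives.items():
            total[word] += conf
            votes[word] += 1
    # consensus boost: weight the summed confidence by the vote count
    ranked = {w: total[w] * votes[w] for w in total}
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)

# Three engines read the same word image:
print(vote([{"Coffee": 0.9, "Coffe": 0.4},
            {"Coffee": 0.8},
            {"Corree": 0.5}]))
# 'Coffee' wins with a consensus-boosted score; the stragglers can be
# kept as low-confidence alternatives or purged.
```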

Diagram of a single AIR 'pass'

The number of passes to make is determined initially by the image rendering and analysis module, but as recognition progresses this number may be increased or reduced. For a clean scan of a document, it may be enough to run only the standard OCR processing. A snapshot of a complex scene taken by a phone camera under poor lighting may require deep analysis, with a full set of passes, to retrieve most of the textual data. Lots of colored words on a complex background may require additional passes specifically tailored to color separation. Small blurred text will require expensive reverse-digital-filtering techniques to restore the text image before any recognition is attempted. Once all passes are complete, another critical part of the AIR processing takes the stage: final results assembly. On complex images, different passes may have produced a wild variety of interpretations of the same areas. All these conflicts must be reconciled, the best interpretations selected, most of the incorrect alternatives rejected, and the final blocks and lines of text built.
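A hypothetical sketch of such adaptive pass planning, with made-up analysis flags and pass names, might look like this:

```python
def plan_passes(analysis: dict) -> list[str]:
    """Choose the set of passes from the initial image analysis.
    The flags and pass names here are illustrative, not the real ones."""
    if analysis.get("clean_document_scan"):
        return ["standard_ocr"]            # one pass may be enough
    passes = ["standard_ocr", "small_text", "rotated_text", "inverse_text"]
    if analysis.get("colored_text_on_complex_background"):
        passes.append("color_separation")  # extra color-separation passes
    if analysis.get("small_blurred_text"):
        passes.append("deblur_then_ocr")   # reverse digital filtering first
    return passes

print(plan_passes({"small_blurred_text": True}))
```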

Once the internal document structure is finalized, only the last step remains: creating the requested output format. For PDF documents it is still a PDF, with images replaced by text boxes of recognized words. For all other input documents it is an XML index containing the list of recognized words and their bounding boxes (or stroke lists, for digital ink documents). This location information allows the searched word to be highlighted over the source image, or over the text of an ink document, when a user looks for the document containing it.

Comments

  1. Good article, however, it does not cover several of the most important issues from a user’s perspective.

    – How are the Free and Premium user queues managed?
    – What are typical service times for the queues?
    – How can a user determine if a note’s resources have been indexed?
    – What happens to pending unindexed resource requests when a user upgrades from free to premium?

    • The post was intended to be more of a technical design overview, so it did not focus much on the usage side, but let’s touch on that too.
      To your questions:
      — free and premium queues are actually the same queue; premium images are simply inserted at the front instead of being placed at the end (see the sketch after this reply).
      — so if the indexing system is not too busy, there may not be much of a difference in processing time: in both cases an image will pass the full cycle in a matter of minutes. But if there is a spike in user activity, the reco queue may span an hour or more. In that case premium images will still be processed in a couple of minutes, while regular submissions will have to wait.
      — to check whether a note has been indexed, sync your desktop EN client again a few minutes after creating the note. For PDF notes, check for the option ‘Save Searchable PDF’ when you right-click the PDF object in question. If this line is present in the menu, the note has been indexed. For image notes, choose to ‘Export’ the note to the ‘archive’ .enex format. This produces an XML file that you can open with a text editor or your browser. Look near the end for the tag that wraps the image indexing results; if it is present, the image has been processed and is searchable.
      — once a note is created and synced up to the Service, its resources are processed according to the user settings at that moment, including language preferences, premium status, etc. If any of those change after the note was synced up, the processing is not affected: the note’s images will stay at the end of the reco queue even if the user’s status changes to ‘premium’, and they will still be recognized as English-only even if the preferences change from ‘English’ to ‘Japanese+English’ while the resources are still in the queue.
      N.B. The EN Service keeps ‘fingerprints’ of all images and PDFs in the system to avoid storing multiple copies of the same file. So even after the language preferences are changed, the same image included in a different note will just assume the personality of its first copy, including its reco index. To have the index change to a different language, you would need to provide a slightly modified version of the image, or request reindexing of all notes in the user’s personal settings.
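      As a toy sketch of that single shared queue (purely illustrative, not the actual implementation): premium submissions jump to the front, free submissions join the back, and workers always take the next job from the front.

      ```python
      from collections import deque

      reco_queue = deque()   # one shared queue for all users

      def enqueue(image, premium: bool) -> None:
          if premium:
              reco_queue.appendleft(image)   # front: ahead of waiting free items
          else:
              reco_queue.append(image)       # back: normal FIFO order

      def next_job():
          """Workers always pop from the front."""
          return reco_queue.popleft() if reco_queue else None
      ```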

      • Thanks for the response. Very helpful.

        I’m a bit confused about the queue management as described: it suggests LIFO for Premium and FIFO for Free.

        The service times as described suggest a service-time distribution with 1 sigma of 2 minutes, 2 sigma of 15 minutes, and 3 sigma of perhaps 2 hours. However, I have seen frequent comments on the forum about unsearchable, and therefore unindexed, resources that have lasted days. Some of those are explainable by the user not understanding the desktop client’s two-sync requirement to see the indexing. Some are not.

        What is the minimum change required to a note to force a re-index of its resources? Something like touching the note body with a null change? e.g.

        Wish there was a more direct attribute that could be searched/tested to determine a note’s index status, however…

        Thanks again for the very useful information.

      • My example of a null change got trashed by your html filter.

        e.g. [space][backspace]

      • PS: I do understand that processing time for a resource will add to total transaction processing time. However, the worst case example that Dave Engberg mentioned in his ETC presentation was, IIRC, about an hour for a specific PDF.

      • PPS: Guessing 3 sigma for premium users of about 5 minutes, excepting any abnormal site maintenance problems.

  2. >> What is the minimum change required to a note to force a re-index of its resources?

    Actually, it is not the note change that matters, but the *resource* itself. The ‘fingerprint’ is stored separately for each resource, so if you want an image to be indexed anew, it must actually be different, even if only by a single bit. Changing the note that contains it will have no effect.

    >> I do understand that processing time for a resource will add to total transaction processing time.

    Yes – while it is just seconds for an average image, it will certainly be much longer for a huge multi-page PDF.

  3. Alex, this has been a big help in clarifying my understanding of how Evernote works.

    Thanks.

  4. The OCR you do is quite extraordinary. How long has the system been in development?

    • Thank you! The current system has been in the works for the last 10 years, and the first commercial recognition system we did became part of the Newton project, back in 1992.

