Lucene: We Got Some Explaining to Do

Posted by Dave Engberg on 25 Aug 2011

[I was originally going to go with “I Love Lucene”, but did a quick Google search and found that TheServerSide beat me to it…]

As we mentioned in our architectural overview post, the data from each note is spread across three different storage systems on each shard.

All of the metadata about each note goes into structured tables in MySQL. And by “metadata”, I mean all of the fields in the data model structures for a Note and its Resources, except for the Resource’s raw data body and any recognition/alternate data files.

Those Resource files are de-duplicated in software on each shard (using MD5+length) and then stored on a relatively simple hierarchical file system using a folder tree derived from the MD5 checksum.
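A minimal sketch of this kind of checksum-derived layout, in plain Java. The class name, the two-level `ab/cd/` fan-out, and the `checksum:length` dedup key are my assumptions for illustration, not Evernote's actual scheme — the post only says the key combines MD5 and length and the folder tree is derived from the checksum.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: derive a dedup key and storage path from a
// resource body's MD5 checksum, as described in the post.
public class ResourceStore {

    // Hex-encode the MD5 digest of the resource body.
    static String md5Hex(byte[] body) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(body)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Dedup key: checksum plus length, so two bodies are treated as
    // duplicates only when both values match.
    static String dedupKey(byte[] body) {
        return md5Hex(body) + ":" + body.length;
    }

    // Folder tree derived from the checksum, e.g. "90/01/900150...".
    // The two-level fan-out keeps any single directory from growing huge.
    static String pathFor(byte[] body) {
        String hex = md5Hex(body);
        return hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/" + hex;
    }
}
```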

The combination of MySQL and the file system allows us to store the full contents of the data model and support the vast majority of our API calls. Text-based searches on our servers, however, require a Full-Text Search (FTS) engine to provide usable performance across large data sets.

This first required us to define the semantics of our search specification (see Appendix C of our API Overview). For example, if I type the word “spatula” into Evernote’s search box, should that match notes that have that word in the title? In the name of a tag on that note? In a PDF embedded in the note?

We decided that we should match words that you see if you open the note in its own window … so all of those examples should match. That means that we index a combination of: note text content, note title, tag names, PDF text (up to 1MB of text), and OCR results.

For the first few months after launch, we tried to implement text search within MySQL itself using MyISAM’s fulltext indexing. We concatenated the relevant texts together in a special table with a FULLTEXT index on the big text blob column.

Unfortunately, we found that the performance of MyISAM’s FTS engine wasn’t adequate for our application. When users create or update notes, they expect those notes to immediately match any text searches. This means that text indexing is basically synchronous for our application … we can’t batch up index updates to run every hour. Updates to MyISAM’s FTS perform poorly in concurrent environments with lots of small changes coming in from different threads. If one user added a note and flushed the index, this would block another user from adding a note until the indexing completed.

We tried a few things to batch updates in MyISAM, but finally gave up and switched everything to Lucene.

Each Evernote user has their own Lucene search index in a separate directory on the file system. When someone modifies their notes, or performs a search against their account, we open their index and perform the updates and searches against that index.

The basic text search functionality was pretty straightforward in Lucene, but there were a number of things we had to do to improve overall performance.

First, we denormalize all searchable and sortable data for each note into the Lucene index rather than just storing the raw search text. This allows us to just ask Lucene for the list of notes matching a set of criteria without doing an in-memory merge with separate data from MySQL.

For example, the note’s “last update time” and “notebook” are stored in each Lucene document. If you have an account with 60,000 notes, and you want to find 50 notes with “spatula” in your “Cooking” notebook sorted by last update time starting at offset 100, we can perform that full query directly against Lucene and directly receive the correct 50 note identifiers. If the notebook and update time were only in MySQL, we would potentially need to compare tens of thousands of results from MySQL and Lucene to find the desired 50.
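To illustrate the benefit, here is a toy sketch (not Evernote's code) where a plain Java list stands in for the per-user index. Because notebook and update time live in each document, one query can filter, sort, and page in a single pass — no merge against a second store. The record and method names are invented.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative only: each "document" carries the denormalized fields
// (notebook, update time) alongside its searchable terms.
public class DenormalizedQuery {

    record NoteDoc(int id, String notebook, long updated, Set<String> terms) {}

    // Filter by term and notebook, sort newest-first by update time,
    // then apply offset/limit paging -- all against one store.
    static List<Integer> search(List<NoteDoc> index, String term,
                                String notebook, int offset, int limit) {
        return index.stream()
                .filter(d -> d.terms().contains(term))
                .filter(d -> d.notebook().equals(notebook))
                .sorted(Comparator.comparingLong(NoteDoc::updated).reversed())
                .skip(offset)
                .limit(limit)
                .map(NoteDoc::id)
                .collect(Collectors.toList());
    }
}
```

If the notebook and update time lived only in MySQL, the equivalent operation would have to fetch candidate note IDs from both systems and intersect them in memory before paging.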

We also keep user indices open in our server until the index has been idle for too long (currently 65 seconds) rather than opening and closing the index with each request. We have a dedicated maintenance thread that closes idle indexes sequentially, waiting for each to complete any merges before continuing. We found this was necessary to keep from spiking disk activity while closing multiple indices.
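The caching logic might look roughly like this sketch. The real code would be managing Lucene index handles whose `close()` blocks on segment merges; here a placeholder stands in so the timing logic is visible. The 65-second threshold comes from the post; everything else (class and field names, map-based bookkeeping) is assumed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an idle-index cache with a sequential
// maintenance pass, as described in the post.
public class IndexCache {
    static final long IDLE_CLOSE_MS = 65_000; // threshold from the post

    static final class UserIndex {
        volatile long lastUsed;
        UserIndex(long now) { lastUsed = now; }
        void close() { /* real code flushes and waits for merges here */ }
    }

    final Map<Integer, UserIndex> open = new ConcurrentHashMap<>();

    // Get (or open) the index for a user and mark it recently used.
    UserIndex acquire(int userId, long now) {
        UserIndex ix = open.computeIfAbsent(userId, id -> new UserIndex(now));
        ix.lastUsed = now;
        return ix;
    }

    // Maintenance pass: close idle indexes one at a time, so merge work
    // from multiple closes never piles up on the disk at once.
    int closeIdle(long now) {
        int closed = 0;
        for (Map.Entry<Integer, UserIndex> e : open.entrySet()) {
            if (now - e.getValue().lastUsed > IDLE_CLOSE_MS) {
                e.getValue().close(); // blocks until merges complete
                open.remove(e.getKey());
                closed++;
            }
        }
        return closed;
    }
}
```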

Finally, we perform a fairly elaborate pipeline of transformations on both note text and search expressions in order to normalize the representation of the text for correct comparisons. We have Lucene analysis filters that operate on the sequence of tokens to:

  • Remove apostrophes and other intra-word punctuation
  • Convert upper-case letters to lower case
  • Remove English “stop words” like “the” and “and”
  • Normalize letters with diacritics so that “ñ” becomes “n”
  • Convert “narrow width” Japanese characters to “full width”
  • Reorganize Chinese/Japanese/Korean text into pseudo “words”
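A few of these transforms can be shown in a simplified, pure-Java sketch. The real pipeline is a chain of Lucene analysis filters operating on a token stream; this just demonstrates the effect of the apostrophe/punctuation, case, stop-word, and diacritic steps on whitespace-split tokens. The stop-word list is deliberately tiny, and the width-folding and CJK segmentation steps are not sketched here.

```java
import java.text.Normalizer;
import java.util.*;
import java.util.stream.*;

// Illustrative only: applies a subset of the normalization steps listed
// above to a naive whitespace tokenization.
public class NormalizeSketch {
    // Truncated stop-word list for illustration.
    static final Set<String> STOP = Set.of("the", "and", "a", "of");

    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                // remove apostrophes and other intra-word punctuation
                .map(t -> t.replaceAll("\\p{Punct}", ""))
                // convert upper case to lower case
                .map(t -> t.toLowerCase(Locale.ROOT))
                // fold diacritics: decompose (NFD), then drop combining marks,
                // so "ñ" becomes "n"
                .map(t -> Normalizer.normalize(t, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}", ""))
                // drop stop words and empty tokens
                .filter(t -> !t.isEmpty() && !STOP.contains(t))
                .collect(Collectors.toList());
    }
}
```

Running the same pipeline over both stored note text and incoming search expressions is what makes the comparison correct: "Chef’s Piñata" in a note and a search for "pinata" normalize to the same token.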

Overall, we’ve been happy with the power and flexibility of Lucene. We have found, however, that it has become the most expensive software component in our shard infrastructure. On a busy shard, Lucene makes twice as many IO operations as MySQL, and those operations are less sequential. This means that it’s the top priority for future software optimizations, and also for future hardware tuning to ensure that our shards continue to scale well as we grow.
