Chances are good that you have more notes in your Evernote account than can be counted on the fingers of one hand. Luckily, even if you have tens of thousands of notes, that one note you are looking for is just one search away.
Often, one or two keywords are enough to get a good list of relevant notes, even from the most unorganized account. And even if you don’t search yourself, Evernote can find related results for you, whilst writing a new note, or when browsing the Web.
To make this work at the scale of millions of users and terabytes of data, our server-side search component is built on top of a rock-solid foundation, Apache Lucene, the high-performance search engine library.
Being avid long-time users, we have tweaked and improved Lucene over the past couple of years to suit our needs, and it has proven to be a reliable component in our core infrastructure.
Despite its reliability, we found that our customized Lucene 2.9 installation had become the most expensive software component in our shard infrastructure, responsible for twice as many IO operations as our SQL database.
It became clear that we had to improve the situation, as the pressure on the search system increases the more features we add to it and the more often people use Evernote in general.
Apart from the ostensible goal of just getting search to go faster, we really wanted to enable ourselves to move faster. That is, we wanted to be able to roll out new features faster than before, with less headache about the implications of reindexing millions of users’ accounts.
Previously, we might have avoided changes to the index structure if we could somehow achieve the same effect at runtime, even if that meant higher-than-necessary CPU and IO costs. Reindexing meant significant server downtime, and that would have been worse.
We solved this dilemma by separating the What from the How
In fact, we are in the fortunate situation that we do not have “one” Big Data problem; we have many “Small Data” problems.
Each user’s data lives in its own special silo and has its own, dedicated Lucene index. Which means that, theoretically, nothing stops us from running different search engine implementations per user. And that gives us a good path to migrating our users’ indexes without taking down a whole shard.
So we rolled our own index versioning scheme. We refactored our server-side search code and separated the implementation (the “How”) from the general contract (the “What”). With this change, Lucene 2.9 technically was no longer a core component of our search infrastructure. It was “just” one implementation of a little Java interface called UserIndex. A UserIndexManager class is now responsible for wrapping up the internals: “give me a UserIndex for user #123” is all we need to know; the rest is implementation-specific.
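To make the split between contract and implementation concrete, here is a minimal sketch. Only the names UserIndex and UserIndexManager come from the description above; the methods on the interface, the in-memory implementation, and the factory-based constructor are assumptions made for illustration and are not our production code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

// The "What": a small contract. The two methods are hypothetical.
interface UserIndex {
    void addNote(String noteId, String text);
    List<String> search(String query);
}

// One possible "How": a toy in-memory index standing in for a Lucene-backed one.
class InMemoryUserIndex implements UserIndex {
    private final Map<String, String> notes = new LinkedHashMap<>();

    public void addNote(String noteId, String text) {
        notes.put(noteId, text.toLowerCase(Locale.ROOT));
    }

    public List<String> search(String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : notes.entrySet()) {
            if (e.getValue().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

// Hides which implementation a user gets; callers only ask for "the index of user N".
class UserIndexManager {
    private final Map<Integer, UserIndex> open = new ConcurrentHashMap<>();
    private final IntFunction<UserIndex> factory; // assumed: picks the implementation per user

    UserIndexManager(IntFunction<UserIndex> factory) {
        this.factory = factory;
    }

    UserIndex forUser(int userId) {
        return open.computeIfAbsent(userId, factory::apply);
    }
}
```

With this shape, swapping the engine for a single user is a change to the factory only; code that calls forUser(123) never sees which implementation it received.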
What this really means is that we can now have different implementations of “search” for each user, based on whatever exciting new path we want to explore. And whenever we think a new version of search is ready for prime time, we can now upgrade one user after another, without any server downtime.
Internally, we maintain two properties for each index, the base type (e.g., Lucene 2.9) and an internal version number to describe the index schema we defined on top of it. We can configure a “desired” type and version that we want to migrate to on a per-user and per-shard basis. Index Migration takes the shard’s CPU and IO utilization into account and only executes when enough resources are available. If so, it builds a new search index on disk, based on the user’s note data stored in MySQL, which takes less than a minute for the average user.
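The decision logic described here can be sketched roughly as follows. The enum values, the class names, and the load thresholds are illustrative assumptions; the only facts taken from the text are that each index has a base type plus a schema version, and that a migration only runs when the shard has CPU and IO headroom.

```java
import java.util.Objects;

// Base type of an index; the constants are illustrative.
enum IndexType { LUCENE_2_9, LUCENE_4 }

// The two properties tracked per index: base type and schema version.
final class IndexVersion {
    final IndexType type;
    final int schemaVersion;

    IndexVersion(IndexType type, int schemaVersion) {
        this.type = Objects.requireNonNull(type);
        this.schemaVersion = schemaVersion;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IndexVersion)) {
            return false;
        }
        IndexVersion other = (IndexVersion) o;
        return type == other.type && schemaVersion == other.schemaVersion;
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, schemaVersion);
    }
}

// Hypothetical policy: migrate only if the index is outdated and the shard is not busy.
final class MigrationPolicy {
    private final double maxCpuLoad; // assumed threshold, fraction in [0, 1]
    private final double maxIoLoad;  // assumed threshold, fraction in [0, 1]

    MigrationPolicy(double maxCpuLoad, double maxIoLoad) {
        this.maxCpuLoad = maxCpuLoad;
        this.maxIoLoad = maxIoLoad;
    }

    boolean shouldMigrate(IndexVersion current, IndexVersion desired,
                          double cpuLoad, double ioLoad) {
        boolean outdated = !current.equals(desired);
        boolean hasHeadroom = cpuLoad < maxCpuLoad && ioLoad < maxIoLoad;
        return outdated && hasHeadroom;
    }
}
```

Comparing the full (type, version) pair means a schema change on the same Lucene version triggers a reindex in exactly the same way as a change of the underlying engine.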
In the figure below, we show how a typical index migration currently looks in the beginning and at the end. Initially, a lot of high-volume users get migrated in parallel; they search frequently, and it is important to migrate them first. At a certain high-water level of busy indexer threads, we hold back more indexes from being migrated until we again reach a low threshold of active threads. Eventually, we reach the smaller, less active accounts, which take less time to reindex, and so we need fewer threads in parallel. We are experimenting with a better selection of users to reduce the number of parallel threads, but in general this approach works pretty well for us. Should a reindexing operation fail for whatever reason (e.g., a JVM crash), it will be resumed automatically.
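The high-water and low-water behavior is a simple hysteresis. A minimal sketch, with a hypothetical class name and made-up thresholds, might look like this:

```java
// Hypothetical gate that decides whether another index migration may start.
final class MigrationThrottle {
    private final int lowWater;   // resume once busy threads drop to this level
    private final int highWater;  // pause once busy threads reach this level
    private boolean paused;

    MigrationThrottle(int lowWater, int highWater) {
        if (lowWater > highWater) {
            throw new IllegalArgumentException("lowWater must not exceed highWater");
        }
        this.lowWater = lowWater;
        this.highWater = highWater;
    }

    // Returns true if a new migration may be scheduled right now.
    synchronized boolean mayStart(int busyIndexerThreads) {
        if (busyIndexerThreads >= highWater) {
            paused = true;
        } else if (busyIndexerThreads <= lowWater) {
            paused = false;
        }
        // Between the two marks, the previous state is kept.
        return !paused;
    }
}
```

Keeping the previous state between the two marks avoids rapid flapping between "start more" and "hold back" when the thread count hovers around a single threshold.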
Now that we could easily migrate to a new index and implementation, we needed a way to benchmark different implementations against each other.
Since the “UserIndex” contract was nothing more than an interface, we crafted a small benchmarking implementation of UserIndex that can wrap any other UserIndex implementation. Using this “shim” code, we can easily observe the per-call timings as well as the per-method disk IO. For the latter, we have built a custom Lucene Directory implementation that records the number of disk reads and writes. We tunnel the corresponding UserIndex method signature to that code using a ThreadLocal, which gives us a very detailed view of which UserIndex method (read: API call) consumes most of our resources.
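As an illustration of the idea, the sketch below wraps an arbitrary interface with a timing proxy and uses a ThreadLocal to attribute IO to the contract method that caused it. It uses java.lang.reflect.Proxy for brevity, which is an assumption; the real shim is a hand-written UserIndex implementation. NoteSearch is a hypothetical stand-in for UserIndex, and recordRead stands in for the hook that the custom Lucene Directory would call.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Stand-in for the UserIndex contract; the single method is an assumption.
interface NoteSearch {
    List<String> search(String query);
}

final class BenchmarkShim {
    // Name of the contract method currently executing on this thread.
    static final ThreadLocal<String> CURRENT_CALL = new ThreadLocal<>();
    static final Map<String, LongAdder> NANOS = new ConcurrentHashMap<>();
    static final Map<String, LongAdder> BYTES_READ = new ConcurrentHashMap<>();

    // Called by the storage layer; charges the read to the current API call.
    static void recordRead(long bytes) {
        String call = CURRENT_CALL.get();
        String key = (call == null) ? "<untracked>" : call;
        BYTES_READ.computeIfAbsent(key, k -> new LongAdder()).add(bytes);
    }

    // Returns a proxy that times every call and publishes the method name.
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> contract, T delegate) {
        return (T) Proxy.newProxyInstance(
            contract.getClassLoader(),
            new Class<?>[] {contract},
            (proxy, method, args) -> {
                String previous = CURRENT_CALL.get();
                CURRENT_CALL.set(method.getName());
                long start = System.nanoTime();
                try {
                    return method.invoke(delegate, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the delegate's own exception
                } finally {
                    NANOS.computeIfAbsent(method.getName(), k -> new LongAdder())
                         .add(System.nanoTime() - start);
                    if (previous == null) {
                        CURRENT_CALL.remove();
                    } else {
                        CURRENT_CALL.set(previous);
                    }
                }
            });
    }
}
```

Because the method name travels in a ThreadLocal, the storage layer needs no extra parameters to know which API call it is serving, as long as the work stays on the calling thread.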
With the new versioning scheme, it was relatively easy to just upgrade a couple of shards to Lucene 4.2, which was just around the corner when we started the project. Without making any changes to the index schema and overall behavior, we could check how 4.2 performed compared to 2.9.
Migrating from our custom Lucene 2.9 installation to Lucene 4 alone already showed a significant reduction in disk IO (most of it reads, of course) and in CPU. Thanks to our benchmarking code, however, we found that the most significant improvement could be achieved by changing the way we normalize note lengths when scoring related results (a longer note is not necessarily more related, even if it is more likely to contain the same words as some other, short note). Previously, we used Lucene’s function queries to influence the ranking for very long notes, which turned out to be a performance bottleneck. We switched to a custom Similarity implementation that takes care of properly discounting the effect of long notes.
All in all, by using Lucene 4 and the other modifications, we were able to reduce the overall disk IO created by the search component by a factor of 5.
The improvement to scoring related results alone already reduces total disk IO by more than 60%, whereas basically all other search operations saw IO reductions of between 30% and 90% of their corresponding share of total disk IO.
In context, we can literally avoid several terabytes of disk reads per week, just by switching to the new codebase.
Switching everything to Lucene 4
The results essentially spoke for themselves. The time was ripe to migrate our complete user base to the new search engine.
What looked problematic in the beginning was that, in order to provide a seamless migration, we had to run both Lucene 2.9 and 4.2 concurrently in our web application context (for performance reasons, we want to have most of our code in the same JVM). While both Lucene versions share the same Java packages and classes, they are inherently incompatible. Lucene 4 had dropped support for 2.9 indexes, and there was no way to run Lucene 4 in “legacy mode”.
Fortunately, by removing Lucene as a direct dependency of search, we were finally in the position to completely hide all Lucene code behind the UserIndex interface. Since Java allows classes with the same name to co-exist as long as they are created by different ClassLoaders, we can stow them away separately and still access them through the common interface. Experience tells us there is always an exception to the rule, and in fact we had to work around a case where Lucene would access the SystemClassLoader. After fixing this, we were ready to go.
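A rough sketch of this kind of isolation is shown below. The interface name, the loader helper, and the idea of passing jar paths and an adapter class name are all assumptions for illustration. The sketch relies only on standard JDK behavior: a URLClassLoader delegates to its parent first, so types visible to the parent (the shared interface) are common to all loaders, while classes found only in a loader's own jars stay private to it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

// Shared contract, loaded once by the application's class loader (name is illustrative).
interface IsolatedIndex {
    String engineVersion();
}

final class IsolatedIndexLoader {
    // One loader per engine version; the jars would hold that Lucene version plus its adapter.
    static URLClassLoader loaderFor(Path... jars) {
        URL[] urls = new URL[jars.length];
        for (int i = 0; i < jars.length; i++) {
            try {
                urls[i] = jars[i].toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad jar path: " + jars[i], e);
            }
        }
        // Parent is the loader that defines the shared interface.
        return new URLClassLoader(urls, IsolatedIndex.class.getClassLoader());
    }

    static Class<?> resolve(ClassLoader loader, String className) {
        try {
            return Class.forName(className, true, loader);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class not visible to loader: " + className, e);
        }
    }

    // Creates the adapter inside its own loader and exposes it only through the contract.
    static IsolatedIndex instantiate(ClassLoader loader, String adapterClassName) {
        try {
            Object adapter = resolve(loader, adapterClassName)
                .getDeclaredConstructor().newInstance();
            return (IsolatedIndex) adapter;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + adapterClassName, e);
        }
    }
}
```

For this to isolate anything, the conflicting library must not be on the parent's class path; otherwise parent-first delegation would hand every loader the same copy.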
After the migration is before the migration
As of today, all of our 50M+ Lucene indexes have been migrated to Lucene 4 (now running version 4.3), and except for our Ops team, who carefully monitored the whole undertaking, probably few people have noticed. That’s the way we like it.
For this first migration, we had some extra overhead for monitoring and carefully controlling the whole process. Overhead aside, a reindex for a complete shard typically takes around 10-15 hours, with no alarming increase in disk IO.
As I am writing this blog post, we are already preparing the next big reindex, this time with new features under the hood. If we do it right, you won’t notice this change either.
Christian Kohlschütter, Senior Search Researcher