When we designed our synchronization protocol way back in 2007, we wanted to make sure that a client could use a minimal number of network requests to find all of the content in the user’s account, or only of the relevant changes since the last sync. We chose a “non-locking object-state-based replication scheme based on an account-specific Update Sequence Number (USN) for every object in our heterogeneous, sometimes-cyclic data model”. A client can just say “Hey, what’s new?” and our servers will give back all of the relevant changes in the user’s Notes, Tags, Notebooks, Resources, etc., usually in a single HTTPS response.
This has worked well as a wire protocol, but our original service implementation had some scalability challenges in the last six years as we grew from two empty shards in February 2008 to more than four hundred shards containing billions of notes today. It can take a lot of IO operations to find the correct information to give the client just the right set of objects for its next SyncChunk.
In our original design, every shard stored the metadata for around 100,000 accounts within a MySQL database running on a RAID1 of 15krpm spinning disks. Each object type was stored in its own table, with a synthetic primary key and a secondary index to efficiently find entries by user, sorted by USN. For example, here are the relevant bits of our ‘notes’ table:
CREATE TABLE notes ( id int UNSIGNED NOT NULL PRIMARY KEY, guid binary(16) NOT NULL, user_id int UNSIGNED NOT NULL, notebook_id int UNSIGNED NOT NULL, title varchar(255) NOT NULL, update_sequence_number int UNSIGNED NOT NULL, ... KEY user_id_usn_idx (user_id, update_sequence_number), UNIQUE guid_idx (guid) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The three keys in this table all make it possible to find a single entry, or a range of entries, in efficient (logarithmic) time. But actual rows are clustered on disk, ordered by primary key. To find multiple rows by a secondary index (e.g. user_id, above), the database must first traverse the secondary index, and then follow those pointers to the actual disk pages where each separate row resides.
When a client asked the service “Give me up to 100 objects from my account with a USN higher than 8476”, the user’s shard would have to check up to 9 different tables for relevant entries and then merge those results together to return only the 100 entries with the lowest matching USNs. This was implemented as a single massive SQL UNION, which would crawl the user_id secondary index on each table and then load the primary pages for each matched row. In a worst-case scenario, this could potentially require thousands of non-sequential pages to be loaded from disk to satisfy one request.
That’s a ton of expensive I/O, which translated into slow replies for our clients. This problem was multiplied by the rapid growth in notebook sharing by individuals and businesses, which require separate client queries for each notebook. Our ops team bought us time by transitioning our database from spinning disks to SSDs, but the root problem required a software optimization.
Index to the Rescue
To make synchronization more efficient, we wanted to avoid hitting a dozen different tables to determine what to put into each sync chunk. To support our growth in shared notebook and business usage, as well as implement new features such as contentClass sync, we wanted to place enough information in one table to minimize the unnecessary data brought back from MySQL into our Java middle layer.
Consider further that we had to query more rows from the entity tables (notes, tags, etc.) than we would return. To return three entries from notes and tags, we would query three <note ID, USN> pairs and three <tag ID, USN> pairs and then sort by USN to find the entries to return. Round trips to the database are expensive and so querying row by row in hopes of reading only the ones you need can increase the cost. And when MySQL is asked to filter on values from the entity tables, additional table joins are often required.
So we embarked on our quest to design one table to index them all: the “sync index” table. This one table would be queried to determine the primary keys of the candidate NoteStore entities to read for further consideration. Many entities are efficiently filtered within MySQL by accessing a single table and treating each row independent of any other values in the database.
We chose a primary key of <user_id, USN> so that a user’s sync index rows would be clustered on disk in the order needed for synchronization. A single database page read returns rows representing entities of various types of efficient filtering. The rows are designed to be small to pack as many into a single I/O operation as possible. The information needed per row includes:
Expunged status: Was the entity expunged from the service at the given USN?
Active status: Linked notebook synchronization hides inactive notes. If a note becomes inactive, we send an “expunge”, or skip the note if not including expunged GUIDs in the chunk.
Entity type and object ID: We need to know the type (tag, note, etc.) and object identifier that the row refers to so we query the service entity. An entity is either active, inactive, or expunged at any one time. We therefore combine the entity type (note, tag, etc.) with a “modifier” (active, inactive, expunged, etc.) into a single 1-byte field for compactness.
Containing Notebook ID: Most permissions in Evernote are granted at the Notebook level. For linked notebook and business synchronization, the notebook identifier allows us to filter most notes and resources without accessing those tables.
Content class: Applications such as Evernote Hello and Food define a contentClass on notes and can synchronize only those notes matching a common prefix. To save space, we use a 32-bit hash to approximately match the first 32 characters rather than storing the entire content class.
The above covers the “current” state of an entity. Some historical state is also needed. When synchronizing in a context allowing access to a proper subset of notebooks in the account, the service will send an “expunge” event when a note is moved from a notebook to which you have access into one to which you do not. We capture this state as “moved” records in the sync index, adding “moved” to our entry type modifier values mentioned above. We also add a column to record the previous notebook ID.
There is also new state that was not previously captured. Many of our clients currently synchronize a business account one linked notebook at a time. This allowed them to re-use existing, proven linked notebook logic to sync and detect lost access to a notebook but increases the round-trips to the service proportional to the number of notebooks that a user has joined in the business. The sync index now records historical records of “lost access” by a recipient to a notebook. The service can now send an “expunge” informing the client that access has been lost. We again use the entry type modifier to record “lost access” and add a field for the user ID of the recipient who lost access. Evernote clients will switch to synchronizing all of their notebooks from a business in one pass.
The resulting table is thus defined as shown below. Nullable fields use only a bit when the value is null.
CREATE TABLE IF NOT EXISTS sync_index ( user_id int UNSIGNED NOT NULL, update_sequence_number int UNSIGNED NOT NULL, entry_type tinyint NOT NULL, notebook_id int UNSIGNED, grave_notebook_id int UNSIGNED, object_id int UNSIGNED, guid binary(16), content_class_hash int UNSIGNED, recipient_id int UNSIGNED DEFAULT NULL, service_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL, PRIMARY KEY (user_id, update_sequence_number), KEY objectid_entrytype_idx (object_id, entry_type) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
For the programmers reading this, one obvious question might have come to mind: how is the sync_index state, which is replicated from various service entity tables, going to be kept up to date with those tables? The answer was to use a Hibernate interceptor. All of the changes to our service entities go through Hibernate. The interceptor code that we wrote takes care of the denormalization and hashing. If you change a note attribute, the interceptor will check to see if it was contentClass and update the note and resource rows of the sync index table accordingly. If you move a note from one notebook to another, the interceptor verifies whether a new “moved” historical record should be created for the previous notebook, whether a prior historical record for the now-current notebook should be removed, and will update the notebook identifiers on all note and associated resource rows in the sync index to allow proper filtering on notebook ID for linked notebook and business synchronization. Changes to shared notebook records result in creating or removing lost access rows. All of this happens behind the scenes when our engineers work on the service API code. They don’t need to think about what to do … it happens automatically.
Filling the Sync
Once we finished the code to properly store new and updated entries in the sync_index, we had to find a way to load the prior six years of history into the sync index to permit new clients to synchronize old accounts. It was harder than we expected to do this without affecting users or producing any incorrect entries. After a few weeks of trial-and-error, we came up with the following recipe:
For each table (notes, notebooks, etc.), SELECT … INTO OUTFILE with the desired columns for the sync index.
Use LOAD DATA INFILE to load that into the sync_index.
Go back over every table to find inserted entries that were made obsolete by runtime changes between steps #1 and #2. Dump their primary keys to disk and then LOAD them into a (small) temp table.
Use a MySQL multi-table DELETE to join that temp table against the sync_index and only remove matches.
Steps #1 and #2 were at least three times faster than the more obvious INSERT … SELECT statement, and they avoided locking any rows, which prevented ugly timeouts in the running application. Steps #3 and #4 cleaned up after the non-transactional insert, also avoiding painful locking on the big source tables from a more straightforward DELETE.
Did it Work?
Verification of our work took multiple forms. A new unit test framework was developed and used to verify the correct values in the sync index table and results of synchronization. This battery of tests runs for all of our continuous integration builds. Our system test team performed repeated regression tests on staging platforms. In the code, a number of sanity checks were added, such as verifying the state of a sync index row against the service entity when we add it to the sync chunk. We also ran the previous and current sync algorithms side by side to verify the end results correctness. Lastly, we built infrastructure and service configurations to control the introduction of the new technology into production use.
We tested the times for full synchronization of one of our accounts with 2456 personal notes and 2861 notes across 31 linked Business notebooks. In our tests, this synchronization was around 99% faster with the new sync index than before.
The user-visible improvements are reflected in the median processing times for individual sync functions. For example, this graph shows the change in median response times for calls to getLinkedNotebookSyncChunk on one shard before and after the sync_index changes:
In addition to improving the sync times for our users, these changes also significantly reduced the IO and CPU usages of our shards. This graph shows the load average on one of our eight-core shards over the last week (blue) and a random week in November (green):
This will give our servers more head room to support future features and larger accounts as our users upload more data.
We have a few other optimizations in the pipeline that should improve the performance even further, but we’re pretty excited with the improvements so far. Thanks for your patience as we continue to scale Evernote to store all of your memories!