Shards Over Easy

Posted by Dan Hart on 24 Oct 2014

In previous Operations architectural posts, we discussed an overview of the Evernote service, as well as a deeper dive into the improvements we’ve made to our NoteStore “shard” server infrastructure.  These NoteStore shards provide the virtual home for our users, storing and processing their content metadata and search indices, and providing efficient, secure access to the Evernote service.

State of the Elephant

Our NoteStore shard architecture provides linear scalability that has served us well as the number of people who use Evernote increased from 10 million in 2011 to over 100 million today.  In 2011 we maintained ~90 shards; today this has grown to over 500.  As new hardware became available and our architectural design iterated and improved, we were able to increase the user density from a soft limit of 100,000 users per shard to just over 200,000.  This soft limit is designed to be conservative and ensure the best possible user experience, even during sustained peak traffic and unexpected surges.

Thanks to a well-planned and precisely-executed migration that took place earlier this year, all of our NoteStore shards are now running what we refer to as “version 3.5” of the architecture.  Prior to this migration, our Operations team supported several iterations of this infrastructure.  Dave’s “Shard Boiled” post essentially details what was version 3.0 of this architecture.  Version 3.5 is very similar, with a few tweaks and improvements.

A notable change is the switch from in-rack to centralized “DAV” fileservers.  In the v3.0 architecture, fourteen 1U metadata+app NoteStore shards were racked alongside four 4U bulk DAV fileservers, which provided file resource storage using the WebDAV protocol.  The NoteStore shards would write two copies of this resource data to the “DAV” fileservers in the same rack, and one copy to an RDAV (Remote DAV) server located in a different datacenter.  For version 3.5, we have migrated away from in-rack DAV fileservers and instead leverage a central pool of WebDAV systems, which we call “central” DAV, or CDAV.  As a result, we now place twenty 1U metadata+app NoteStore shards in a rack for v3.5, versus fourteen for v3.0.

Current NoteStore Shard Infrastructure

[Diagram: v3.5 shard architecture]

Shards constitute the core of the Evernote service.  Each shard handles all data and traffic (web and API) for a set number of users.  Currently, we feel comfortable housing over 200,000 registered Evernote users per shard and maintain over 500 highly available pairs of shard hypervisors.

Each NoteStore shard runs as a VM on top of a Xen hypervisor.  Both hypervisor and application VM run the Debian Linux distribution.  We use DRBD to replicate the logical volume that constitutes the NoteStore shard VM to a partner hypervisor.  Each hypervisor has sufficient resources to comfortably run two NoteStore shard VMs at once.

We deploy solid-state drives (SSDs) in a RAID5 volume on these NoteStore shards to serve the MySQL database and Lucene indexes.  We maintain a spare drive for these volumes to ensure a higher level of reliability without sacrificing the performance of a second parity disk.  This configuration allows for a speedy experience as people search and work with their notes.  On the DAV servers, we run large spinning disk drives in a RAID6 configuration to maximize storage.

Mapping users to shards remains primarily the job of our load balancer layer, as described in the first architectural post.  Our core application stack also remains fairly similar:  Debian + Java 7 + Tomcat 7 + Hibernate + Ehcache + Stripes + GWT + MySQL.
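
To make the routing idea concrete, here is a minimal, hypothetical Java sketch of the lookup that layer performs in concept: resolve a user’s home shard ID and forward the request to the host serving that shard.  The class, the host-naming convention, and the in-memory map standing in for the real routing data are illustrative assumptions, not our production code.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: route a user's request to their home NoteStore shard.
    // The real routing lives in the load balancer layer; all names are illustrative.
    public class ShardRouter {

        // userId -> home shard ID (in production this mapping comes from UserStore)
        private final Map<Long, Integer> homeShardByUser = new ConcurrentHashMap<Long, Integer>();

        public void learnMapping(long userId, int shardId) {
            homeShardByUser.put(userId, shardId);
        }

        // Resolve the backend host that serves this user's shard.
        public String backendFor(long userId) {
            Integer shardId = homeShardByUser.get(userId);
            if (shardId == null) {
                throw new IllegalStateException("No home shard known for user " + userId);
            }
            // Illustrative host-naming convention only.
            return "shard-" + shardId + ".internal.example.com";
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter();
            router.learnMapping(42L, 123);
            System.out.println(router.backendFor(42L)); // shard-123.internal.example.com
        }
    }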

The virtualization and replication architecture, however, has been substantially optimized.  We are no longer dependent on Heartbeat or any other CRM; we now replicate the VM’s entire block device and fail over the entire logical system from one hypervisor to the other, instead of migrating cluster services individually.  This provides a simple, robust mechanism for invoking predictable shard failovers.

We do not operate any snowflake systems.  All shards are built by the Automatic Memory Machine, while hypervisor and shard configurations are managed by Puppet and backed by a CMDB inventory system.

In order for Work Chat to provide an ideal, low-latency experience, we are leveraging WebSockets at our Tomcat webapp and load balancer layers.
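
As a rough illustration of that approach (and not the actual Work Chat code), a minimal JSR-356 endpoint of the kind Tomcat 7 can host looks something like the sketch below; the endpoint path and the echo behavior are placeholders.

    import java.io.IOException;
    import javax.websocket.OnMessage;
    import javax.websocket.Session;
    import javax.websocket.server.ServerEndpoint;

    // Minimal JSR-356 WebSocket endpoint of the kind Tomcat 7 can host.
    // The path and the echo behavior are illustrative, not the Work Chat implementation.
    @ServerEndpoint("/chat")
    public class ChatEndpoint {

        @OnMessage
        public void onMessage(String message, Session session) throws IOException {
            // A real chat service would fan the message out to the other participants;
            // here we simply echo it back over the same persistent, low-latency connection.
            session.getBasicRemote().sendText("echo: " + message);
        }
    }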

Beyond the Shard

Although a critical, core part of the service, NoteStore shards are only part of the complete Evernote experience.  Our Operations team manages several other farms of servers, ranging from file storage to mail processing.

Since we manage all aspects of our service, from the application containers and operating systems down to the physical components, cabling, and PDUs, we have a great deal of flexibility in designing, tuning and operating a robust, efficient service.

DAV Servers

These 4U WebDAV servers run Debian and store multiple redundant copies of images, PDFs, office files, and other such bulk resources.  DAV functionality is provided by Apache and mod_dav.

Today we manage more than 100 primary CDAV servers, as well as over 30 RDAV servers that immediately replicate content off-site.  These RDAV systems are located in a datacenter that is geographically separate from our primary datacenter.
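
Because mod_dav accepts plain HTTP PUT for uploads, writing a resource copy to several DAV servers is conceptually simple.  The sketch below is a hypothetical illustration of that idea only: the hostnames, the path layout, and the use of java.net.HttpURLConnection are assumptions, not how the NoteStore actually performs these writes.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical sketch: store one resource on several WebDAV servers via HTTP PUT.
    public class DavWriter {

        static void putResource(String host, String path, byte[] body) throws IOException {
            URL url = new URL("https://" + host + path);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            int status = conn.getResponseCode();
            if (status != 200 && status != 201 && status != 204) {
                throw new IOException("PUT to " + host + " failed with HTTP " + status);
            }
        }

        public static void main(String[] args) throws IOException {
            byte[] resource = "example attachment bytes".getBytes("UTF-8");
            String path = "/dav/notes/123/attachment.pdf"; // illustrative layout
            // Multiple copies, e.g. two CDAV hosts plus one off-site RDAV copy,
            // mirroring the replication pattern described earlier in the post.
            String[] hosts = {"cdav-01.example.com", "cdav-02.example.com", "rdav-01.example.com"};
            for (String host : hosts) {
                putResource(host, path, resource);
            }
        }
    }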

AIR Processors

In order to allow you to search for words found within PDFs and images inside your notes, we maintain over 45 “Advanced Image and Recognition” (AIR) servers which dedicate their many cores to text recognition and indexing using software developed by our R&D team.  On an average day in 2014, this translates into over 6 million separate documents processed, vs. ~1.3 million in 2011.

In addition to this main farm, we also maintain a smaller pool of servers dedicated to business card scanning.  These systems run the same AIR software, but are tailored specifically to business cards.

UserStore

While the vast majority of all user data is stored within the NoteStore shards and CDAV bulk resource storage, a small amount of information is centralized in a single master “UserStore” database.  This database stores necessary routing details about each account, such as username, home shard ID, etc.

The UserStore is a pair of high-end servers running a MySQL database on top of DRBD replication.  We also maintain a “UserStore Replica” instance, replicated off of these servers, so that we can run maintenance queries and other operational tasks without impacting the live primary database.

Couchbase, Hibernate, and Ehcache (Oh My!)

In order to prevent the 500+ shards from continuously talking to the UserStore, we leverage multiple layers of caching.  We use Hibernate in the NoteStore application to abstract database interactions, and utilize two separate Ehcaches local to each NoteStore shard.  In addition, we maintain a Couchbase farm of six caching servers that sit in front of UserStore.
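
To sketch how those layers fit together (conceptually only; the class and tier names below are assumptions, and a plain map stands in for the real Ehcache and Couchbase clients), a home-shard lookup checks the shard-local cache first, then the shared Couchbase tier, and only falls through to the UserStore database when both miss:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Conceptual sketch of the layered lookup described above; names are illustrative.
    public class LayeredShardLookup {

        // A remote tier we can ask for a user's home shard (Couchbase farm, UserStore DB).
        interface Tier {
            Integer getHomeShard(long userId);
        }

        // Tier 1: cache local to the NoteStore shard (a map stands in for Ehcache here).
        private final Map<Long, Integer> localCache = new ConcurrentHashMap<Long, Integer>();
        private final Tier couchbaseFarm;   // Tier 2: shared caching tier in front of UserStore
        private final Tier userStoreDb;     // Tier 3: the authoritative UserStore database

        LayeredShardLookup(Tier couchbaseFarm, Tier userStoreDb) {
            this.couchbaseFarm = couchbaseFarm;
            this.userStoreDb = userStoreDb;
        }

        public int homeShardFor(long userId) {
            Integer shard = localCache.get(userId);          // 1. shard-local hit?
            if (shard == null) {
                shard = couchbaseFarm.getHomeShard(userId);  // 2. shared Couchbase tier
            }
            if (shard == null) {
                shard = userStoreDb.getHomeShard(userId);    // 3. hit UserStore itself
            }
            if (shard == null) {
                throw new IllegalStateException("No home shard recorded for user " + userId);
            }
            localCache.put(userId, shard);                   // repopulate the local cache
            return shard;
        }
    }

The point of the layering is simply that the vast majority of lookups terminate at the first or second step, so the 500+ shards rarely need to touch the single UserStore master.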

MessageStore

The newest addition to our service, the MessageStore farm uses MySQL and InnoDB to provide the back-end message storage for Work Chat.
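
Purely as an illustration of that sort of write path (the JDBC URL, table, and columns below are invented for the example and are not the MessageStore schema), persisting a chat message to a MySQL/InnoDB table might look like:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Hypothetical sketch of storing a chat message in a MySQL/InnoDB table over JDBC.
    public class MessageWriter {

        public static void storeMessage(long threadId, long senderId, String body)
                throws SQLException {
            String url = "jdbc:mysql://messagestore.internal.example.com/messages"; // illustrative
            String sql = "INSERT INTO message (thread_id, sender_id, body, sent_at) "
                       + "VALUES (?, ?, ?, NOW())";
            try (Connection conn = DriverManager.getConnection(url, "app", "secret");
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setLong(1, threadId);
                stmt.setLong(2, senderId);
                stmt.setString(3, body);
                stmt.executeUpdate();
            }
        }
    }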

Other Services

Our Operations team also maintains several smaller farms of servers that perform miscellaneous tasks, such as administration functions and smaller-footprint services (e.g., NTP, Puppetmaster, incoming and outgoing mail gateways, centralized log storage and search, WordPress blogs, etc.).

Our architectural mandate is to keep everything in configuration management and prevent the existence of snowflake systems.  Even if we are only deploying a single instance, that deployment must conform to architectural standards and be predictable and repeatable.

Finding Operational Nirvana

Data safety and operational reliability are paramount to our decision processes.  Our entire Operations team participates in the on-call rotation, and we all aim to make the service as safe and easy to operate as possible.

Configuration changes are made via Puppet, staged in a lower environment, and fully canaried before being rolled out to the full environment.  We use Git for all of our version control needs, and are actively working to streamline and improve our SDLC even further.

We employ Ansible to orchestrate service releases, perform sysadmin tasks such as switching which JVM is in use or upgrading system packages, and run ad-hoc commands across the entire fleet.

Our CMDB system allows us to isolate specific, runtime-determined sets of hosts on which to run operations with Ansible.  With the CMDB, we are also able to abstract service variables and secrets outside of the source repository itself.

It is paramount that our Operations team have as much situational awareness about the service as possible, and have access to sufficient tools to handle whatever might come up, whether it be a DDoS attack, fallout from increasing SSL key size, or an unexpected invasion from R’lyeh.

To detect and respond to any such eldritch thing, we leverage Nagios + PagerDuty as our alerting engine.  To give us greater insight into our environment than stock Nagios provides, we use NagUI as our primary monitoring interface.  Tools such as Splunk feed alerts into PagerDuty, and many of our infrastructure processes implement passive Nagios checks to ensure correctness.  We also maintain an elephant’s army of Graphite dashboards, capacity reports, and other tools to gain as much visibility into the service as we can.
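
As one small example of what a passive check can look like in practice (the command-file path, host, and service names here are assumptions for illustration), an infrastructure job can report its own result by appending a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file:

    import java.io.FileWriter;
    import java.io.IOException;

    // Hypothetical sketch: submit a passive Nagios check result from a batch job.
    public class PassiveCheck {

        // Typical Debian nagios3 command-file location; adjust for the local install.
        private static final String NAGIOS_CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd";

        // returnCode: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
        public static void submit(String host, String service, int returnCode, String output)
                throws IOException {
            long now = System.currentTimeMillis() / 1000L;
            String line = String.format("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s%n",
                    now, host, service, returnCode, output);
            try (FileWriter writer = new FileWriter(NAGIOS_CMD_FILE, true)) {
                writer.write(line);
            }
        }

        public static void main(String[] args) throws IOException {
            submit("shard-123", "nightly_backup", 0, "backup completed in 42 minutes");
        }
    }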

Conclusion

The Evernote infrastructure has evolved and matured significantly since our first few architectural posts, but we remain as committed as ever to providing the most robust, efficient, and above all reliable service that we possibly can.

Comments

  • Swaroop

    Great walkthrough! One question – I read this part twice but I’m still not clear on _why_ the shift from in-rack DAV in “3.0” to centralized DAV servers in “3.5”?

    • Dan Hart

      The in-rack DAV systems let us take a step forward while preserving the “everything in a single rack” design we had from previous architectures; however, we knew that this would impose a limit on capacity expansion. The v3.5 centralized approach means an additional out-of-rack dependency, but it provides significant flexibility, more even data distribution, and room for an extra 1.7M users in each rack.