Evernote Tech Blog

The Care and Feeding of Elephants

Shard Boiled

In our architectural overview post last May, we gave a high-level description of the “shard” servers we use for both data storage and application logic. Since Evernote is a personal memory service rather than a social network, we can easily partition individual user data onto separate shards to allow for fairly straightforward linear scalability. Each pair of these “shard” servers runs two virtual machines:

Old Shard Architecture

Each of these shard virtual machines stores transactional “metadata” within a MySQL database operating on a RAID-1 pair of 300GB Cheetah 15krpm drives. A separate RAID-10 array of 7200rpm Constellation 3TB drives is split into partitions for storing large files and per-user Lucene text search indices. The paired virtual machines replicate each of these partitions from the current “Primary” VM to the current “Secondary” VM using synchronous DRBD.

These shards have enough storage and IO capacity to comfortably handle 100,000 registered Evernote users for at least 4 years, with empty drive bays in their 4U chassis to expand later as needed. Adding dual L5630 processors and 48GB of RAM brings the cost of each box up to around $10,000 with a power draw of around 374 watts each. I.e. around $0.10 of hardware and 3.7mW per registered user.
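
The partitioning itself is deliberately simple: each account is pinned to one shard, and requests are routed there with a straightforward lookup. Here is a minimal sketch of that kind of user-to-shard directory. The names and structure are illustrative only, not our actual implementation, and a real deployment would back the mapping with a durable store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of directory-based user partitioning -- illustrative
// names and structure, not production code.
public class ShardDirectory {
    private final Map<Long, String> userToShard = new ConcurrentHashMap<>();
    private volatile String currentShard = "shard-001"; // shard currently accepting new accounts

    // New users land on whichever shard is currently being filled.
    public String assignNewUser(long userId) {
        userToShard.put(userId, currentShard);
        return currentShard;
    }

    // Every later request for that user is routed to the same shard.
    public String shardFor(long userId) {
        return userToShard.get(userId);
    }

    // Once a shard reaches its target user count, start filling the next one.
    public void rollToNextShard(String nextShardHost) {
        currentShard = nextShardHost;
    }
}
```

Because each user’s data lives entirely on one shard, adding capacity mostly means racking more shards and pointing new registrations at them.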

Room for Improvement

This generation of shard hardware has given us pretty good price/performance, with the extremely high level of data redundancy that we require for our users. But we did find several areas where this design wasn’t ideal for our purposes. For example:

  1. The 15krpm drives for the MySQL database are usually 95% idle since InnoDB does a good job with caching and IO serialization, but we hit occasional bottlenecks when users with huge accounts first access their data. If their metadata isn’t already in the RAM buffers, the random IO workload may grind the disks heavily.
  2. The Lucene search indices for our users generate a lot more IO than we expected. We see twice as many read/write operations on the Lucene partition as we do on the MySQL partition. This is largely driven by our usage pattern: every time a note is created or edited, we need to update the owner’s index and flush the changes to disk so that the edit is searchable immediately (see the sketch after this list).
  3. DRBD is great for replicating one or two small partitions, but it’s a hassle for large numbers of big partitions per server. Each partition needs to be independently configured, managed, and monitored. Various problems may occasionally force a full resync of all volumes, which may take many hours even over a dedicated 1Gbps crossover cable.
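
To make item 2 above concrete, here is roughly the shape of a per-user index update. This is an illustrative sketch against a recent Lucene API, not our production indexing code; the point is the commit on every note edit, which is what generates the extra IO.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class NoteIndexer {
    // Re-index a single note in the owner's per-user index.
    // (Illustrative: a real indexing path would keep the IndexWriter open
    // rather than reopening it for every edit.)
    public static void indexNote(String userIndexDir, String noteGuid, String noteText) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(userIndexDir));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("guid", noteGuid, Field.Store.YES));
            doc.add(new TextField("content", noteText, Field.Store.NO));

            // Replace any previous version of this note in the index.
            writer.updateDocument(new Term("guid", noteGuid), doc);

            // Flush to disk so the edit is searchable immediately --
            // this per-edit commit is what drives the Lucene IO described above.
            writer.commit();
        }
    }
}
```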

These constraints were the main factors limiting the number of users we were willing to assign to each shard. Improving manageability and metadata IO performance would let us increase user density safely. We’re addressing these issues in our next generation of shards by moving metadata storage onto SSDs and moving the logic for redundant bulk file storage out of the OS and into our application.

Shard Target

Our new design replaces the racks of ten 4U servers with racks that mix fourteen 1U metadata+app shards with four 4U bulk file storage servers.

The 1U shard heads run a pair of simpler virtual machines, each using a single partition on a single RAID-5 array of 300GB Intel 320 SSDs. Those two partitions are replicated with DRBD, and each VM image runs on only one server at a time. The SSDs are overprovisioned so that only 80% of their capacity is used, which significantly increases write endurance and IO throughput. We include a hot SSD spare with each box rather than using RAID-6: that avoids RAID-6’s additional ~15% performance penalty, rebuild times will be short, and DRBD replication covers us against hypothetical multi-drive failures.

Bulk resource file storage has been moved from local disks in the primary servers into pools of dedicated WebDAV servers running large RAID-6 file systems. Whenever a new resource file is added to Evernote, our application synchronously writes a copy of that file to two different file servers in the same rack before the metadata transaction completes. Offsite redundancy is also handled in the app, which replicates each new file to an offsite WebDAV server through an asynchronous background thread.
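
In rough terms, the application-side write path looks like the sketch below. Hostnames and error handling are simplified placeholders rather than our actual code; the part that matters is the ordering: both in-rack WebDAV PUTs must succeed before the metadata transaction is allowed to commit, while the offsite PUT happens asynchronously in the background.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only -- server names are placeholders, not our
// actual topology or code. Shows the write ordering described above:
// two synchronous in-rack WebDAV PUTs, then an async offsite PUT.
public class ResourceStore {
    private final HttpClient http = HttpClient.newHttpClient();

    public void storeResource(String path, byte[] data) throws Exception {
        // 1. Synchronously write two in-rack copies; fail the whole operation
        //    (and the surrounding metadata transaction) if either PUT fails.
        for (String host : List.of("https://dav1.rack.example", "https://dav2.rack.example")) {
            put(host + path, data);
        }

        // 2. Queue the offsite copy on a background thread; the metadata
        //    transaction does not wait for this one.
        CompletableFuture.runAsync(() -> {
            try {
                put("https://dav-offsite.example" + path, data);
            } catch (Exception e) {
                // Real code would retry / alert; a lost offsite copy must be re-replicated.
            }
        });
    }

    private void put(String url, byte[] data) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(data))
                .build();
        HttpResponse<Void> resp = http.send(req, HttpResponse.BodyHandlers.discarding());
        if (resp.statusCode() / 100 != 2) {
            throw new RuntimeException("WebDAV PUT failed: " + url + " -> " + resp.statusCode());
        }
    }
}
```

Keeping this redundancy logic in the application, rather than in DRBD, is what lets us drop the large replicated file partitions described earlier.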

Results

This new design has enough IO capacity and storage to handle 200,000 users per shard head for at least four years of use. At 14 shard heads per rack, that’s roughly 2.8 million users per rack; the rack of 14 shards and 4 file servers costs around $135k and draws around 3900 watts, which works out to about $0.05 and 1.4mW per user.

This reduces the number of future servers we’ll need and cuts per-user power consumption on the primary servers by 60%. Factoring in the power drawn by other service components (switches, routers, load balancers, image processing servers, etc.), we estimate an overall 50% reduction in per-user power draw compared to our previous architecture. All of this translates into long-term reductions in our hosting costs.

I’m hesitant to make claims that smell like corporate greenwashing. [Insert picture of Phil hugging a baby harp seal.] But this 50% reduction in per-user power consumption will also reduce Evernote’s carbon footprint by a proportionate amount.

In addition to these specific savings, the process of evaluating and testing solutions has given us a better understanding of the various components and technologies we’re using. We plan to do a few more posts with details of our testing and optimization for SSD RAIDs, Xen vs. KVM IO throughput, DRBD management, etc. Since some of our results are a bit counter-intuitive, we hope to provide some useful information for other folks building storage-intensive services.

27 Comments

  1. Hi Dave, thanks for another great post.

    Could you please explain why/how you decided on WebDAV (over HTTP, of course) vs. a file-based protocol like NFS for access to the file system where note attachments are stored?

    Part B of the question is: which WebDAV implementation are you running?

    Finally, I understand that a user’s data sits on multiple VMs and multiple disks, but I’m curious whether ALL data is formally backed up to some other storage?

    Thanks!

    • Good questions, Matt.

      Since we’re only doing “whole file” operations, the performance of WebDAV is comparable to NFS, but it requires less setup on each shard. We’re also doing offsite replication in the same code, and WebDAV has better failure modes for networking problems than NFS, which can require more tuning at the OS level to prevent lockups.

      We’re currently using Apache+mod_dav as a simple baseline, but we’re hitting an occasional hiccup in Apache (500 internal server error) that we’re in the process of debugging. I.e. I wouldn’t recommend mod_dav yet.

      Yes, all databases are backed up nightly to a second data center using Percona’s Xtrabackup. Resource files on the older systems were also backed up nightly (via tar -> NFS -> rsync). The new systems replicate to the backup IDC in near realtime via WebDAV. Once all systems are using WebDAV, we plan to increase the frequency of our database backups to reduce the theoretical data loss window in the event of a catastrophe at the primary IDC.

  2. Where’s my harp seal?

  3. Does this mean larger individual note sizes? And larger monthly limits?

    Where’s Phil’s Harp Lager?

    • Our limits are more based on networking and client behavior than raw disk storage capacity on the servers. I.e. we need to make sure that notes work properly on all clients that you might view them on and synchronize to the service, even for lower-powered mobile devices with limited memory, etc.

      • Thank you for this interesting article and for your answers.

      • Mmm, so note size is governed by the class of lowest-common-denominator client that will access it… I think my Blackberry Bold would commit suicide if I tried to download a 50MB note to it over 3G right now.

      • We’ve supported clients that can’t display certain things, but we want people to feel confident that anything that they put into “Evernote” will work well in “Evernote”, for reasonable definitions of the word “Evernote.” :-)

  4. I didn’t understand ANY of this. How long before the Evernote hoverboard?

  5. Very cool, Dave.

    They may not have existed in a production-ready state back when you guys started, but I’m curious whether you evaluated using a NoSql DB like CouchDB before settling on MySQL? They would seem to be a good fit for the type of usage that a platform like Evernote requires (not sure about the hoverboard support, though).

    • MySQL databases are serving our current runtime needs: http://blog.evernote.com/tech/2012/02/23/whysql/
      But we’d certainly consider using alternate solutions for future projects.

  6. Hey Dave! Thanks for the elaborate and informative blog post. I am indeed happy to know that my favourite software service also cares about its carbon footprint and mother nature in general :)
    I don’t really understand all the hardware specifics you have used. Would you kindly give a simpler explanation, for us technically inclined folk (but not so much towards hardware)? You could post a flowchart or video detailing what happens when a user saves or edits notes, NOW as against EARLIER.

    Appreciate the effort! Thanks! :)

    • Ok, I don’t have a quick version of this, so will plan to post some more details in the future.
      But at a thumbnail level:

      OLD: when you uploaded a note, the “metadata” structure of that note went into a MySQL DB on your shard, which was constantly replicated to a second box for redundancy. Any resources/images/files/PDFs/etc. in that note were written to local file systems on drives in that box, which were also replicated to the same second box.

      NEW: when you upload a note, the “metadata” structure still goes into MySQL, and that’s still replicated to a second box via DRBD, but the storage is now on SSDs instead of 15krpm disks. The resource files are pushed to separate pools of WebDAV file servers … two copies on local servers, and one copy on a DR server in a different city.

  7. Dave,
    Thanks so much for sharing details. I’m an infrastructure SE myself and I appreciate the additional details you provide. I’m familiar with ALL of the equipment, so it helps me not only to understand and appreciate the back-end operations of my favorite tool, but also appreciate the great efforts you take to optimize for both users and costs. While most don’t fully understand the details, I appreciate you posting them at the risk of an avalanche of questions, and not being shy about the way things work at Evernote.

  8. Great article, I really appreciate the look inside the Evernote “kitchen”. Nice use of Xen with DRBD and MySQL. So you don’t use any MySQL clustering features? I like the KISS idea, because MySQL clustering makes things more complicated in large environments.

    Can you tell something about the amount of data stored in MySQL?

  9. Hi Dave,

    Thanks for a great post. My background is in infrastructure/sys administration.
    I have a couple of questions:

    1. Why host both file servers in the same rack, and what happens when power is lost
    during a synchronous write?

    2. I’m not clear on how the data is backed up. Is the data just replicated off-site, or
    is it replicated, and the replicated data is then backed up?

    Regards,

    Darren

  10. I am sure you are familiar with Xeround http://xeround.com.
    It seems like it might offer some benefits for Evernote, as it is a NoSQL database that is robust, self-healing, and self-scaling, and requires no code changes. Happy to make an intro to the CEO if you like.

    • Currently, we’re putting shards and their WebDAV servers in the same rack so that all of the file IO goes through the top-of-rack switches and none of it goes through the core switches. This keeps the file traffic from competing with service traffic on the core, especially if we ever need to rebuild a WebDAV server that fails completely. But we may change this in the future, or just migrate WebDAV servers out when they become “full.”

      Resource files are written twice locally and once offsite, each on RAID6 arrays. The offsite copy is the “backup” copy since we never delete from the offsite array. The MySQL databases are replicated locally and then backed up offsite nightly via Xtrabackup and rsync copying.

  11. Dave, thanks for the awesome post. I enjoy reading the post very much.

    I am wondering how you decided to run the service in virtual machines, given that you control the hardware and the I/O cost of a VM may be high? Is it because a VM is easier to manage?

    Also, what file system is running on top of the SSDs? Do you use a log-structured file system or an ordinary file system, e.g. ext4 or xfs? Or could MySQL’s storage engine make better use of the SSDs?

    Thanks.
    Regards,
    Peter

    • We started with pairs of physical boxes, but found the replication management to be more of a hassle, and the CPU utilization was more wasteful since 50% of the boxes were sitting completely idle at all times. Virtualized server images are easier to manage, since the entire “shard” can be replicated over DRBD and brought up on a second box.

      We did a fair amount of testing to find a combination of kernel and virtualization platform to minimize the IO overhead. We’ll try to put together a post with some of these details. Short version: kernel 3.2.10 on Xen provided significantly better local random r/w IO to the SSD RAID than earlier kernels and/or KVM. We also tested various file systems and decided to go with ext4.

  12. Hi Dave,

    Very insightful blogs into the world of Evernote and plug ins in general. I was just wondering if you have written a concise report on the functional description of Evernote?

    Thanks,

    Jamie.

  13. It’s been more than a year now. How do the SSD drives hold up? I heard stories about multiple SSD failures in a single array but wonder if it’s real or just a legend. Have you found a better alternative to apache + mod_dav for WebDAV yet?

      The Intel 320s have worked out better than I expected. I can only recall one failure across hundreds of drives so far. We were worried about endurance, but we’ve checked their endurance level with the low-level drive commands, and it looks like they’ll last for 5+ years at our read/write volume. (I.e. longer than we need.)

      Those drives are basically EoL at this point, so we’ve started using the newer Intel DC S3500 models. They seem to perform slightly better, so nothing too exciting there. They have the capacitor to flush the RAM buffer to flash, which I believe is a hard requirement for serious usage.
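
      If you’re curious what that endurance check looks like: one straightforward way (an illustrative sketch, not necessarily our exact tooling) is to read the SMART Media_Wearout_Indicator attribute that these Intel drives expose through smartctl:

      ```java
      import java.io.BufferedReader;
      import java.io.InputStreamReader;

      // Illustrative sketch: read the SMART "Media_Wearout_Indicator" attribute
      // that Intel SSDs expose, via smartctl. Device path and parsing are
      // simplified; not necessarily the exact check described above.
      public class SsdWearCheck {
          public static void main(String[] args) throws Exception {
              String device = args.length > 0 ? args[0] : "/dev/sda";
              Process p = new ProcessBuilder("smartctl", "-A", device).start();
              try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                  String line;
                  while ((line = r.readLine()) != null) {
                      if (line.contains("Media_Wearout_Indicator")) {
                          // Column 4 (VALUE) starts at 100 and counts down toward the threshold.
                          String[] cols = line.trim().split("\\s+");
                          System.out.println(device + " wear indicator: " + cols[3] + "/100");
                      }
                  }
              }
              p.waitFor();
          }
      }
      ```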

