In our architectural overview post last May, we gave a high-level description of the “shard” servers we use for both data storage and application logic. Since Evernote is a personal memory service rather than a social network, we can easily partition individual user data onto separate shards, which gives us fairly straightforward linear scalability. Each pair of these “shard” servers runs two virtual machines.
Each shard virtual machine stores transactional “metadata” in a MySQL database on a RAID-1 pair of 300GB Cheetah 15krpm drives. A separate RAID-10 of 7200rpm Constellation 3TB drives is split into partitions for storing large files and per-user Lucene text search indices. The paired virtual machines replicate each of these partitions from the current “Primary” VM to the current “Secondary” VM using synchronous DRBD.
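For readers who haven't used DRBD, a resource definition for one replicated partition looks roughly like the sketch below. The hostnames, devices, and addresses here are made-up placeholders; “protocol C” is what makes the replication synchronous, so a write isn't acknowledged until it has reached both halves of the pair.

```
resource metadata {
  protocol C;            # synchronous: writes complete on both nodes
  on shard-a {           # hypothetical primary host
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on shard-b {           # hypothetical secondary host
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```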
These shards have enough storage and IO capacity to comfortably handle 100,000 registered Evernote users for at least four years, with empty drive bays in their 4U chassis for later expansion as needed. Dual L5630 processors and 48GB of RAM bring the cost of each box to around $10,000, with a power draw of around 374 watts each. That works out to roughly $0.10 of hardware and 3.7mW of power per registered user.
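For the curious, those per-user figures are just the box cost and power draw spread across a full shard's worth of users:

```python
# Back-of-the-envelope per-user figures for one previous-generation shard.
box_cost_usd = 10_000      # approximate cost of one 4U shard server
box_power_w = 374          # approximate power draw of one shard server
users_per_shard = 100_000  # registered users assigned to each shard

print(f"${box_cost_usd / users_per_shard:.2f} per user")          # ~$0.10
print(f"{box_power_w / users_per_shard * 1000:.1f} mW per user")  # ~3.7 mW
```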
Room for Improvement
This generation of shard hardware has given us pretty good price/performance, with the extremely high level of data redundancy that we require for our users. But we did find several areas where this design wasn’t ideal for our purposes. For example:
- The 15krpm drives for the MySQL database are usually 95% idle, since InnoDB does a good job with caching and IO serialization, but we hit occasional bottlenecks when users with huge accounts first access their data. If their metadata isn’t already in InnoDB’s RAM buffers, the resulting random IO workload can grind the disks heavily.
- The Lucene search indices for our users generate a lot more IO than we expected. We see twice as many read/write operations on the Lucene partition as we do on the MySQL partition. This is largely caused by our usage patterns: every time a note is created or edited, we need to update the owner’s index and flush the changes to disk so that they will take effect immediately.
- DRBD is great for replicating one or two small partitions, but it’s a hassle for large numbers of big partitions per server. Each partition needs to be independently configured, managed, and monitored. Various problems can occasionally force a full resync of every volume, which may take many hours even over a dedicated 1Gbps crossover cable (see the rough estimate below).
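To give a sense of scale for that last point, here is a back-of-the-envelope lower bound. It assumes the crossover link runs at its full theoretical rate, which a real resync rarely sustains:

```python
# Rough lower bound on the time to resync one large DRBD partition
# over a dedicated 1 Gbps crossover link.
partition_tb = 3.0                   # one 3TB partition on the bulk RAID-10
link_gbps = 1.0                      # dedicated crossover link

bytes_to_copy = partition_tb * 1e12
bytes_per_sec = link_gbps * 1e9 / 8  # ~125 MB/s at line rate

hours = bytes_to_copy / bytes_per_sec / 3600
print(f"~{hours:.1f} hours per 3TB partition")  # ~6.7 hours, before any overhead
```

Multiply that by several partitions per server and a full resync easily eats most of a day.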
These constraints were the main factors limiting the number of users we were willing to assign to each shard. Improving the manageability and metadata IO performance would allow us to increase our user density safely. We’re addressing these issues in our next generation of shards by moving metadata storage onto SSDs and moving the logic for redundant bulk file storage out of the OS and into our application.
Our new design replaces the racks of ten 4U servers with racks that mix fourteen 1U metadata+app shards with four 4U bulk file storage servers.
The 1U shard heads run a pair of simpler virtual machines, each using a single partition on a single RAID-5 of 300GB Intel 320 SSDs. Those two partitions are replicated with DRBD, and each VM image runs on only one server at a time. The SSDs are overprovisioned to 80% of their capacity to significantly increase write endurance and IO throughput. We include a hot SSD spare in each box rather than using RAID-6, which avoids that RAID level’s additional 15% performance penalty; rebuild times will be short, and DRBD replication covers us against hypothetical multi-drive failures.
Bulk resource file storage has been moved from local disks in the primary servers into pools of dedicated WebDAV servers running large RAID-6 file systems. Whenever a new resource file is added to Evernote, our application synchronously writes a copy of that file to two different file servers in the same rack before the metadata transaction completes. Offsite redundancy is also handled in the app, which replicates each new file to an offsite WebDAV server through an asynchronous background thread.
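As a rough illustration of that write path (this is not our actual code; the URLs, helper names, and error handling are placeholder assumptions), the application-level replication boils down to two synchronous HTTP PUTs followed by a queued offsite copy:

```python
import queue
import threading

import requests  # plain HTTP client; WebDAV writes are just PUTs

offsite_queue: "queue.Queue[tuple[str, bytes]]" = queue.Queue()

def store_resource(resource_id: str, data: bytes) -> None:
    """Write a new resource file to two local WebDAV servers, then queue an
    offsite copy. Only after both local PUTs succeed does the caller commit
    the metadata transaction."""
    local_servers = ["https://dav1.rack.example", "https://dav2.rack.example"]
    for server in local_servers:
        resp = requests.put(f"{server}/resources/{resource_id}", data=data)
        resp.raise_for_status()             # abort the transaction on any failure
    offsite_queue.put((resource_id, data))  # offsite copy happens asynchronously

def offsite_replicator(offsite_server: str) -> None:
    """Background thread that drains the queue to an offsite WebDAV server."""
    while True:
        resource_id, data = offsite_queue.get()
        requests.put(f"{offsite_server}/resources/{resource_id}", data=data)

threading.Thread(target=offsite_replicator,
                 args=("https://dav.offsite.example",), daemon=True).start()
```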
This new design has enough IO capacity and storage to handle 200,000 users per shard head for at least four years of use. The rack of 14 shard heads and 4 file servers costs around $135k and draws around 3900 watts, which works out to about $0.05 and 1.4mW per user.
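Same arithmetic as before, now spread across a full rack:

```python
# Per-user figures for one rack of next-generation hardware.
rack_cost_usd = 135_000        # 14 shard heads + 4 bulk file servers
rack_power_w = 3_900
users_per_rack = 14 * 200_000  # 14 shard heads at 200,000 users each

print(f"${rack_cost_usd / users_per_rack:.2f} per user")          # ~$0.05
print(f"{rack_power_w / users_per_rack * 1000:.1f} mW per user")  # ~1.4 mW
```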
The new design reduces both the number of servers we’ll need going forward and per-user power consumption on the primary servers by roughly 60%. Factoring in power consumption from other service components (switches, routers, load balancers, image processing servers, etc.), we estimate an overall 50% reduction in per-user power draw compared to our previous architecture. All of this translates into long-term reductions in our hosting costs.
I’m hesitant to make claims that smell like corporate greenwashing. [Insert picture of Phil hugging a baby harp seal.] But this 50% reduction in per-user power consumption will also reduce Evernote’s carbon footprint by a proportionate amount.
In addition to these specific savings, the process of evaluating and testing solutions has given us a better understanding of the various components and technologies we’re using. We plan to do a few more posts with some of the details of our testing and optimization for SSD RAIDs, Xen vs. KVM IO throughput, DRBD management, and more. Since some of our results are a bit counter-intuitive, we hope they’ll provide useful information for other folks building storage-intensive services.