The Great Shard Migration, Part I

Posted by Dan Hart on 04 Feb 2015

Experience, technological advancement, and data analysis have allowed us to iterate from the venerable generations of our early shard architecture to the sleek, modern v3.5 we run today.

The next challenge was to migrate all our previous generation shards to the new standard.

The Challenge: Shard Version Concurrency

Before the shard migration project, we ran four distinct versions of the architecture.  Although all shards implemented the same service, how they did so varied considerably between architecture generations.  Our earlier blog posts provide more detail on these differences.

Newer iterations of the architecture eliminated some of the pain points of previous generations; however, those pain points still manifested on the earlier deployments.  As the hardware of the earlier generations aged, it experienced more failures and hit hardware bugs that later models had fixed.

Multiple procedures had to be maintained to handle operations on different iterations of the architecture, due to critical differences such as how failovers should happen.  Our Operations engineers had to be well-versed in all of these different methods.  Documentation had to carefully note differences between generations, and runbooks had to detail procedures for each.  New tools and scripts had to account for all generations, requiring us to develop and maintain many guardrails to keep things safe.

To achieve Operational Nirvana and reduce the number of actionable pages our sacrificial on-call engineer received, it was clear we needed to standardize all the shards to our current design.

Reconnaissance

We first identified what would be required to achieve this desired end state.  Interrupting service to a given shard impacts the users housed on that shard: certain features and functionality become unavailable for the duration.  Since we operate a world-class service millions of people depend on daily, we had some very stringent requirements:

  • Only impact a very limited number of users at a time
  • Keep any outage window to under an hour
  • Lose absolutely no customer data or transactions
  • Fully test and qualify the correctness of new shards after final data migration, but before they are brought live
  • Ensure confidentiality and protection of all customer information

These were the rules for an off-peak (Saturday morning PST) maintenance window.  If we wanted to do weekday migrations, the requirements would be even stricter.

The Key Capabilities

We identified five key capabilities we needed to have to achieve successful migrations:

  1. Local bulk file resources migrated to DAV
  2. Migration of MySQL user note data
  3. Migration of Lucene search data
  4. Network changes
  5. Host changes

We opted to deploy racks of new shard containers (paired VMs on redundant hypervisors) to be used as replication targets.  How each generation of the architecture replicated their data to these targets varied in implementation details, but every shard of a previous generation would be replicating to a brand new v3.5 home.

Shard Provisioning

Before we could migrate onto the new targets, those new targets had to be built.  To do this, we extended our Automatic Memory Machine to support allocating temporary shards, and used the techniques described in our Elephant Factory post to bootstrap initial software deployment and configuration.  These existing automation tools allowed us to rapidly build the new hosts with very little user intervention.

At first, we required an operator to manually bootstrap MySQL and start replication.  However, this was quickly automated by one of our engineers, who wrote the MySQL suite of tools described below.

Res2Dav: Migrating the local resources

res2davdb Migration State Machine

Early generations of the architecture stored resources locally on each shard and its failover partner.  We have since improved on this approach and now centralize data on special-purpose resource file servers — CDAVs and RDAVs.  Before we could migrate the earlier shards to our current standard, these local resources had to be migrated.  Every local resource would be copied to two CDAVs in the same datacenter and to one remote RDAV.

To tackle this challenge, we built a suite of tools called “res2dav.”  These tools allowed us to perform the migrations in the steps outlined below, without requiring any downtime to our users.

At Evernote, we like state machines.  The picture at the top of this section is the state machine we defined for this res2dav work.

We used Redis as a central database for coordination, which worked very well for us.  We found Redis easy to set up, and it provided good visibility into key operational metrics (such as current and peak memory usage).  We fed this data into Graphite dashboards that we used to track progress.
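
The snippet below is a minimal sketch of this coordination pattern, not our actual tooling: per-resource migration state kept in a Redis hash, with progress counts pushed to Graphite's plaintext listener.  The hostnames, key names, and state names are illustrative assumptions.

```python
import socket
import time

import redis

STATES = ["frozen", "copying", "copied", "verified", "db_updated"]

r = redis.Redis(host="redis.example.internal", port=6379)

def set_state(resource_id, state):
    """Record the current migration state of one resource."""
    r.hset("res2dav:state", resource_id, state)

def progress_counts():
    """Count how many resources sit in each state."""
    counts = dict.fromkeys(STATES, 0)
    for state in r.hgetall("res2dav:state").values():
        name = state.decode()
        counts[name] = counts.get(name, 0) + 1
    return counts

def push_to_graphite(counts, host="graphite.example.internal", port=2003):
    """Emit one Graphite plaintext metric per state so dashboards can track progress."""
    now = int(time.time())
    lines = ["res2dav.progress.%s %d %d" % (s, n, now) for s, n in counts.items()]
    sock = socket.create_connection((host, port))
    sock.sendall(("\n".join(lines) + "\n").encode())
    sock.close()

set_state("resource-0001", "copied")
push_to_graphite(progress_counts())
```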

Res2Dav High-Level Process

  1. Freeze all local resource folders
    This made the local resources on the shards effectively read-only.  New data would only be written to the CDAVs and RDAVs; only existing data would continue to be served from these local resources.
  2. Copy all resources to RDAV and CDAV
    With millions of files involved in this migration, traversing the file systems would take an incredibly long time.  To avoid this, we migrated the data using direct `dd`.  It wasn’t as simple as a straight copy, however; some post-processing was required to move subdirectories to the correct locations.
    To ensure the local data on older generations was correct before promoting it to the central DAVs, we looked at three different sources: the “A” (active) side, the “B” (cold store) side, and the off-site replication (RDAV) copy.  We compared the md5sum of all three copies to ensure correctness before taking a verified version of the resource to a central DAV.
    We used mbuffer in conjunction with dd in order to rate limit the transfer, obtain inline checksums, and multiplex the receiving end (i.e., send to multiple systems concurrently).
  3. Verify all data
    We verified the data via MD5 checksums, which mbuffer generated during the copy and stored in Redis.  These MD5 checksums were compared to the on-disk contents on both the original notestore shards and the central CDAV and RDAV targets.  (A simplified sketch of this check follows the list.)
  4. Update the database metadata
    Once copies of user resource files were on the central file stores, we were able to update the database records for each user on these shards to point at the central locations instead of the local resource folders.  To do this, we wrote a tool called `res2davdb` to perform the MySQL work on each shard.
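
As promised above, here is a simplified sketch of the verification step.  The key names and paths are assumptions, not the real res2dav code: the MD5 recorded during the copy is read back from Redis and compared against the bytes actually on disk.

```python
import hashlib

import redis

r = redis.Redis(host="redis.example.internal", port=6379)

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file through MD5 so large resources never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(resource_id, path_on_target):
    """Compare the on-disk checksum with the one stored in Redis at copy time."""
    expected = r.hget("res2dav:md5", resource_id)
    actual = md5_of_file(path_on_target)
    if expected is None or expected.decode() != actual:
        raise ValueError("checksum mismatch for %s at %s" % (resource_id, path_on_target))
    return True
```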

At this point, the local resource files were no longer in use.  These volumes were ready to be wiped and destroyed.

Res2Dav Lessons Learned

We could have optimized this process better in a few ways.  We used a random distribution to select jobs; however, this still created an unequal workload.  Next time, jobs could be chosen based on better selection criteria.

We also didn’t build rack awareness into our tooling, which led to rack uplink saturation.  This issue, combined with the random job selection above, required us to manually allocate some of the work chunks to avoid contention.
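
As one sketch of the kind of rack-aware selection these lessons point toward: instead of drawing jobs at random, interleave them across racks so no single rack uplink carries a disproportionate share of the transfers.  The job and rack structure below is an illustrative assumption.

```python
from collections import defaultdict
from itertools import zip_longest

def interleave_by_rack(jobs):
    """jobs: iterable of (rack_id, job).  Returns jobs ordered round-robin across racks."""
    by_rack = defaultdict(list)
    for rack_id, job in jobs:
        by_rack[rack_id].append(job)
    ordered = []
    for batch in zip_longest(*by_rack.values()):
        ordered.extend(job for job in batch if job is not None)
    return ordered

jobs = [("rack-a", "shard-001"), ("rack-a", "shard-002"),
        ("rack-b", "shard-101"), ("rack-c", "shard-201")]
print(interleave_by_rack(jobs))  # alternates racks: shard-001, shard-101, shard-201, shard-002
```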

MySQL: Automated Replication

Updating the database metadata was only part of the migration challenge with MySQL.  We also needed a way to seed the fresh databases on the -temp targets with a recent xtrabackup full backup before real-time replication could start.

We wrote a tool named `Genesis` which, along with its companion `pushtempbackup.sh`, could automatically seed a newly-built MySQL -temp instance.  Before this tool was created, new -temp shards had to have MySQL bootstrapping done manually.
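
A rough sketch of the seeding idea follows.  This is not the actual Genesis or `pushtempbackup.sh` implementation; the hostnames, paths, and exact innobackupex invocation are assumptions.  The gist: stream a full backup from the source shard straight into the freshly built -temp instance's datadir, then prepare it so replication can be pointed at the source.

```python
import subprocess

SOURCE = "shard-042.example.internal"        # assumed source shard
TARGET = "shard-042-temp.example.internal"   # assumed -temp target
DATADIR = "/var/lib/mysql"

def seed_temp_instance():
    # Stream the backup from the source host straight into the target's datadir.
    stream_cmd = (
        "ssh %s 'innobackupex --stream=tar /tmp' "
        "| ssh %s 'tar -ixf - -C %s'" % (SOURCE, TARGET, DATADIR)
    )
    subprocess.check_call(stream_cmd, shell=True)
    # Apply the transaction log so the copied datadir is consistent before MySQL starts.
    subprocess.check_call(["ssh", TARGET, "innobackupex --apply-log %s" % DATADIR])

seed_temp_instance()
```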

MySQL Automated Replication

Once seeded, the -temp instances had to be kept in sync with the master databases.  To do this, we developed a suite of tools named `esc-mysql` (Evernote Service Control, MySQL).  This suite allowed us to do several things in a safe, repeatable manner, with lots of guard rails and Splunk logging built in:

  • Generate MySQL dumps using a single transaction
  • Transfer a MySQL dump to a new destination host
  • Destructively import a MySQL dump
  • Configure and start replication of a MySQL instance
  • Verify proper replication functionality of a MySQL instance
  • Update migration status of all database instances

These tools leveraged Fabric for orchestration.  Much like Res2Dav, this suite used a state engine to keep track of progress.
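
Below is a condensed sketch of how Fabric can drive two of the replication steps listed above.  The task names, host layout, and SQL details are assumptions for illustration; the real `esc-mysql` suite is internal and does far more (guard rails, Splunk logging, state tracking).

```python
from fabric.api import env, run, task

@task
def start_replication(master_host, log_file, log_pos):
    """Point the local MySQL instance at its master and start replicating."""
    # Replication credentials are omitted for brevity.
    change_master = (
        "CHANGE MASTER TO MASTER_HOST='%s', "
        "MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;" % (master_host, log_file, log_pos)
    )
    run('mysql -e "%s"' % change_master)
    run('mysql -e "START SLAVE;"')

@task
def verify_replication():
    """Fail loudly if either replication thread is not running."""
    status = run('mysql -e "SHOW SLAVE STATUS\\G"')
    if "Slave_IO_Running: Yes" not in status or "Slave_SQL_Running: Yes" not in status:
        raise RuntimeError("replication is not healthy on %s" % env.host_string)
```

A run against a -temp host would look something like `fab -H shard-042-temp start_replication:master_host=shard-042,log_file=mysql-bin.000123,log_pos=4`, with the binlog coordinates taken from the seeded backup.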

MySQL Migration States

Once these tools were built, they had to be qualified.  This involved many trial dry runs across a variety of test environments.  We defined tables of what we needed to test, what our expected results were, and what the actual results showed.  This data allowed us to close in on our final timing estimates.

Lucene Tooling

The Evernote service uses Lucene tables for search and metadata.  In all versions of our architecture, this metadata lives locally on each shard.  In later architectures, this data is stored on fast SSDs for optimal access speed.  In order to migrate to a new shard, these Lucene databases had to be copied over and kept up to date.

We built a tool named `migrate_lucene` to perform this work.  This tool can report on the Lucene replication status of any migration task, set up sync rules for a new migration, start a sync process, and validate it.  The tool also includes a confidence function which plots statistical deviations on a ministat ASCII chart, letting us see at a glance whether there are any concerns or red flags.
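
`migrate_lucene` itself is internal, but the sync-and-validate idea can be sketched roughly as follows, with assumed paths and hostnames: mirror the on-disk index directory to the new shard with rsync, then compare file listings on both sides as a basic sanity check.

```python
import subprocess

INDEX_DIR = "/var/lib/lucene/"               # assumed location of the local index
TARGET = "shard-042-temp.example.internal"   # assumed -temp target

def sync_index():
    """Mirror the index directory onto the new shard, deleting stale files."""
    subprocess.check_call(
        ["rsync", "-a", "--delete", INDEX_DIR, "%s:%s" % (TARGET, INDEX_DIR)]
    )

def listing(host=None):
    """Return the sorted list of file names under the index directory, locally or via ssh."""
    if host is None:
        out = subprocess.check_output(["ls", INDEX_DIR])
    else:
        out = subprocess.check_output(["ssh", host, "ls %s" % INDEX_DIR])
    return sorted(out.decode().split())

def validate():
    """A coarse check that both sides see the same index files."""
    if listing() != listing(TARGET):
        raise RuntimeError("index file listings differ between source and %s" % TARGET)

sync_index()
validate()
```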

Lucene Migration Status

Automating the Network and Host Plumbing

Besides these data pieces, a lot of network plumbing and host glue had to be modified to facilitate the migration.  For example, the hostnames of shards and hypervisors had to be changed away from their -temp names.  Load balancers had to be pointed at the new shards.  DNS had to be updated.

To this end we wrote automated tooling to interact with Puppet, DNS, Nagios, and many other systems.  Below is a brief list of some of the tasks the migration process involved:

  • Disable old nodes / Enable new nodes on the Load Balancer
  • DNS changes
  • Host network and name changes (/etc/hosts, /etc/HOSTNAME, etc.)
  • Revoke old Puppet certificates; re-issue new ones
  • Console / BMC changes
  • Nagios monitoring changes
  • Full verification of changes
  • Kick off backups on new shards
  • Full documentation updates

The final bullet item deserves special mention:  Updating our documentation was considered just as critical as making sure the hosts came up at all.  We listed updating each specific doc as its own step in the runbook, and did not conclude any migration activity until all documentation had been updated.
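
As one small illustration of the "full verification of changes" task: after cutover we could confirm that the shard's DNS name resolved to the new host and that the -temp name was gone.  The real checks spanned the load balancer, Puppet, Nagios, and more; the names and addresses below are assumptions.

```python
import socket

def resolves_to(name, expected_ip):
    """True if the DNS name resolves to the address we expect after cutover."""
    return socket.gethostbyname(name) == expected_ip

def is_gone(name):
    """True if the old -temp name no longer resolves at all."""
    try:
        socket.gethostbyname(name)
    except socket.gaierror:
        return True
    return False

assert resolves_to("shard-042.example.internal", "10.0.42.10")
assert is_gone("shard-042-temp.example.internal")
```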

Testing

Manually converting shards and hypervisors without the tooling we created would likely have taken several hours each.  With our tooling, we were able to perform all migration tasks in under ten minutes.

Just performing the migration was not good enough for us, however.  We do not entrust the health and stability of our service to faith; we needed a full battery of tests and verifications to ensure correctness.

These validation activities represented the bulk of each migration.  Before moving past the Point of No Return — the point at which customer data would be written to the new shards — we had to ensure all our tasks had succeeded.

To accomplish this, we kicked off a full failover of each shard after all our tasks were complete, but before any live traffic was sent.  This exercised MySQL, Lucene, Tomcat, DRBD, the VM configuration, and the hypervisor itself, allowing us to fully prove the resiliency of each system before placing any user data onto it.

Conclusion

Over 100 individual commands would have needed to be orchestrated across dozens of systems to perform a single shard migration.  Trying to perform such a feat by hand might well make King Sisyphus think rolling a boulder up a hill was a vacation by comparison.

Even with all this tooling, successful migrations would still require sophisticated coordination.  Some validations could be done in parallel with other migration work.  Some migration tasks could be processed in tandem.  We were gearing up to operate on the service with scalpel precision while it continued to run as a whole.

But how long would each migration take, and how many engineers would be required?  What outage windows would we need?  Would we be able to complete the entire migration project within the limited time frame we were given?  And would we be able to get anything else done while tackling this?

In The Great Shard Migration, Part II, we will look at how the migrations were tactically performed, cover some of the issues we ran up against, and dive into the tools and methods our Operations team used to make this migration a reality.
