The Great Shard Migration, Part II

Posted by Dan Hart on 08 Oct 2015

In The Great Shard Migration, Part 1, we discussed the technical tooling, diligent planning, and stringent requirements needed to operate on the Evernote service with scalpel precision.  This post will cover the actual execution of these migrations.

As detailed in the “Reconnaissance” section of Part 1, data integrity and confidence were our paramount concerns.  We needed our outage windows to stay under an hour: an hour to finalize the migration of user data onto new hardware running the newest iteration of the service.  This had to be choreographed perfectly.

There was no room for mistakes.

Tactical Tooling

In order to achieve a well-coordinated migration, we leveraged several tools.  While the actual work was automated, we wanted human gatekeepers involved in validation and orchestration.  Timing was crucial.

Communication Tools

We used Fuze video conferencing with open mics so that all members of the working team could communicate as if we were in the same room, regardless of our geographic location.  We also used a maintenance channel in our chat system to mark step completions, communicate status to external teams, and note important things that came up on the audio bridge.  This chat log served as an important transaction history to drive iterative improvements to successive migrations.

Terminal Tools

We wanted human operators to keep a close, real-time eye on automation execution, and we needed a quick way to spot anomalies.  Reviewing log output after execution was complete wasn’t ideal for two reasons.  First, it isn’t time efficient; we could be reviewing steps already completed while longer-running tasks were still in progress.  Second, it is difficult to spot differences between execution runs in log files; it is much easier with a visual tool like diff, or with multiple outputs placed side by side.

To this end, we leveraged iTerm2’s split terminal functionality.  A very useful (and incredibly dangerous) option for this functionality is to send keyboard input to all split panes simultaneously.

iTerm2 split screen example

This allowed us to simultaneously kick off multiple copies of the same scripts, sending all output to standard out for immediate review.

We created wrapper scripts and placed these into separate subfolders for each set of servers to be migrated.  The wrapper scripts used identical naming, but specified different arguments to the automation scripts they called.  Execution looked similar to this simplified example:

Example Execution

 
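To make the pattern concrete, here is a minimal, hypothetical sketch of one such wrapper.  The script names, shard identifiers, and arguments below are illustrative only; the real automation differed, but the idea was the same: every subfolder contained a wrapper with an identical name, so broadcasting a single command to all panes kicked off every set at once.

    #!/usr/bin/env python
    # sets/alpha/migrate.py -- hypothetical wrapper for one shard set.
    # Every subfolder held a wrapper with this exact name; only the
    # arguments below differed from set to set.
    import subprocess
    import sys

    SHARD = "shard-101"          # hypothetical shard name for this set
    TARGET = "host-v3-101"       # hypothetical destination host

    # Call the shared automation script.  Output is inherited, so it
    # scrolls in this pane for immediate, side-by-side review.
    rc = subprocess.call(["../automation/migrate_shard.py", SHARD,
                          "--target", TARGET])
    sys.exit(rc)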

Coordination Tools

Most of us tend to keep our working notes and planning inside Evernote itself.  As this migration predated the awesome collaboration streamlining made available by Work Chat, we used a hosted Confluence wiki for initial group discussions and collaboration.

Once we started detailing all of the individual steps and sections, and needed to track assignments and other attributes, we started using Google Sheets for its ability to structure data and auto-number rows.  Each row represented a task to be completed by a person.  Leaning heavily on Google Sheets formulas and templates to avoid manual data entry, we generated one of these spreadsheets to track each set of migrations.

Spreadsheet excerpt
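
The real tracking sheets were built with Google Sheets formulas and templates, but the underlying structure is easy to sketch.  Purely as an illustration (the column names and task list below are hypothetical), a small script that generates the same kind of auto-numbered, one-task-per-row data might look like this:

    import csv

    # Hypothetical task template; the real sheets tracked more attributes
    # (owner, expected duration, checkpoint, notes, and so on).
    TASKS = [
        ("pre-flight validation", "operator"),
        ("stop user traffic", "flight coordinator"),
        ("migrate data to new hardware", "operator"),
        ("post-migration validation", "operator"),
    ]

    def write_tracking_sheet(path, shard_sets):
        """Emit one auto-numbered row per task per shard set."""
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["#", "shard set", "task", "assigned to", "status"])
            row = 1
            for shard_set in shard_sets:
                for task, role in TASKS:
                    writer.writerow([row, shard_set, task, role, "pending"])
                    row += 1

    write_tracking_sheet("migration-tracking.csv", ["set-a", "set-b"])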

The First Trials

It was time to get our feet wet.  Before trying to coordinate an entire team through waters that were well defined but only lab tested, we thought it best to have a single person complete the first trial migrations on a small subset of canary nodes.  With those lessons learned, we could improve and streamline our process.

On a mid-December day, late in the afternoon, our test pilot attempted to migrate a first set of two shards.  These shards had been carefully picked to have the least potential impact to the least number of users.  Just over two hours later, we had good news and bad news.

The good news:  Our process had worked without a hitch.  The shards had been successfully migrated, without a single transaction or byte of customer data lost.

The bad news:  We had overrun our allotted outage window, stretching it to more than twice its planned length.

We re-grouped and discussed our next steps.  The plan seemed solid.  It just wasn’t fast enough.

True, our lone engineer had too much work to do and too much data to review.  But this inefficiency accounted for only a portion of the delay.  The bigger problem:  Some of the automation tasks were simply taking too long.

We iterated on the scripts to do things in more efficient ways, and eliminated some steps that had proved unnecessary.  However, there were many safeguards and validations we wanted to keep, despite the lengthy time they required.  Safeguarding our customers’ data integrity was paramount, overriding any other concern.

Besides tool optimization (and the earlier work of automating everything), the bulk of our timing improvements came from careful analysis and identification of which tasks could be completed in parallel, and where we needed to insert checkpoints prior to entering critical sections.
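
As a rough sketch of that structure (the step names and prompt are hypothetical, not our actual runbook), the shape of the automation was: independent preparation in parallel, a human checkpoint, then a serialized critical section.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical stand-ins for the real automation steps.
    def prepare(shard):
        print("preparing", shard)        # backups, pre-copies, validations

    def cutover(shard):
        print("cutting over", shard)     # the work inside the outage window

    def migrate_set(shards):
        # Independent preparation runs in parallel across the set.
        with ThreadPoolExecutor(max_workers=len(shards)) as pool:
            for future in [pool.submit(prepare, s) for s in shards]:
                future.result()          # re-raise any failure before going on

        # Checkpoint: a human gatekeeper confirms before the critical section.
        if input("All checks green -- enter critical section? [y/N] ") != "y":
            raise SystemExit("holding before the critical section")

        # The critical section itself stays serialized and closely watched.
        for shard in shards:
            cutover(shard)

    migrate_set(["shard-101", "shard-102"])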

Critical Section Flowchart

Our test pilot set out on another late December afternoon for round two.  The improvements were working.  Our times were decreasing.  Even with our test pilot running a one-man show, we’d been able to trim our outage to just under our one-hour goal.

The Behemoth

One of our oldest shards houses some of our largest, most tenured, and most active accounts.  As our v1 refresh hardware continued to age (and because it used the older, less efficient shard architecture), this shard was showing the most growing pains.  Performance was becoming a significant concern.

So on an early Saturday morning, our test pilot (having the most experience) set out to migrate this shard.  The rest of the migration team stood on the virtual sidelines watching, testing, and validating.  One of us was designated Flight Coordinator, communicating on behalf of the test pilot and taking notes.

Thirteen minutes into the outage we hit a critical issue.  A bug in a specific version of the NIC firmware on this hardware had been worked around in our earlier days via an update to /etc/network/interfaces.  That workaround pre-dated our configuration management and version control, and had never been persisted.

Due to this oversight, we lost seven minutes of our outage window.  We were concerned we’d miss our sixty-minute commitment.

Twenty-two minutes later the behemoth had been migrated.  Syncs were running blazing fast.  The entire migration, including the seven minutes spent dealing with the NIC bug, had been completed in forty-two minutes by a single engineer.

We took a step back to put our accomplishment into perspective.  We had now migrated the oldest shard (v1 architecture) and the youngest shard of the v2 architecture.  But we still had almost twenty-three million more users to migrate; so far we had successfully completed only about 5.57% of the total.

We’d proven our timetable against the oldest, most onerous shard with some of the biggest, most active users.  We’d proven we could successfully migrate any prior iteration of the hardware to the newest version without compromising data integrity.   We’d proven our safeguards and backups were valid and working.

It was time to roll.

Enter the Ninjas

Knowing these shard migrations would consume the vast majority of time for everyone on the migration team, we split our systems operations group in two.  Six of us would have a razor-sharp focus on the migrations.  The rest would provide air cover, taking care of day-to-day tasks and continuing to push project work forward.

We scheduled a two-hour Shard Migration Workshop to review our test pilot’s first runs and make sure we were all intimately familiar with the details.  We debated potential improvements to the runbook, and continued to iterate on the process.

Out of this meeting we decided to tackle the next set of six shards in pairs.  The following two Saturday outage windows had to be canceled, however, due to unrelated vendor issues we were worried could conflict.  Finally, in the early hours of Saturday, March 1st, our group checked in on Fuze and prepared to get started.

We worked as a team of five, one of us acting as Flight Coordinator and the others completing the work.  The first two sets of shards were completed without a hitch, taking just under forty minutes each.  We ran into a Lucene issue on the final set of shards, however, and had to roll that migration back.  Out of this Saturday maintenance window, after two cancellations, we had only been able to migrate two-thirds of our target.

We met on Monday to discuss our next steps.  The current strategy of using only weekend outage windows would drag the migration project out for a very long time.  We were risking a quagmire.

We took our data and timings to business leadership and discussed our goals and challenges.  As a result of our careful planning, documentation, and ability to achieve a minimized outage window, we were granted approval to perform most of the remaining migrations during normal business hours, late in the afternoon.  However, we were also given a strict timeframe by when the project had to be completed.  It could not drag on and disrupt upcoming goals.

In order to meet our timeframe, we had to increase our rate of migration from two shards a day to at least six.  Even if we were able to squeeze in two sets a day over two separate outage windows, we would still have to be faster.

We decided to test breaking into two teams of three engineers each, working on two sets of shards simultaneously, rather than one team working on them all.  At 2pm one early April day, we set out to double our concurrency.

Double Trouble

There was a problem with this plan:  We only had one Fuze chat room.  Cross-talk immediately became a problem between the two teams.  It was like trying to follow two conversations at once.

Each team had its own flight coordinator, who was meant to coordinate the milestones that needed to be synchronized while the other engineers completed tasks.  We took away some valuable lessons from this exercise:

  1. The more people who are simultaneously talking, the louder each individual will talk, and the less they will listen.
  2. Video conferencing software has a difficult time dealing with everyone talking on open mics at once; it’s not quite the same as everyone talking at once in the same room.
  3. Teams cannot effectively ignore noise from other teams.
  4. Teams cannot selectively listen to only their own flight coordinator and ignore the other team’s.

We quickly realized this approach was untenable; pressing forward would just march us deeper into the swamp.  So we regrouped: one team was made up entirely of engineers physically located in Redwood City, who adjourned to a private conference room, while the second team took all of the remote members and retained the Fuze conference bridge.  The flight coordinator for the RWC-local team did double duty, listening in on the Fuze meeting, relaying important events, and maintaining tight control over the critical sections.

Despite our quick recovery, this migration almost ran over our allotted window.  None of us were very happy with how the maintenance had gone, despite the successful result.  We were frazzled; we needed a better path.

And our deadline kept looming closer.

Scaling Up the Surgery

We knew we needed more parallelism, but breaking apart into multiple teams wasn’t the answer.  Even with sufficient separation between teams, we had too many critical sections that would have to be coordinated.  Breaking apart the network and other shared pieces to remove some of these critical sections would cause the tasks to take too long and exceed our outage windows.

We had to scale, and if we couldn’t do that by splitting into smaller teams, it meant that each of us would have to perform our tasks on more than one set of systems at a time.

Instead of migrating one set of servers at a time, we would migrate two sets, simultaneously performing each step against each set of servers at the same time.  We needed to achieve this while maintaining the requirement of an engineer monitoring the automation process in real time.  To this end, we greatly relied upon iTerm2’s split terminals.

After one more weekend window to make up for lost time, we were feeling pretty confident about our migrations.  We had iterated our tools and processes even further.  We all had plenty of practice now.  The process that had at first taken almost two hours for each set of shards could now be completed in about thirty-five minutes.

Instead of doing one shard set at a time, we were now doing two.  Remarkably, we were able to do this without increasing the overall outage window.  We were still holding pretty solid at around thirty-five minutes.

Despite this success, however, we were still behind schedule.  Migrating two sets at a time was just not sufficient.

Exponential Power

If we could do two sets of migrations in the same time it took us to do one, then why not four sets?  We were nervous, wondering if we were attempting to juggle more than we should.  But we were practiced, had well-documented, well-defined procedures that were nearly entirely automated and constantly iterated upon, and we had a strong desire to get to the promised land of shard architecture consolidation.  (Not to mention a deadline.)

So on May 1st, 2014, we attempted our first batch of eight servers; four sets.  Our outage window was still thirty-five minutes.

By this point, the most taxing portion of the entire process had become updating the documentation.  As of May 6th, even this step became automated.
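
We have not described the documentation tooling in detail, but the idea is simple enough to sketch.  Purely as a hypothetical illustration (the file name and columns here are invented), once a batch finished, a small script could append the completed migrations to the tracking sheet instead of a human editing it by hand:

    import csv
    from datetime import datetime

    def record_completion(sheet_path, shard, old_host, new_host, minutes):
        """Append one row of migration history; columns are hypothetical."""
        with open(sheet_path, "a", newline="") as fh:
            csv.writer(fh).writerow([
                datetime.now().strftime("%Y-%m-%d %H:%M"),
                shard, old_host, new_host, minutes, "complete",
            ])

    record_completion("migration-history.csv", "shard-142",
                      "old-4u-17", "new-1u-42", 31)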

We iterated our process every week, and were becoming increasingly practiced in our roles.  We were now on target to meet our deadlines, but we started thinking:  If we could do eight a day without breaking a sweat, why not sixteen?

So we did.  On May 8th we migrated sixteen servers in two batches of eight each.  Due to our constant iterative improvements, the outage window had actually dropped to just over thirty minutes.

We were now well ahead of schedule.

The Final Sprint

We migrated the next 48 shards over the following week, completing sixteen a day.

On May 15th we decided to ramp this up even further, targeting twenty-four shards in a single day.  This was our largest single batch.  Each of our operations engineers had monitors filled with split screens scrolling the output of each automation task for real-time review.

On May 16th, 2014, we migrated the final twenty-two shards.  When we finished, we felt a bit odd, like something was missing.  And then the realization struck:  The migration was complete!  There were no further shards to migrate.

The promised land of shard version consistency had been realized.

Conclusion

We performed our first test migrations on December 13th, 2013.  Over the next five months we completed the remaining 197 migrations.  Twenty-two racks of machines were consolidated into just ten.  We swapped 214 4U boxes for 198 shiny 1Us.

Our tooling and process had enabled us to migrate over twenty-two million users to new, better-performing homes, eliminate generations of variation, and create a truly consistent environment.

It was time to start planning the v4 Shard architecture.
