Tonight, our planned 15-20 minute service window turned into a full two-hour outage. We regret the problems this may have caused for some of you and want to provide a little more technical detail here.
As we mentioned in our Architectural Overview post last year, the hundreds of terabytes of data within Evernote user accounts are spread over hundreds of “shards,” but we have one redundant pair of servers that runs our master “UserStore” database. This database holds only a small amount of information for each registered Evernote account in order to handle authentication and commerce: usernames, passwords (salted + hashed), and so on. That works out to less than 1 kB per account, but since Evernote has over 27 million accounts, the overall size is around 25 GB.
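The arithmetic behind that size is simple. Here's a quick back-of-envelope check in Python, using the rough figures above (the exact per-account size varies; 1 kB is just an upper bound):

```python
# Sanity check of the UserStore sizing above, using the rough figures
# from this post: just under 1 kB per account, 27M+ accounts.
accounts = 27_000_000
bytes_per_account = 1_000  # upper bound: username, salted+hashed password, etc.

total_gb = accounts * bytes_per_account / 1e9
print(f"~{total_gb:.0f} GB")  # ~27 GB, in line with the ~25 GB actual size
```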
This same pair of servers had been running our UserStore database for more than three years, and it was time to upgrade them. So we set aside two of our new “shard” servers with 3x as much RAM, SSDs instead of 15krpm disks, bonded networking, an updated kernel, etc. The hardware and OS configuration on the new systems is virtually identical to the last 38 shards we’ve deployed, and our dry-run tests with the new hardware showed good performance with copies of the UserStore database.
So we scheduled a service window for tonight to bring the whole service down, copy the current UserStore database to the new hardware, bring it online, and bring the servers back up. In our dry runs, the copy took 8 minutes and the sanity tests and server juggling took another 10-15 minutes, so we aimed for a 20-minute outage.
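To make the plan concrete, here's a minimal, hypothetical sketch of that cutover sequence. The steps and timings come from this post; the helper stubs are placeholders, not our actual tooling:

```python
import time

# Hypothetical sketch of tonight's cutover plan. The step order and
# timings are from the post; these stubs are illustrative placeholders.

def stop_service():
    print("open service window: stop the shards")

def copy_database():
    print("copy UserStore to the new hardware (~8 min in dry runs)")

def sanity_tests_pass():
    print("run sanity tests against the copy")
    return True  # tonight, a kernel panic forced the fallback path instead

def promote_new_pair():
    print("bring the new UserStore pair live")

def fall_back_to_old_pair():
    print("abort: reboot and stay on the original pair")

def start_service():
    print("close service window: bring the shards back up")

t0 = time.time()
stop_service()
copy_database()
if sanity_tests_pass():
    promote_new_pair()
else:
    fall_back_to_old_pair()
start_service()
print(f"window lasted {time.time() - t0:.0f}s (target: ~20 minutes)")
```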
When we stopped the service tonight and copied the database, we were able to transfer the data as planned, but then experienced an unexpected “kernel panic” on the new primary server before we could bring it live. The panic appeared to be triggered in the system’s TCP stack on the crossover DRBD connection between the pair of servers. We haven’t seen this on any of the (nearly identical) NoteStore shard boxes, and we didn’t have a good explanation.
After a quick review, we decided it wasn’t safe to continue the upgrade on this new hardware until we could determine the source of the problem, so we needed to fall back to the original pair of boxes.
Unfortunately, the old boxes needed a reboot to clear a few lingering issues, and we also needed to allocate more storage space to the database so it could run safely, without interruption, for a few more weeks. We’d deferred this maintenance on the assumption that the upgrade would succeed, so we had to resolve these issues after the upgrade failed.
After the reboots and file system adjustments were complete, we brought the old UserStore servers back online. We then brought up the rest of the servers gradually, in order to spread the load on the old hardware, which came back up with no database pages in its RAM cache.
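For readers curious what a staggered restart like that looks like, here's a hypothetical sketch: bring servers back in small batches so a UserStore starting with a cold cache isn't hit with the full load at once. The shard count, batch size, and delay below are illustrative, not our real numbers:

```python
import time

# Illustrative staggered restart: start shards in small batches, pausing
# between batches so the cold UserStore cache has time to warm up.

shards = [f"shard-{n:03d}" for n in range(1, 39)]
BATCH_SIZE = 5
WARMUP_DELAY_S = 60  # let the cache fill between batches

for i in range(0, len(shards), BATCH_SIZE):
    for shard in shards[i:i + BATCH_SIZE]:
        print(f"starting {shard}")  # placeholder for the real start command
    time.sleep(WARMUP_DELAY_S)
```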
We apologize again for any inconvenience the outage may have caused. We’ll be reviewing why our dry-run testing procedures didn’t catch this issue, and we’ll make sure our next upgrade attempt goes smoothly and has a less disruptive fallback plan.