Evernote Tech Blog

The Care and Feeding of Elephants

Outage Details

Tonight, our planned 15-20 minute service window turned into a full two-hour outage. We regret the problems that this may have caused for some of you, and we want to provide a little more technical detail here.

As we mentioned in our Architectural Overview post last year, the hundreds of terabytes of data within Evernote user accounts are spread over hundreds of “shards,” but we have one redundant pair of servers that run our master “UserStore” database. This only has a small amount of information for each registered Evernote account in order to handle authentication and commerce … usernames, passwords (salted+hashed), etc. This database has less than 1kB per account, but since Evernote has over 27 million accounts, the overall size is around 25GB.
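As a quick sanity check on those figures, here is a small back-of-the-envelope sketch. It uses only the numbers quoted above (27 million accounts, under 1kB each, about 25GB total). The function names are made up for illustration, and it assumes decimal units (1kB = 1000 bytes, 1GB = 10^9 bytes).

```python
# Back-of-the-envelope check of the UserStore size figures quoted above.
# Assumption: decimal units (1 kB = 1000 bytes, 1 GB = 1e9 bytes).

ACCOUNTS = 27_000_000      # "over 27 million accounts"
TOTAL_GB = 25              # "around 25GB"
BYTES_PER_KB = 1000
BYTES_PER_GB = 1e9


def upper_bound_gb(accounts: int, max_bytes_per_account: int = BYTES_PER_KB) -> float:
    """Largest possible database size if every account used the full 1 kB."""
    return accounts * max_bytes_per_account / BYTES_PER_GB


def implied_bytes_per_account(total_gb: float, accounts: int) -> float:
    """Average bytes per account implied by the total database size."""
    return total_gb * BYTES_PER_GB / accounts


print(upper_bound_gb(ACCOUNTS))                       # ceiling in GB at 1 kB per account
print(implied_bytes_per_account(TOTAL_GB, ACCOUNTS))  # average bytes per account
```

The implied average comes out a little under 1kB per account, which is consistent with the two figures in the paragraph above.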

This same pair of servers has been running our UserStore database for more than three years, and it’s time to upgrade them. So we set aside two of our new “shard” servers with 3x as much RAM, SSDs instead of 15krpm disks, bonded networking, updated kernel, etc. The hardware and OS configuration on the new systems is virtually identical to the last 38 shards we’ve deployed, and our dry-run tests with the new hardware showed good performance with copies of the UserStore database.

So we planned a scheduled service window tonight to bring the whole service down, copy the current UserStore database to the new hardware, bring it online, and bring the servers back up. In our dry runs, the copy took 8 minutes and the sanity tests and server juggling took another 10-15 minutes, so we aimed for a 20 minute outage.
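The window estimate above is simple addition, sketched here for clarity. The timings are the dry-run numbers from the paragraph above. The helper name is hypothetical and this is not the planning tool that was actually used.

```python
# Rough reconstruction of the planned service window from the dry-run timings.
from typing import Tuple

COPY_MINUTES = 8                    # database copy time in dry runs
OTHER_MINUTES = (10, 15)            # sanity tests and server juggling (min, max)
TARGET_MINUTES = 20                 # the outage length that was aimed for


def window_range(copy_minutes: int, other_minutes: Tuple[int, int]) -> Tuple[int, int]:
    """Return the (best case, worst case) total window in minutes."""
    low, high = other_minutes
    return (copy_minutes + low, copy_minutes + high)


best, worst = window_range(COPY_MINUTES, OTHER_MINUTES)
print(f"Expected window: {best}-{worst} minutes (target {TARGET_MINUTES})")
```

Note that this estimate only covers the case where the upgrade succeeds. It includes no time for a fallback to the old hardware.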

When we stopped the service tonight and copied the database, we were able to transfer the data as planned, but then experienced an unexpected “kernel panic” from the new primary server before we could bring it live. The error appeared to be triggered from the TCP stack on the system on the crossover DRBD connection between the pair of servers. We haven’t seen this on any of the (nearly identical) NoteStore shard boxes, and didn’t have a good explanation.

After a quick review, we decided it wasn’t safe to continue with the upgrade on this new hardware until we could determine the source of this problem. So we needed to fall back on the original pair of boxes.

Unfortunately, the old boxes needed a reboot to clear a few lingering issues and also needed to allocate more storage space to the database in order to be safely used without interruption for a few more weeks. We’d deferred this maintenance on the assumption that the upgrade would be successful, but we needed to resolve these issues after the upgrade failed.

After the reboots and file system adjustments were complete, we brought the old UserStore servers back online. We then brought up the rest of the servers gradually, in order to spread the load on the old hardware (with no DB pages in RAM cache…).
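The gradual restart described above can be sketched roughly as follows. This is a minimal illustration of bringing servers up in small batches with a pause between them, so that a database with a cold cache is not hit by every server at once. The function names, batch size, and pause length are assumptions for the example, not the procedure or values actually used.

```python
# Minimal sketch of a staggered restart: start servers in small batches and
# pause between batches so a cold database is not hit by everyone at once.
# All names and default values here are hypothetical.
import time
from typing import Callable, List, Sequence


def batches(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def bring_up_gradually(
    servers: Sequence[str],
    start: Callable[[str], None],
    batch_size: int = 5,
    pause_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start servers batch by batch, pausing before every batch except the first."""
    started: List[str] = []
    for index, group in enumerate(batches(servers, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # give the database cache time to warm up
        for server in group:
            start(server)
            started.append(server)
    return started
```

In practice, `start` would be whatever command brings a single server back into service, and the pause could be replaced by a check on database load rather than a fixed delay.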

We apologize again for the inconvenience this may have caused you during the outage, and we’ll be reviewing why our dry-run testing procedures didn’t catch this issue and ensuring that our next upgrade goes smoothly with a less disruptive fallback plan.

  1. Wow, that must have been super-stressful for you, Dave. Good job on getting the system back online.

  2. Thanks for the clarification and the insights. I think it now pays off that you have written the architectural overview to let us laymen understand these issues.

    I’m not an expert in system architecture, but I have one question out of curiosity: You said you have a redundant pair of servers for the UserStore DB. Do you need to bring them both down at the same time? What are the technical difficulties in shutting down one of them while still serving requests from the other one?

    In any case, I do appreciate your continuous status updates on status.evernote.com and on this tech blog. Keep up the great work.

    • We can absolutely run the service with only one box from a pair. So if the second UserStore server is down, the service will run fine. But during that period, any new data that’s written will only be saved on the single box until the secondary comes back online to catch up. So we get a little nervous when we’re running in that configuration and work nonstop to get back to full redundancy.

      Note that we have a RAID of multiple disks on each box as well, so the failure of a single hard drive won’t cause data loss even if we’re only running with a single box. But we sleep better at night knowing that there are 4 copies of all of your latest data instead of only 2 copies.

  3. As an Evernote Newby as of Wed. 7/11/12 who is still in a “learning curve” I am relieved that the tasks I tried to accomplish unsuccessfully last night (7/11) weren’t Newby issues. I was fortunate to find a blog last night that advised newbies would not be able to do a lot of Evernote functions. It would’ve been nice to have been informed of the dated “down time” upon installing the App so that I would not have spent 30-45 minutes trying to figure out what I was doing wrong when trying to transfer my workday’s Skitch projects to Evernote & my PC. Just saying… A “heads up” in advance for newly registered users would’ve been appreciated.

    • We post about planned outages on http://status.evernote.com/, and have a widget on our Support page to display the recent messages from that feed.
      Unfortunately, it’s a bit difficult to notify everyone in a more direct way, since sending out 27.5 million emails takes a few days and generates complaints of “spam.” But we’ll keep thinking of other ways we can keep people in the loop.

      One advantage of Evernote’s system is that many of our applications will continue to function just fine while Evernote’s servers are down. The Mac and Windows clients have a full copy of your data and you can keep working while the servers are unavailable (including when you’re on an airplane, etc.). This doesn’t avoid all disruptions, but it helps a bit compared to web-only software.

      Thanks

  4. This is one of the reasons I LOVE Evernote, the transparency and customer focus. A lot of companies would have said ‘we had an unscheduled outage’ and that’s all.

    Thanks for the frank and open explanation.

  5. As a premium member of Evernote in Japan, I must say that the outage lasting over three hours in core Japanese business time yesterday was an unacceptable incident. In view of the number of paying clients in Japan, you should have avoided the business time zone for maintenance activities.

    I demand that you will shift the regular maintenance time so that many of the angry Japanese Evernote clients will not suffer service outage again.

    Representing very angry Japanese paying clients…

  6. Thanks for your transparency. Yeah Mr Japan, as I am in a similar TZ I appreciate your anger. But in a 24 hour global world someone somewhere will suffer.

