Evernote Tech Blog

The Care and Feeding of Elephants

Stages of Denial

At 2:33pm Pacific Time on Tuesday afternoon, an attacker began a Distributed Denial of Service (DDoS) attack against Evernote’s servers. In normal operation, Evernote receives around 0.4 Gbps of incoming traffic and transmits around 1.2 Gbps outbound. We have a diverse set of pipes to the Internet that can handle several times that volume. During this attack, we experienced over 35 Gbps of incoming traffic from a network of thousands of hosts/bots. This quantity of bogus traffic exceeded the aggregate capacity of our network links, crowding out most legitimate users.

By 3:25pm, our Operations team was able to diagnose the problem and enable a DDoS mitigation service that we had previously established with CenturyLink, one of our network providers. This required using BGP to shift traffic off our other feeds and onto CenturyLink, and then enabling mitigation for our main IP address. Their filters were able to remove virtually all of the bogus packets and permit normal user traffic to flow again.

A few minutes later, the attackers shifted the nature of the attack to send different types of network payloads and to target other addresses that happened to share common infrastructure. This resulted in a couple hours of back-and-forth between our network engineers and CenturyLink to adapt to each attack while minimizing the impact on legitimate users (and our own incident response).

As our network links recovered, we received an unusually high volume of pent-up sync activity. This traffic was about 80% higher than we would have seen during a comparable time on another day. The extended outage had caused many of our servers to expire various cached records, so this initial stampede of client traffic led to a temporary spike in query volume on our central accounts database that was unsustainable.

This required us to perform a rolling service restart to give individual shards time to handle their users’ pent-up synchronizations and repopulate their caches without overloading our accounts database. (This problem was not directly related to the DDoS; it was an unanticipated side effect of an hour-long outage followed by an immediate resumption of full service, without our database and Couchbase cluster having been tuned for that scenario.)
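
To make the mechanism concrete, here is a minimal sketch of the two ideas involved: bringing shards back one at a time, and letting only one request per key repopulate an expired cache entry so that a flood of pent-up syncs doesn’t become a flood of identical queries against a central database. This is illustrative only, not Evernote’s actual code; the names fetch_account_from_db() and resume_shard() are hypothetical stand-ins.

```python
# Illustrative sketch: staggered shard restarts plus per-key request
# coalescing so a "recovery stampede" doesn't overload a central database.
import threading
import time
import random

_cache = {}                  # key -> (value, expiry_timestamp)
_locks = {}                  # key -> lock guarding repopulation of that key
_locks_guard = threading.Lock()

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get_account(key, fetch_account_from_db, ttl=300):
    """Return a cached record, letting only one caller per key hit the database."""
    entry = _cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]
    with _lock_for(key):
        # Re-check: another thread may have repopulated while we waited.
        entry = _cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]
        value = fetch_account_from_db(key)        # the expensive central-DB query
        _cache[key] = (value, time.time() + ttl)
        return value

def rolling_restart(shards, resume_shard, delay_range=(30, 90)):
    """Bring shards back one at a time with a randomized pause between them."""
    for shard in shards:
        resume_shard(shard)                       # shard works through its pent-up syncs
        time.sleep(random.uniform(*delay_range))  # spread out the cache repopulation
```

The per-key lock is a simple form of request coalescing; in a multi-process deployment the same effect is usually achieved with a shared lock or a stale-while-revalidate policy in the cache layer itself.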

The ongoing network attacks and service after-effects persisted for nearly three more hours until the last components were restored to full functionality at 6:14pm.

CenturyLink’s DDoS mitigation service was able to scrub out invalid traffic to restore access, but it took a while for us to enable and configure the solution. This process was a bit haphazard because we had not yet completed our deployment configuration and testing before the incident began. Our networking team had only recently contracted for this service, and they were carefully working through a deployment plan to ensure smooth operation during a future incident. All of the procedures and runbooks were still being drafted, so we hadn’t yet determined exactly which rules would need to be applied to block an attack while permitting all legitimate traffic.

Our final days of testing and configuration were compressed down to a few hours, so the initial DDoS mitigation heuristics were not tuned for our particular application characteristics. Those heuristics scrubbed out virtually all of the bogus traffic, but they also produced a moderate level of “false positives,” blocking some legitimate users (and partner services like Livescribe) from connecting to Evernote.

Over the following day, we saw another wave of network attacks, which were fully mitigated. Our network engineers worked with CenturyLink to incrementally refine our filtering heuristics to reduce the number of legitimate users that were blocked. As of 4pm Wednesday, we felt that we had addressed virtually all of the incorrect blockages to restore service to the remainder of our customers.

Post-Mortem

Overall, our Operations crew handled their DDoS trial-by-fire extremely well, but we have work ahead to minimize the disruption to our users in future incidents.

The network engineers get to complete the DDoS procedures, configurations, runbooks, automation, etc. so that they can trigger the full set of mitigations in minutes rather than hours. The systems group has a set of improvements planned to make the service handle “recovery stampedes” after extended outages more gracefully. And our client teams have a couple of tickets to reduce those stampedes in the first place.
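
On the client side, the usual way to soften a recovery stampede is to retry failed syncs with exponential backoff and random jitter, so that clients which failed together do not all reconnect at the same instant. The sketch below is illustrative rather than a description of our actual clients; attempt_sync() is a hypothetical stand-in for whatever call a real client makes.

```python
# Illustrative sketch: exponential backoff with "full jitter" for sync retries.
import random
import time

def sync_with_backoff(attempt_sync, base=2.0, cap=300.0, max_retries=10):
    """Retry a sync call, waiting a randomized, exponentially growing delay."""
    for attempt in range(max_retries):
        try:
            return attempt_sync()
        except ConnectionError:
            # Pick a uniform delay up to the exponential ceiling (capped).
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
    raise RuntimeError("sync failed after %d attempts" % max_retries)
```

The jitter matters more than the exponent here: without it, every client that saw the outage end at the same moment would retry on the same schedule.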

Ultimately, we know that every minute of outage for the Evernote service may prevent important tasks for thousands of our users, so we will make every effort to reduce or eliminate the impact of such attacks in the future.

  1. Thanks so much for the work you guys did to mitigate this, but also for your clear communication throughout the situation. It was great to know what was happening through various sources (I was relying on your Twitter account and your blog), and I appreciated that you guys were transparent and up-front about what was going on.

    One thing that might be another great follow-up would be a post for folks (which could then be linked to if/when there are future similar situations) that includes best practices for how to manage our data while the sync functionality isn’t working — how best to perform some stopgap operations on our personal side so we have the data we need where we need it, but also so it ends up all synced and happy after the fact. I had my own ad hoc way of handling things (cutting and pasting notes I would have otherwise just emailed, for example, so I knew they were at least in one of my Desktop apps and would sync later), but I’d love to know what you guys recommend, specifically, for how we handle the data transfer stoppage on a temporary basis.

    Thanks!

    • Thanks, Genie. Part of our job is making it work so that you don’t have to do manual steps when we have an outage. Our mail gateway is still catching up on the emails sent into accounts, but all of those should be finished today, so there won’t be a need to resend. Our desktop applications should do a good job of handling an extended outage … notes you create should just get to the server when service is restored.

      There are a few cases that aren’t handled automatically, however. If you rely on the web interface for Evernote, then that doesn’t have any sort of “offline mode.” Our mobile clients may not sync all note contents by default, so if you happen to try to view a note on your iPhone that you haven’t inspected in a while, it may not be available while the service is down unless you’ve set up “Offline Notebooks” on the phone.

  2. Excellent rundown of the entire ordeal. The intensity as well as the frequency of these attacks has been growing and growing of late. While extremely problematic and troublesome, it must have been very challenging and satisfying to be able to work towards finding ways to mitigate the attack. Kudos!

  3. I like to email notes and comments to Evernote from my phone (Windows phone, using gmail.) That way, the email also goes to my computer because I get my gmail on Outlook, so it is on the computer, not just online. Therefore, I have all the emails that I sent for the last 24 or so hours that never showed up in my Evernote. I didn’t realize that there was a problem for quite a while because I find that frequently (once or twice/week) the emails go astray or don’t show up. Because I have those emails, I can see that now and then I make a mistake when I send the email, but most often I don’t see any errors in either the email address or the subject line. Sometimes they never show up. Sometimes much later, hours or the next day, the email is in Evernote in the conflicting changes folder. Bottom line is that I have copies of what is lost.

  4. Thanks for the excellent rundown. But it does leave unanswered a key question. What motivated this attack? Is it random net vandalism or are there other reasons for attacking Evernote?

  5. Yes, thanks for the speedy work and the transparency. I’m also wondering what motivated the attack – perhaps an unanswerable question at this point.

