At 2:33pm Pacific Time on Tuesday afternoon, an attacker began a Distributed Denial of Service (DDoS) attack against Evernote’s servers. Under normal conditions, Evernote receives around 0.4 Gbps of incoming traffic and transmits around 1.2 Gbps outbound. We have a diverse set of pipes to the Internet that can handle several times that volume. During this attack, we experienced over 35 Gbps of incoming traffic from a network of thousands of hosts/bots. This quantity of bogus traffic exceeded the aggregate capacity of our network links, which crowded out most legitimate users.
By 3:25pm, our Operations team was able to diagnose the problem and enable a DDoS mitigation service that we had previously established with CenturyLink, one of our network providers. This required moving traffic away from our other feeds to CenturyLink via BGP and then enabling mitigation for our main IP address. Their filters were able to remove virtually all of the bogus packets and permit normal user traffic to flow again.
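The BGP step amounts to withdrawing our route announcements from the other transit sessions so that all inbound traffic converges on the provider doing the scrubbing. A minimal sketch of that idea, in Cisco-style syntax with placeholder ASNs, neighbor addresses, and prefixes (none of these are Evernote’s actual values):

```
! Hypothetical router config -- all numbers are documentation-range
! placeholders, not real ASNs or prefixes.
router bgp 64500
 neighbor 192.0.2.1 remote-as 64496      ! other transit feed
 neighbor 198.51.100.1 remote-as 64497   ! CenturyLink (mitigation path)
 ! Shut down the session to the other feed so inbound traffic
 ! converges on CenturyLink, where DDoS scrubbing can be applied:
 neighbor 192.0.2.1 shutdown
 network 203.0.113.0 mask 255.255.255.0
```

In practice this is often done with route-maps or selective prefix withdrawal rather than a full session shutdown, but the effect is the same: the scrubbing provider becomes the only advertised path in.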
A few minutes later, the attackers shifted the nature of the attack to send different types of network payloads and to target other addresses that happened to share common infrastructure. This resulted in a couple hours of back-and-forth between our network engineers and CenturyLink to adapt to each attack while minimizing the impact on legitimate users (and our own incident response).
As our network links recovered, we received an unusually high volume of pent-up sync activity. This traffic was about 80% higher than we would have seen during a comparable time on another day. The extended outage had caused many of our servers to expire various cached records, so this initial stampede of client traffic led to a temporary spike in query volume on our central accounts database that was unsustainable.
This required us to perform a rolling service restart to give individual shards time to handle their users’ pent-up synchronizations and repopulate their caches without overloading our accounts database. (This problem was not directly related to the DDoS; it was just an unanticipated side effect of having an hour-long outage followed by an immediate resumption of full service without tuning our database and Couchbase cluster for that scenario.)
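The underlying failure mode here is a classic cache stampede: many requests miss a cold cache at once and all hit the backing database. One common defense, which the rolling restart approximated operationally, is to bound how many cache-miss lookups may query the database concurrently. A minimal sketch (not Evernote’s code; the loader and class names are illustrative):

```python
# Sketch: throttle cache-miss lookups so a cold cache after an outage
# cannot flood the central database with a thundering herd of queries.
import threading

class ThrottledCache:
    def __init__(self, loader, max_concurrent_loads=4):
        self._cache = {}
        self._lock = threading.Lock()
        self._loader = loader                      # e.g. an accounts-DB query
        self._db_slots = threading.Semaphore(max_concurrent_loads)

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]            # warm path: no DB hit
        # Cold path: at most max_concurrent_loads threads may query the
        # database at once; the rest wait instead of stampeding it.
        with self._db_slots:
            value = self._loader(key)
        with self._lock:
            self._cache[key] = value
        return value
```

Real deployments usually add per-key deduplication as well, so that a thousand misses on the same record produce one database query rather than a thousand throttled ones.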
The ongoing network attacks and service after-effects persisted for nearly three more hours until the last components were restored to full functionality at 6:14pm.
CenturyLink’s DDoS mitigation service was able to scrub out invalid traffic to restore access, but it took a while for us to enable and configure the solution. This process was a bit haphazard because we had not yet completed our deployment configuration and testing before the incident began. Our networking team had only recently contracted for this service, and they were carefully working through a deployment plan to ensure smooth operation during a future incident. All of the procedures and runbooks were still being drafted, so we hadn’t yet determined exactly which rules would need to be applied to block an attack while permitting all legitimate traffic.
Our final days of testing and configuration were compressed down to a few hours, so the initial DDoS mitigation heuristics were not tuned for our particular application characteristics. The filtering was successful at scrubbing out virtually all of the bogus traffic, but led to a moderate level of “false positives,” which blocked some legitimate users (and partner services like Livescribe) from connecting to Evernote.
Over the following day, we saw another wave of network attacks, which were fully mitigated. Our network engineers worked with CenturyLink to incrementally refine our filtering heuristics to reduce the number of legitimate users that were blocked. As of 4pm Wednesday, we felt that we had addressed virtually all of the incorrect blockages to restore service to the remainder of our customers.
Overall, our Operations crew handled their DDoS trial-by-fire extremely well, but we have work ahead to minimize the disruption to our users in future incidents.
The network engineers get to complete the DDoS procedures, configurations, runbooks, automation, and so on, so that they can trigger the full set of mitigations in minutes rather than hours. The systems group has a set of improvements planned to make the service handle “recovery stampedes” after extended outages more gracefully. And our client teams have a couple of tickets to reduce those stampedes in the first place.
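On the client side, the standard way to reduce a recovery stampede is to retry failed syncs with exponential backoff plus random jitter, so reconnections spread out over time instead of arriving in one burst the moment service returns. A sketch of the “full jitter” variant (illustrative; the function name and constants are assumptions, not Evernote’s client code):

```python
# Sketch: exponential backoff with full jitter for client sync retries.
import random

def next_retry_delay(attempt, base=2.0, cap=300.0):
    """Delay in seconds before retry number `attempt` (0-based)."""
    window = min(cap, base * (2 ** attempt))   # 2s, 4s, 8s, ... capped at 5 min
    return random.uniform(0, window)           # pick uniformly within the window
```

Because each client picks a random point inside its backoff window, the post-outage reconnect load arrives as a gentle ramp rather than a spike.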
Ultimately, we know that every minute of outage for the Evernote service may prevent important tasks for thousands of our users, so we will make every effort to reduce or eliminate the impact of such attacks in the future.