Scenario: In the last week or two, lots of people noticed sporadic errors when they tried to synchronize with Evernote or access our web site. The errors would disappear if they manually forced another sync or reload. The web site worked fine after that initial hiccup.
Debugging: The symptoms pointed to a problem with establishing new HTTPS connections, since subsequent requests (over keep-alive connections) worked fine. We were able to reproduce the problem by just hitting our site with ‘curl’ a few times:
curl -v -i -s https://www.evernote.com/robots.txt
The majority of requests worked fine, but some percentage would fail with an “SSL protocol error”:
* About to connect() to www.evernote.com port 443 (#0)
*   Trying 18.104.22.168... connected
* Connected to www.evernote.com (22.214.171.124) port 443 (#0)
* SSLv3, TLS handshake, Client hello (1):
* Unknown SSL protocol error in connection to www.evernote.com:443
* Closing connection #0
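Since the failures were intermittent, it helped to loop the probe and tally failures rather than eyeball individual runs. A minimal sketch (the helper function is our illustration here, not a tool from our toolchain):

```shell
#!/bin/sh
# count_failures CMD N: run CMD N times and print how many runs failed.
count_failures() {
    cmd=$1
    n=$2
    fail=0
    i=0
    while [ "$i" -lt "$n" ]; do
        $cmd || fail=$((fail + 1))
        i=$((i + 1))
    done
    echo "$fail"
}

# Example: probe the site 20 times and report the failure count.
# count_failures "curl -s -o /dev/null https://www.evernote.com/robots.txt" 20
```

Running this a few times during peak hours versus off-peak makes a sporadic failure rate visible as a number you can track.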
The openssl ‘s_client’ command-line tool is also useful for low-level SSL debugging, and it showed a similar error during SSL negotiation:
$ openssl s_client -ssl3 -state -debug -msg -connect www.evernote.com:443
CONNECTED(00000003)
SSL_connect:before/connect initialization
...
>>> SSL 3.0 Handshake [length 005e], ClientHello
...
<<< SSL 3.0 Handshake [length 004a], ServerHello
...
<<< SSL 3.0 Handshake [length 0004], ServerHelloDone
...
>>> SSL 3.0 Handshake [length 0084], ClientKeyExchange
...
>>> SSL 3.0 ChangeCipherSpec [length 0001]
...
>>> SSL 3.0 Handshake [length 0028], Finished
...
SSL_connect:SSLv3 write finished A
SSL_connect:SSLv3 flush data
read from 0x100119170 [0x100811400] (5 bytes => 0 (0x0))
SSL_connect:failed in SSLv3 read finished A
5023:error:1409E0E5:SSL routines:SSL3_WRITE_BYTES:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-35/src/ssl/s3_pkt.c:530:
This failure deep within the SSL handshake was particularly confusing. It seemed like our HTTPS server was dying in the middle of the SSL negotiation.
As I mentioned in our Architectural Overview, we offload our SSL processing onto a pair of A10 AX 2500 load balancers that have performed well since we installed them in January. But this error made us worry that there might be some sort of deep cryptographic error within that hardware.
So we wasted a couple of days trying to fix the problem by rebooting (and cold booting) the boxes on the theory that this would “shake loose” the cryptographic errors. This caused the problem to go away for a while, but then it would be back again the next morning during our peak traffic hour (6am-7am Pacific). Our theories about the possible causes got more and more complicated, and our proposals for fixes got more and more baroque (and expensive).
Finally, we realized that the failure rate seemed to be highest during our peak traffic times, and that the problem might actually be a simpler capacity issue. That is, maybe we’d grown enough to exceed the ability of this hardware. Unfortunately, we didn’t see anything in the UI for our balancer that indicated we were at any sort of capacity limit:
The CPU usage was low, and the “SSL Conns/sec” was around 14% of the rated SSL CPS for this hardware. We contacted our support representatives from A10 and they scheduled a call with several of their experts to help track down the problem or get us new hardware if needed.
They told us immediately that we were hitting a limit: not the “new connections per second” limit, but the “total open SSL connections” limit of 250,000.
We were only receiving 1105 new SSL connections per second, and only processing 2500 HTTP requests per second over those connections, but we were holding a very large pool of idle connections. This was due to an “idle connection timeout” setting of 10 minutes, which meant we’d keep each HTTPS socket open for 600 seconds after the last response to the client before closing it.
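The arithmetic makes the problem obvious: by Little’s Law, the steady-state number of open connections is roughly the new-connection rate multiplied by how long each connection is held. A quick sketch with the numbers above (the rate and timeout are our measured figures; the product is only a rough upper bound, since connections that clients close early leave the pool sooner):

```shell
# Steady-state open connections ~= arrival rate * time each socket is held.
rate=1105      # new SSL connections per second (from our monitoring)
timeout=600    # idle connection timeout: 10 minutes
limit=250000   # total open SSL connection limit on the AX 2500
echo "estimated open connections: $((rate * timeout)) (limit: $limit)"
```

Even a fraction of that estimate blows well past the 250,000-connection limit, which is why the box started refusing new handshakes at peak traffic while its CPU sat nearly idle.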
In retrospect, this timeout setting was a bit excessive. Empirical testing with openssl shows that this is several times longer than the idle SSL connection timeouts used by other big Internet web services.
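That kind of comparison can be made with a simple probe: open a connection, send one keep-alive request, then sit idle and time how long the server keeps the socket open. A hypothetical sketch (with openssl, -quiet implicitly turns on -ign_eof, so s_client stays connected after stdin closes until the server hangs up):

```shell
# Probe a server's idle HTTPS timeout: one keep-alive request, then silence.
# When the server finally closes the idle socket, s_client exits and `time`
# reports roughly how long the connection was held open.
time sh -c '
  printf "GET /robots.txt HTTP/1.1\r\nHost: www.evernote.com\r\nConnection: keep-alive\r\n\r\n" |
    openssl s_client -quiet -connect www.evernote.com:443
'
```

Pointing the same probe at a few large sites gives a ballpark of what idle timeouts other services consider reasonable.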
Now, we’ve lowered the idle connection timeout to 2 minutes. This means that we’re closing idle connections 5x faster, and the number of open connections has dropped dramatically as a result. As I type this:
[Yeah, I can kind of go overboard with Skitch. But, to be fair, I liked it long before the acquisition…]
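Redoing the steady-state estimate (new connections per second multiplied by the idle timeout) with the 2-minute value shows why the pool shrank so sharply; the rate is the figure we measured earlier:

```shell
rate=1105      # new SSL connections per second, same load as before
timeout=120    # idle connection timeout, now 2 minutes
limit=250000   # total open SSL connection limit on the AX 2500
echo "estimated open connections: $((rate * timeout)) (limit: $limit)"
```

That upper bound now sits comfortably under the hardware limit, with headroom left for growth.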
The moral of the story? Sometimes, it’s easy to overlook the simple explanation for a problem when the symptoms are unusual. Our apologies to folks who were inconvenienced by the sync errors while we sorted out these problems.