Evernote’s servers process a lot of data for our users. At any given time, a shard may be performing different activities for different clients. For example:
- Constructing dynamic web pages for user accounts
- Performing API calls on notebooks, tags, etc.
- Uploading or downloading images
- Uploading or downloading other files (audio, PDF, etc.)
- Clipping web pages from remote sites
- Managing image/PDF text recognition
- Indexing notes (including search data and PDF contents) into Lucene indices
- Performing searches against Lucene
- Rendering “thumbnail” images for notes, images, PDFs
- Handling new or recurring payment processing for PayPal, Google Checkout, CyberSource, iTunes
This heterogeneous mix of tasks and data means that our application can be particularly sensitive to concurrency bottlenecks in both our code and third-party libraries. While Evernote activity isn’t particularly “bursty” compared to some web services, the daily variation across our 95 shards means that even infrequent chokepoints will hit some shards from time to time.
When our monitoring systems detect that a particular shard is underperforming, we try to capture as much information as possible about the current state of the server without introducing more problems. One low-tech tool is “sudo killall -3 java”, which dumps the current stack trace for every Java thread to standard output. We can then inspect the state of each thread for signs of problems. Here’s a fun example of the sort of bottleneck we find by inspecting enough stack dumps:
We regularly found a number of threads in a choking server all waiting to convert bytes to a String using a named encoding, or vice versa. The blocked threads originated in code from Tomcat, MySQL Connector/J, GWT, SAX, Thrift, and the JRE itself. They all looked something like this:
java.lang.Thread.State: BLOCKED (on object monitor)
    at sun.nio.cs.FastCharsetProvider.charsetForName(Unknown Source)
    - waiting to lock <0x00007f4c3d48acb0> (a sun.nio.cs.StandardCharsets)
    at java.nio.charset.Charset.lookup2(Unknown Source)
    at java.nio.charset.Charset.lookup(Unknown Source)
    at java.nio.charset.Charset.isSupported(Unknown Source)
    at java.lang.StringCoding.lookupCharset(Unknown Source)
    at java.lang.StringCoding.encode(Unknown Source)
    at java.lang.String.getBytes(Unknown Source)
    at com.mysql.jdbc.StringUtils.getBytes(StringUtils.java:499)
    ...
After reading the JRE code, we found that the concurrency bottleneck is caused by a simple synchronized block in the [ironically-named?] FastCharsetProvider.charsetForName method, which looks up a cached Charset for a String name (like “UTF-8”). Using Java’s ‘synchronized’ keyword to protect this in-memory cache prevents two threads from corrupting the cache data structures, but it also means that only one thread can look in the cache at a time, even when every thread only wants to read.
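To make the pattern concrete, here is a minimal sketch of a cache guarded by a single lock. It is a simplified illustration only and not the actual FastCharsetProvider source; the class name SynchronizedCharsetCache and its internals are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the pattern; NOT the real sun.nio.cs code.
final class SynchronizedCharsetCache {
    // A plain HashMap is not thread-safe, so every access must hold the lock.
    private final Map<String, Charset> cache = new HashMap<>();

    Charset charsetForName(String name) {
        // One monitor guards both reads and writes, so even cache hits
        // from different threads are serialized here.
        synchronized (this) {
            Charset cs = cache.get(name);
            if (cs == null) {
                cs = Charset.forName(name); // throws if the name is unsupported
                cache.put(name, cs);
            }
            return cs;
        }
    }
}
```

With many threads encoding and decoding strings at once, each call queues on the same monitor, which is the kind of "waiting to lock" state the stack dump shows.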
There’s at least one Java RFE filed to improve this bottleneck. As suggested by Paul Linder, the modern ConcurrentHashMap collection provides a better alternative to fully synchronized classic Maps for caching.
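As a rough sketch of that suggestion, a name-to-Charset cache built on ConcurrentHashMap could look like the following. The class ConcurrentCharsetCache is a hypothetical helper for illustration, not the JRE fix or code from our codebase.

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: caches Charset lookups without a global lock on reads.
final class ConcurrentCharsetCache {
    private static final ConcurrentMap<String, Charset> CACHE =
            new ConcurrentHashMap<>();

    private ConcurrentCharsetCache() {}

    static Charset forName(String name) {
        // Reads of a ConcurrentHashMap do not block each other.
        Charset cs = CACHE.get(name);
        if (cs == null) {
            // Only a cache miss falls through to the JRE lookup.
            Charset resolved = Charset.forName(name); // throws if unsupported
            // If another thread raced us, keep the value it stored.
            Charset previous = CACHE.putIfAbsent(name, resolved);
            cs = (previous != null) ? previous : resolved;
        }
        return cs;
    }
}
```

The first lookup of each name still goes through Charset.forName, but after that, threads asking for the same name no longer wait on each other.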
But we can’t really afford to wait for a full JRE fix, so we have to reduce the impact of this bottleneck ourselves by:
- Patching Tomcat
- Patching the GWT parser
- Suggesting fixes for MySQL Connector/J
- Replacing all relevant byte<->String transformations across our own codebase. (This includes such unpleasantness as replacing every use of the JRE classes URLEncoder and URLDecoder, which have their own unpatchable String encoding code.)
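For the last item, one way to sidestep the name lookup is to hold a Charset object and pass it to the String overloads that accept a Charset, rather than passing a name like "UTF-8". The sketch below assumes Java 7 or later for StandardCharsets (on older JREs, a Charset resolved once into a static field would play the same role), and the helper class Utf8 is a hypothetical name, not something from our codebase.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that avoids by-name charset lookups on every call.
final class Utf8 {
    // Resolved once; no name lookup is needed at call time.
    private static final Charset UTF_8 = StandardCharsets.UTF_8;

    private Utf8() {}

    static byte[] encode(String s) {
        // String.getBytes(Charset) takes the Charset directly,
        // unlike String.getBytes(String charsetName).
        return s.getBytes(UTF_8);
    }

    static String decode(byte[] bytes) {
        // new String(byte[], Charset) likewise skips the name lookup.
        return new String(bytes, UTF_8);
    }
}
```

Calling Utf8.encode(text) in place of text.getBytes("UTF-8") also removes the need to handle the checked UnsupportedEncodingException.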
Short version: Large-scale concurrency is kind of hard. Java’s ConcurrentHashMap is super awesome for in-memory caching.