Evernote Tech Blog

The Care and Feeding of Elephants

WhySQL?

When we describe our overall service architecture to smart people who have been involved in other big services, the two most common questions are:

  1. Why is your structured data stored in SQL databases instead of something like [big-data, web-scale, No-SQL platform X]?
  2. Why are you running your own hardware instead of hosting Evernote in [cloud service provider Y]?

These are both valid and interesting questions. I’ll start with #1 and save #2 for a future post.

For the right application, a modern key-value storage engine may offer significant performance or scalability advantages in comparison to a single SQL instance. There are a few reasons that we’ve decided to store all of your account metadata within a single (replicated) MySQL instance instead.

Electric Kool-Aid

First, the ACID properties of a transactional database like MySQL’s InnoDB are important for our application and synchronization model.

Here’s a little snippet of the database tables for storing “notebooks” and “notes” within a shard’s SQL database:

CREATE TABLE notebooks (
  id int UNSIGNED NOT NULL PRIMARY KEY,
  guid binary(16) NOT NULL,
  user_id int UNSIGNED NOT NULL,
  name varchar(100) COLLATE utf8_bin NOT NULL,
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8; 

CREATE TABLE notes (
  id int UNSIGNED NOT NULL PRIMARY KEY,
  guid binary(16) NOT NULL,
  user_id int UNSIGNED NOT NULL,
  notebook_id int UNSIGNED NOT NULL,
  title varchar(255) NOT NULL,
  ...
  FOREIGN KEY (notebook_id) REFERENCES notebooks(id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

If you create a notebook named “Cooking” on your Windows client and then immediately clip a recipe for “Quick Tomato Sauce” into that notebook, your client will do something like this on the next sync:

Each of these coarse-grained API calls is implemented through single SQL transaction, which ensures that a client can completely trust any reply given by the server. The ACID-compliant database ensures, for example:

Atomicity: If an API call succeeds, then 100% of the changes are completed, and if an API call fails, then none of them are committed. This means that if we fail trying to store the fourth image in your Note, there isn’t a half-formed Note in your account and incorrect monthly upload allowance calculations to charge you for the broken upload.

Consistency: At the end of any API call, the account is in a fully usable and internally consistent state. Every Note has a Notebook and none of them are “dangling.” The database won’t let us delete a Notebook that still has Notes within it, thanks to the FOREIGN KEY constraint.

Durability:  When the server says that a Notebook was created, the client can assume that it actually exists for future operations (like the createNote call). The change is durable so that the client knows that it has a consistent reflection of the state of the service at all times.

The Durability property is the most important for our synchronization protocol … if the client can’t assume that changes made on the server will be Durable, then the protocol would become much more complex and inefficient. Each synchronizing client would need to constantly double-check whether the state of each server object matched the local state. Maintaining absolute consistency for an account with 20k Notes, 40k Resources, and 10k Tags would be very expensive if changes couldn’t assume Durability.

Big Data?

The ACID benefits of a transactional database make it very hard to scale out a data set beyond the confines of a single server. Database clustering and multi-master replication are scary dark arts, and key-value data stores provide a much simpler approach to scale a single storage pool out across commodity boxes.

Fortunately, this is a problem that Evernote doesn’t currently need to solve. Even though we have nearly a billion Notes and almost 2 billion Resource files within our servers, these aren’t actually a single big data set.  They’re cleanly partitioned into 20 million separate data sets, one per user.

This extreme locality means that we don’t have one “big data” storage problem, but rather we have a lot of “medium data” storage problems that partition neatly into a sharded architecture.

But maybe later…

We’re very interested in all of the cool new data storage systems for future projects that don’t require strong ACID transactionality and do require horizontal scalability. For example, our reporting and analytics system has gradually outgrown its current MySQL platform and needs to be replaced with something bigger/faster/cooler.

But we’re relatively satisfied with sharded MySQL storage for Evernote user account metadata, even though that’s not going to win any style points from the cool kids.

34 Comments

  1. There are other reasons to choose NoSQL
    For example, when Craigslist was using mySql and they had to change their schema, it took MONTHS to facilitate the change across all their slaves.
    You can also have a mixed strategy of using both RDBMS and NoSQL to achieve consistency while being able to be flexible to architecture changes.
    Lastly- have you looked at total overal cost? Setting up a large cluster with mySql will have a large operational cost and it may not be partition tolerant so if the wrong servers go down, it may cascade to your whole data store.

    • When we add new optional information to core data structures (like adding the ‘placeName‘ attribute on Notes), we map that onto 1:N ‘attributes’ tables that don’t require table schema changes.
      Less frequently, we have changed existing tables during our weekly maintenance/release cycle. We do that every month or two, and it usually takes 30-120 seconds since it’s parallelized across 187 different shards with 15krpm drives in each.

      We aren’t using a single homogeneous “MySQL cluster”, but rather 187 completely independent single-tiered servers, so we don’t have “monolithic schema update” problems, and any issues on any individual shard are completely isolated from other shards.

      “Schema free” databases are a bit like weakly-typed programming languages … they provide a an easier programming model when everything goes right, but also more opportunities for you to shoot yourself in the foot. The flexibility of “no fixed representation of data” also means that you need to be rigorous to avoid spaghetti unmarshalling code with dozens of switch clauses for every variant of data you might find.

      This isn’t a “good” versus “bad” thing, but rather you’re making different trade-offs, which may be desirable in some situations (and not in others).

    • Months?!? Could you be more specific on the exact schema change in MySQL that directly resulted in such a lengthy implementation cycle, and how a NoSQL DB would have made this largely a non-issue?

      And, yes, I find it hard to believe, but am open-minded, esp. to tangible examples.

    • I’m having a hard time believing that coding for no schema is really more efficient than coding for a schema that might change. What I’ve normally observed is the problem is coding for a static schema, which will kill you regardless of whether you use a traditional RDBMS or a NoSQL database.

  2. I’d like to hear more about what your actual sharding policy is? You have one single mysql server, and 20M users – does that mean you have 20M tables?

    • We have 187 “shards”, each has one MySQL database instance and set of tables that store the user data for the 100,000+ users with accounts on that shard. So the ‘user_id’ column on virtually every table provides the partitioning by user, and there’s only one ‘notes’ table per shard.

      • Do you have a master database for authentication, and for looking up the shard for a given user? Or some other mechanism?

        • Yes, we have one small master “account” database that we use for simple things like username, password, etc.
          http://blog.evernote.com/tech/2011/05/17/architectural-digest/

      • Thanks very much for sharing Dave! It really helps to have access to real examples such as Evernote.

  3. great pragmatic viewpoint … I have never seen anyone throw tools out of their toolshed and I advocate the same thing with software.

    Lets face it, the IT media and in some respects the hot house culture that surrounds our industry tends to create hype/fashion cycles … they like a meme with legs (web 2.0 anyone) that explains incredibly complex activities and goals in a single easy to digest term.

    What you are doing with Evernote data model is being pragmatic with being a good curator of the current data model and simultaneously open to new techniques … in fact I would argue you are well positioned to take advantage of the new tech (bigdata) with minimal risk then going all in.

    Its not just about the right tool for the job but its also about the need for us to promote a rich environment that tolerates diversity (not for the sake of diversity) that lets us match any specific software to their strengths and reduce the impact of their weaknesses.

    What I find odd is that there does exist ‘bigdata’ ACID compliant systems (like MarkLogic, sorry I had to get that in there) which works at petabyte scales and does not compromise partially due to the fact we use mvcc architecture, partly due to the fact that we think systems that replace SQL servers will have to be better in every respect (e.g. scale bigger, faster AND ACID) before people really consider moving wholesale to them.

  4. Thanks for sharing.
    Could you elaborate a bit on:
    “Even though we have nearly a billion Notes and almost 2 billion Resource files within our servers, these aren’t actually a single big data set. They’re cleanly partitioned into 20 million data separate data sets, one per user.”

    What do you mean by “cleanly partitioned” and how are you improving performance with it?

    • I probably should have said ‘logically partitioned’. The data in your account is completely independent of the data in my account, and our clients only need to access data in our own accounts (with narrow and fixed-size exceptions for shared notebooks).

      So the 20 million different accounts could be spread across N different servers, where the ratio of 20,000,000:N can be chosen based on optimal hardware price/performance.

      In our case, we allocate 100,000 or more accounts to each box, and put the metadata into one database on each, with indices to optimize access to data in a particular user account.

      • Is the sharding algorithm as simple as all new users go to shard x. And once that hits 100K users, all new users are assigned to shard x+1?

        If so, do you have issues where the current new user shard has significant usage given new users may be highly active when they start using the service?

        Have you thought about round robining new users across a variety of shards to potentially spread the load?

        • Hey, Sachin!
          The load balancers know about a pool of servers that are available for new registrations, and they just send most traffic without shard IDs to those servers.
          Originally, we handled this using two new shards at a time, but now we deploy a rack of 10 servers at a time and put them all into the registration pool to spread the load for the reason you mention.

  5. Dave, thanks for the very interesting post. I’ve published some follow up comments about the requirements you’ve exposed from NoSQL perspective: http://nosql.mypopescu.com/post/18126503708/whysql-mysql-innodb-acid-guarantees-for-evernote

    It would be great to know what do you think of it and if my comments are adding some details or clarifying some of the things with NoSQL databases.

    • The majority of our transactions are definitely multi-row. Creating a note could theoretically create 2,000 rows in the database, and update others (like the row that stores the user’s upload allowance for the month).

      Obviously, on any Turing-complete computing system, you can try to compensate for the shortcomings of one system by adding more smart code and application-layer locking mechanics. But it’s also nice to be able to say that “Every Evernote account is absolutely in a state of consistency that matches our API data model” because I know that it’s impossible to accidentally store a Note with a reference to a Tag that doesn’t exist, etc.

      • Dave-

        Great points. The consistency is key and I think a major reason why NoSQL has been limited to big-scale, web-facing problems to a certain extent.

        I wrote up some more detailed thoughts about it here:
        http://www.cambridgesemantics.com/blog/-/blogs/whysql-evernote%E2%80%99s-boring-old-reliable-architectu-1

        cheers!
        -Rob

  6. This information will surprise many.
    Best phrase in the post:
    “Database clustering and multi-master replication are scary dark arts”…

  7. Why the GUID and ID fields instead of one or the other? It makes sense you want to use a GUID for better global uniqueness, especially if you partition the DB across multiple servers. But why then have an ID field?

    • Good question. When we first built the API and the back-end service, we used the generated GUID (in string form as a char(36)) as the primary key on every table, since this was represented externally through the API.

      We later realized that this was a horrible decision for performance due to the way that InnoDB uses primary keys in secondary indices … every index entry had an extra 36 bytes to store the primary key.

      So we did a pretty huge schema overhaul two years ago to replace the string-based primary key with a 4-byte numeric synthetic ID, and pushed the GUID into a binary(16) with a secondary index on it. That reduced the size of all of our secondary indices by a factor of 2-8x, which makes them smaller, faster, etc.

      In hindsight, I now wish I’d also shoved the user_id into each primary key to force better disk locality of individual user data in the database via InnoDB clustered primary keys, but that optimization probably isn’t worth the effort of changing all of the Hibernate code to use compound primary keys for references, etc.

  8. “They’re cleanly partitioned into 20 million data separate data sets, one per user.”

    Do you mean that you have one set of tables per user, or one database per user?

    • Sorry, that was a little unclear.
      The logical data set for each user is logically independent, so don’t need to go into a single massive data store.
      Currently, we put 100,000+ users onto a single shard, and there’s a single MySQL instance with a single set of tables for each shard. All metadata for users on that shard is stored within that shared database.

      We do, however, keep a completely separate Lucene text search index per registered user.

  9. Jeremy Zawodny, Author of High Performance MySQL (http://shop.oreilly.com/product/9780596003067.do), who works at Craigslist mentions this at 4:40 of the following video:
    http://www.10gen.com/presentations/mongosv2010/craigslist

  10. My comment was in response to Paul Tiseo’s above. For some reason it did not go under his.

  11. Do you rebuild Search Index (Lucene) after every update.

    What tool you used to enable shards. Customs scripts (Lua etc) or some thing like SpockProxy

    • We don’t rebuild the full Lucene index on every change, but we do flush it to disk and ensure that new Lucene IndexReaders use the new contents so that searches will find the correct results. For more information on our usage of Lucene:
      http://blog.evernote.com/tech/2011/08/25/lucene-explainin/

      We partition users during account registration … so the load balancer is configured to send new users to the new shards. Once you are assigned to a shard, all of your future data will be on that shard. When a shard becomes “full”, we deploy new shards and change the load balancer to point to those for new registrations.

  12. You said you will spread users and their data to different shards, and therefore when the shard becomes “full”, you will deploy new shards and change the load balancer to point to those for new registrations.

    What if the users within one shard have a lot of notes, and those notes are all big data. so that specific shard will becomes full, how do you tackle this problem? put those users’ some data to another shard? or just put some exist users and their data to another new shard?

    thanks

    • Currently, we just add more disks to the shards. Each shard is 4RU with 24 hot-swap drive bays, but they only start with 16 drives, so we have room to add 8 more 3TB disks in each.

      In the future, we’re going to switch to a network-based file storage mechanism for resource files, so we won’t have to juggle as many disks in the primary shards.

  13. Very interesting post. I like the balanced, pragmatic point of view. I grow so tired of techie diatribes on the latest fad!

  14. How do you do aggregated queries across all shards (e.g. for reporting?) In other words how do you handle the cases where the users are not actually isolated?

    • For reporting, we track relevant activity (“created a note”, “logged in”) using one flat “event log” file on each server each day. These files are pushed into a reporting server that does loads them into a single database where we can perform aggregate queries to make pretty graphs and charts for Phil.

      On rare occasions, we’ll want to do an ad hoc aggregate query that touches all of the actual shard databases to find out some information that wasn’t in the event logs. E.g. something like “how many notes contain images that are more than 5MB” or something like that. For that, we have a simple script that runs the query in parallel on the different shards and then just displays the individual and aggregate results.

      But neither of those are used for any “real time” operations for end users … they’re back-end activities for our own internal metrics which don’t need to finish quickly for any end-user UI operation.

  15. I still want to know Why Evernote running your own hardware instead of hosting Evernote in [cloud service provider Y].
    As we know, Instagram cloud keep they server very well when they have only ten peoples by use Amazon cloud service.

    • The photo sharing app you mention is a part of the company that owns some of the largest data centers in the world. They manage many of their own cloud services in house.

      Our in house ops team manages our cloud service and we have full control of the setup.

      Hope this helps.

  16. An interesting discussion is definitely worth comment. I do believe that you ought to write more about this topic, it may not be a taboo subject but typically people do not speak about these issues. To the next! All the best!!


Leave a Comment

* Required fields