Evernote stores hundreds of terabytes of online data in order to preserve our users’ memories. Over the last four years, 36.8 million people have created Evernote accounts, and together have uploaded more than 1.2 billion notes with more than 2 billion attachments.
The online application has generated a continuous stream of structured logs, which record high-level activity performed via our APIs from clients and web interfaces. These logs have grown with time and usage so that now we’re recording nearly 200 million events per day, with more than 66 billion events since our launch in 2008.
This is a different sort of Big Data™ problem than our main application. Evernote’s core service stores records and files for fast individual retrieval, clustered by user. Analyzing historic log data requires tools to do large-scale aggregate processing and calculation across a variety of independent variables.
Analytics, February 2008 – July 2012
For Evernote’s first 4.5 years, our historic data warehouse for analytics and reporting was a fat MySQL server. This box stored an InnoDB database on a single huge RAID10 array, and was located in our main data center with our online servers.
Every night, we’d export an incremental dump of a few key reference tables from our runtime “UserStore” database and then import those into parallel tables on the Reporting database. We’d also parse the structured event logs from our application (shard) servers and load them into a huge “events” table, organized into a star schema and partitioned by week. Our nightly scripts would churn for 9-18 hours to load these events and transform them into a cascade of aggregate tables that allowed us to produce a set of hand-tuned SQL reports piped as raw ASCII into the ‘mail’ command for distribution:
At the beginning of this year, we realized that this approach wasn’t suitable for our volume of data or analysis needs. Our primary table had more than 40 billion rows, and was impractical for accessing more than a few days of data at a time. The various derived tables were slow and difficult to maintain, and didn’t allow us to ask new questions without days of SQL trickery by a small number of senior staff.
MySQL is a great database for individual record storage, and a good database for performing analysis over small- to medium-sized data sets, but it’s not designed for flexible analytics over terabytes of data.
Analytics, August 2012 -
This spring, we decided it was time to recruit a team of professionals and build a true data warehouse that would meet our needs for the next few years. We wanted something that could efficiently store the full history of the company, could generate dozens of standard daily reports, and be suitable for ad hoc analytic queries by a few dozen people within the company. And we didn’t particularly want to spend the millions of dollars that would be required for most Enterprise Grade Big Data Analytics Solutions.
We decided to split the analytics environment into three tiers: a ten-node Hadoop cluster for warehousing and cleaning the primary data, a three-node ParAccel columnar database for queries over sets of derived tables, and a JasperReports server for generating periodic reports full of pretty graphs and tables.
The new analytics stack is housed in a completely separate network environment with no connections back to our online servers in order to provide stronger security compartmentalization. The daily online data is securely pushed into the reporting environment through a one-way network connection.
The first stop for the data is Hadoop. Our cluster includes six Data Nodes with eight 500GB drives, for a total of 24TB of raw storage across the cluster. Two 8-core processors and 64GB of RAM each allow us to comfortably run 132 Map/Reduce tasks across the cluster with more than 2GB of RAM for each task.
We run a single Job Tracker, a pair of boxes that replicate a single Name Node VM using DRBD for redundancy, and one client node where we run Hive, Hue, etc. We’re running the CDH4 distribution on Debian Stable with configuration management via Puppet.
Our Hadoop cluster is used entirely through the Hive abstraction layer. This provides a SQL-ish language for performing queries, which is handy, but it also provides optimized HDFS organization and file layouts and for transformed data.
Our logs are delimited text files with one line per event, and one file per shard per day. Each gzip’ed log file is placed directly into HDFS, using Hive to carefully insert each file into a specific “partition” in a virtual events_staging table. That table was created with a set of real columns that correspond to the delimited fields in the log files, and partition columns that represent meta-information about each log file.
LOAD DATA LOCAL INPATH '$f' INTO TABLE events_staging PARTITION ( time_zone='$ZONE', date_local='$DATE_LOCAL', shard='$SHARD');
The name of the shard that produced the log file is represented as a partition column, as is the time zone of the server that produced the logs (since the log timestamps are not in absolute time for historic reasons). This allows us to query and retrieve those virtual columns from Hive SQL.
We take advantage of those virtual partition columns when we SELECT the day’s staged log file “rows” and INSERT them into a daily partition within a native Hive events table. This table cleans up and extends the data from the raw log files. Event times are normalized to UTC (using the virtual time_zone column on each row). Country codes are added based on a lookup against MaxMind’s GeoIP database with the help of JoinTheGrid’s helpful Hive extension. Our own client identifiers are cleaned and normalized, and the end result is an optimized table that we occasionally query when we need to answer very specific questions based on low-level data.
For example, we can perform a Hive query to search across the 1,183,972,870 events that occurred in the first week of the month to determine which versions of Evernote Windows had the most users in Japan:
hive> SELECT client_version, count(distinct global_user_id) AS users FROM events WHERE client='evernote mac' AND country_code='JP' AND date_utc >= '2012-12-01' AND date_utc < '2012-12-08' GROUP BY client_version ORDER BY users DESC LIMIT 3; Hadoop job information for Stage-1: number of mappers: 242; number of reducers: 64 ... Total MapReduce CPU Time Spent: 55 minutes 28 seconds 440 msec OK 400524 50403 400591 21764 300201 13345 Time taken: 99.23 seconds
More often, we want to work with a higher-level representation of activity. For that, we condense the low-level daily events into a higher-level user_client_daily_activity table with one summary row for each active client:user:day tuple. This table is updated in Hive daily, and then cloned into ParAccel through a scripted incremental export/import.
Hadoop is great for cheaply storing a ton of data and performing parallel batch processing jobs in minutes instead of hours (or days). When we need to go all the way back to the low-level data, it’s great to have a brute-force environment that lets us do that. But it’s not particularly quick for more complicated analyses that combine multiple different sets of data.
ParAccel is one of a category of databases that are tuned for aggregate data analysis through optimizations like column-oriented storage, parallelization (MPP), and/or specialized hardware integration. Comparable products are available from HP (Vertica), Sybase (IQ), EMC (GreenPlum), Teradata, etc. (ParAccel’s also the engine behind Amazon’s very interesting new Redshift service.)
Our ParAccel deployment is a three-node cluster of SuperMicro boxes, each with dual L5630 quad-core processors, 192GB of RAM, 10Gbps networking, and a RAID5 of our favorite SSDs. No Debian support, alas, so these were manually provisioned and configured with RedHat instead of using our Automatic Memory Machine.
We load the user_client_daily_activity from Hive into ParAccel every night, along with the reference tables from our online UserStore database. These are transformed into a cascade of derived tables that pre-slice the information into optimal representations for common reports. For example, the sum_client_actives and sum_country_actives tables contain just one row per client/country each day with a sum of the daily, weekly, and monthly active users as of that date.
This database and its tables are tuned for quick aggregate data, so we’re able to answer many types of questions much faster than we could with a brute-force crawl in Hive. For example, it takes three seconds to see which versions of Evernote Windows were the most widely used in Germany during the first week of this month:
bi=# SELECT client_version, count(distinct global_user_id) AS users FROM user_client_daily_activity WHERE client='evernote windows' AND country_code='DE' AND date_utc >= '2012-12-01' AND date_utc < '2012-12-08' GROUP BY client_version ORDER BY users DESC limit 3; client_version | users ----------------+------- 267980 | 31481 268178 | 15566 267864 | 5927 (3 rows) Time: 2822.294 ms
The last stop in the data pipeline a single box running Jaspersoft’s JasperReports Server. This is an open source product with commercial training and support available. Our Jaspersoft server pulls data from ParAccel and generate dozens of reports every day in a variety of formats (HTML, PDF, XLS, CSV, etc.). For example, one report shows the breakdown of account registrations by country in the last 30 days:
The daily reports are saved to an internal web server so they can be accessed without special software, and reports from previous days are archived for future access and comparison.
Lessons and Tips
Overall, the new infrastructure has met our goals. We can load and transform hundreds of millions of records in two hours instead of 10+, we’re generating far more (and far better) reports, and we can safely perform much more complex analyses of user trends than we could before. The road had a few bumps, so we’ve learned a few lessons.
Hadoop is under such active development that it’s hard to figure out exactly which of the dozens of tools and tricks will work without just stumbling along and failing frequently. We hit many compatibility and performance dead ends trying to use things like MRv2 (Yarn), Hive-HBase integration, RCFile storage, dynamic partitioning, etc. If someone wanted to deploy their own Hadoop cluster for a project like ours, I’d recommend sticking with 2-year old technologies. I.e. a cluster with MRv1, basic Hive with Sequencefile storage, a flat tablespace (no schemas), static partitioning, and gzip compression on the input files if you want to save space. Make sure to buy enough RAM for your number of cores because anything interesting you want to do with Hive may need a few GB per task.
Or, of course, make that someone else’s problem by using something hosted like Amazon’s EMR.
ParAccel: The tricks for getting this type of database to perform well are a little different than the older rows-and-indices style of relational database. Just because a database uses COLUMNS(!) doesn’t mean that poorly designed outer joins on gigantic unsorted tables will become instantaneous. You can create tables and start playing around pretty easily, but serious usage requires some quality time with the documentation and training materials to understand the implications of “distribution keys”, “sort keys”, “vacuum”, etc.
JasperReports: For the money, this software is very good for producing attractive, static reports from a variety of databases (including Hive!). We originally planned to also use this to offer a few interactive dashboards to the company, and didn’t have much luck doing anything significant without performance issues. So we’ll be looking for a good interactive dashboarding solution, hopefully avoiding the per-viewer license fees and Windows hosting requirements that seem to be the industry norm.
Team: As part of all of this transition, we stopped doing reporting via manual SQL typed in by company executives, and hired a real team of reporting/analytics experts to do it right. Damon, Adam, and Maggie are looking for one more teammate who can be in charge of all of our analysis and deep data diving. If you know someone who’d be a good fit, send them to the listing on our careers page.