Evernote Tech Blog

The Care and Feeding of Elephants

Graphite at Evernote

Like many Operations teams, we love metrics here at Evernote, and over the years we’ve used a lot of tools in this space. Often we find products that work but aren’t the best fit for our needs. Not too long ago, we discarded all of our server and network performance trending products and replaced them with a custom implementation of graphite and collectd. Out of the box, these tools also needed some custom work to meet our objectives, but together they offered a good foundation to build upon. While we’re not completely finished with our enhancements, we’d like to tell you about some of the changes.

Although we have additional components in the mix for monitoring and alerting at Evernote, today I’m going to write specifically about graphite and collectd, which provide the bulk of our performance trending data for servers and network devices.

[Figure: graphite-data-flow]

The diagram depicts the logical flow of data from collectd running on all of our OS instances to a cluster of graphite servers. We have multiple servers acting as a unified graphite cluster, with graphite relay software in front of the carbon cache instances. The relay acts as a load balancer to distribute metrics across the nodes. A graphite plugin for collectd facilitates this data feed.
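To make the data flow a little more concrete, here is a minimal Python sketch of the two ideas involved. Carbon's plaintext protocol accepts one sample per line in the form `<metric path> <value> <timestamp>`, and a relay has to pick a cache node for each metric path. The metric and node names are hypothetical, and the hash-and-modulo selection is a simplified stand-in for illustration only; it is not carbon-relay's actual routing algorithm.

```python
import hashlib
from typing import Sequence


def format_metric(path: str, value: float, timestamp: int) -> str:
    """Render one sample as a Carbon plaintext line: '<path> <value> <timestamp>\\n'."""
    return f"{path} {value} {timestamp}\n"


def pick_node(path: str, nodes: Sequence[str]) -> str:
    """Choose a cache node for a metric path.

    Simplified illustration: hash the path and take it modulo the node
    count, so the same metric always lands on the same node.
    """
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


if __name__ == "__main__":
    nodes = ["cache-a:2004", "cache-b:2004", "cache-c:2004"]  # hypothetical nodes
    path = "collectd.web01.load.shortterm"                    # hypothetical metric
    print(format_metric(path, 0.42, 1400000000), end="")
    print("routed to", pick_node(path, nodes))
```

The important property is that routing depends only on the metric path, so every sample for a given metric is written by the same carbon cache.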

We have an additional collectd cluster that’s not depicted. It acts as a set of SNMP pollers for all of our network devices. The data flow is very similar, except that the collectd instances in this cluster are dedicated to polling.

With servers and network devices combined, we collect a little under 800,000 metrics across our production environment every few minutes into this solution.

We also use collectd’s notifications and thresholds. This allows us to set parameters for metrics to detect when there’s an unexpected deviation. With the use of this feature and a very small custom plugin, we’re able to send alerts into our Nagios cluster as events occur from collectd. Nagios, being one of our primary alerting mechanisms, can then contact the appropriate support personnel and manage the alert.
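As a rough illustration of what such a bridge could look like, the Python sketch below turns a collectd-style notification into a Nagios passive service check result. This is an assumption about the mechanism, not a description of our plugin: it uses the standard Nagios external command `PROCESS_SERVICE_CHECK_RESULT`, and the function name, host, and service names are hypothetical.

```python
# collectd notification severities mapped to Nagios return codes
# (0 = OK, 1 = WARNING, 2 = CRITICAL).
SEVERITY_TO_NAGIOS = {"OKAY": 0, "WARNING": 1, "FAILURE": 2}


def to_passive_check(host: str, service: str, severity: str,
                     message: str, timestamp: int) -> str:
    """Build a Nagios external command line for a passive service check.

    Unknown severities map to 3 (UNKNOWN). Newlines in the message are
    flattened because an external command must fit on a single line.
    """
    code = SEVERITY_TO_NAGIOS.get(severity, 3)
    output = message.replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{output}")


if __name__ == "__main__":
    # Hypothetical notification values.
    print(to_passive_check("web01", "load", "FAILURE",
                           "load average above threshold", 1400000000))
```

A line like this would typically be written to the Nagios external command file or submitted through a passive check gateway; which of those applies depends on the Nagios setup.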

Customization

We enhanced collectd with some plugin changes that provide additional metrics, but the more interesting customization we’ve done is with graphite. A fresh install of graphite can provide static graphs based on available metrics in the Dashboard, along with a reasonable method of exploring metrics and dynamically creating graphs in the Composer. This is great, but because we’re primarily interested in OS and network device metrics, we would have needed thousands of individual dashboards using graphite’s default capabilities. This isn’t a practical solution. After looking at some of the third-party graphite charting options available at the time, we decided to enhance graphite’s web interface rather than replace it.

As a result, we added templated dashboards to graphite in two forms. The first, seen below, shows how we select a template, ‘Linux Server’ in this example, and an appropriate device. The dashboard is then dynamically updated to present data just for that device. Without this capability, that single dashboard we created for ‘Linux Servers’ would instead be thousands of dashboards – one for each of our devices.

[Figure: graphite-templated]

The next variation of this enhancement adds support for applying multiple devices to the same templated dashboard. This allows us to arbitrarily combine devices and view their common metrics. Among other advantages, this technique lets us review variances between systems that should have similar performance profiles.
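One way to picture the multi-device case is with graphite's path expressions, which accept a brace list such as `{web01,web02}` to match several nodes in one target. The Python sketch below is only an assumption about how several devices could be applied to one templated target; it is not taken from our implementation. It reuses the `{{id}}` template variable described later in this post, and the function and metric names are hypothetical.

```python
from typing import Sequence


def device_selector(devices: Sequence[str]) -> str:
    """Return a graphite path fragment matching one or more devices."""
    if not devices:
        raise ValueError("at least one device is required")
    if len(devices) == 1:
        return devices[0]
    # Brace list syntax: {a,b,c} matches any of the listed node names.
    return "{" + ",".join(devices) + "}"


def apply_devices(template_target: str, devices: Sequence[str]) -> str:
    """Substitute the {{id}} placeholder with a selector for the devices."""
    return template_target.replace("{{id}}", device_selector(devices))


if __name__ == "__main__":
    # Hypothetical template target and device names.
    print(apply_devices("collectd.{{id}}.load.shortterm", ["web01", "web02"]))
```

With a target like this, a single graph draws one series per selected device, which is what makes side-by-side comparison of similar systems straightforward.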

[Figure: graphite-multi-item-templated]

[Figure: graphite-template-editor]

We retained the support for static dashboards and present the classic graphite Finder interface for that under the ‘Dashboards’ menu. As for the templated dashboards, they’re very easy to maintain with a simple variable substitution system. Any component within the dashboard can include a variable as needed. We have one special variable, {{id}}, that is dynamically updated based upon the currently selected device.
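The substitution idea can be sketched in a few lines of Python. This is a minimal illustration of `{{name}}` style variable replacement, not the code we run: the `render` function, the extra `interval` variable, and the metric names are hypothetical, and only `{{id}}` comes from the description above.

```python
import re
from typing import Mapping

# Matches placeholders such as {{id}} or {{interval}}.
_VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def render(text: str, variables: Mapping[str, str]) -> str:
    """Replace each {{name}} with its value; leave unknown names untouched."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return _VARIABLE.sub(substitute, text)


if __name__ == "__main__":
    # Hypothetical graph definition inside a templated dashboard.
    graph = {
        "title": "Load on {{id}}",
        "target": "collectd.{{id}}.load.shortterm",
    }
    selected = {"id": "web01"}  # set from the currently selected device
    print({key: render(value, selected) for key, value in graph.items()})
```

Leaving unknown placeholders in place, rather than blanking them, makes a misspelled variable easy to spot in the rendered dashboard.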

A typical dashboard for us has more than a dozen graphs, but templating has drastically reduced the total number of dashboards we would otherwise have needed to create.

Working with graphite and collectd has produced excellent results for Evernote. There are many plugins for collectd, and graphite is flexible enough to customize to your organization’s needs. As we continue to enhance graphite for our environment, we intend to release our additions to the community in the future.

Update: We released the source on github. See this blog post for more information.

11 Comments

  1. Maybe you could give your graphite customizations to the open source community?

    • Yes, definitely! I hinted at that at the end of the article. We intend to share our enhancements for those that would like to use them.

  2. Can you elaborate on the graphite relay?

    There are a half dozen projects trying to replace carbon relay at the moment, and all of them are about 5/8ths baked.

    Also, nice job on the UI. It’s a vast improvement.

    • Thanks! We’re still using the original carbon-relay. I’ve thought about rewriting it or trying some of those alternatives as it can become bottlenecked on a single CPU. However, with our collectd update interval being a few minutes, we have a little bit of runway left before I need to do that. I think it might be a small, fun project to do in Go :)

      • You should consider using pypy to run your relay. It may require a bit of massaging to get it to work, but the benefits are worth it.

  3. Are the changes to carbon necessary or can I just use your version of the graphite webapp on top of a non-forked carbon?

    • There are no changes to carbon, just graphite-web. For more details see the blog post for the open source release.

  4. Hey, the templates stuff looks fantastic — are you considering contributing this back to the graphite project?

    • Thanks! The version we use is available on github. The way we use graphite might be a bit too specific for general use. For example, we make assumptions about metric naming and our enhancements are very ‘device’ specific. However, by releasing all of the code we hope it is useful for others. If some of the changes or just the idea of templated dashboards make their way into the official graphite, that would be great.

  5. I’m stuck, is there any documentation for an example template? I can’t load any devices as it tells me to select a template first, but I can’t figure out how to create a template. I’m trying to use your graphite-webapp alongside an existing graphite installation if that helps at all – I can see the devices but can’t select them without a template.

    Thanks!

  6. If you ever want to try an alternative dashboard, I can recommend the new Grafana dashboard. I tried it recently. Looks great and has very advanced dashboard editing and graph editing features. http://grafana.org

