Elephant Factory

Posted by Phil Jensen on 19 Dec 2012

Matt’s “Automatic Memory Machine” post described the installation process that we use to deploy servers at Evernote. I’d like to talk a little bit about our configuration and software deployment processes.

Service Post-Installation

Once servers have been installed at our datacenter and are PXE-booted and prepared for use, a service-specific post-installer executes automatically. The post-installer contains everything needed to bootstrap a physical host or virtual machine on our network so that it can complete a Puppet run successfully, even on hosts with multiple network interfaces configured with Linux bonding.

An example post-installer would execute instructions similar to these (a rough sketch of the storage steps follows the list):

  • Set up network interfaces (we use 802.3ad extensively for external-facing interfaces, and Linux bonding mode=0 for per-packet round-robin load balancing between the hosts that make up our shards)
  • Construct Linux LVM logical volumes for Xen virtual machine use
  • Automatically construct DRBD metadevices for fault-tolerance and redundancy across multiple physical hosts
  • Expand and configure virtual machines on hypervisors
  • Ensure a complete initial run of Puppet on hypervisors and virtual machines
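
As a loose illustration, the storage steps above might be scripted along the following lines. This is only a sketch: the volume group, volume name, size, and DRBD resource name are hypothetical placeholders, not our actual configuration.

# Sketch of post-installer storage steps; all device, size, and
# resource names here are hypothetical placeholders.
import subprocess

def run(cmd):
    '''Run a shell command, halting the install if any step fails.'''
    print('+ ' + cmd)
    subprocess.check_call(cmd, shell=True)

def build_vm_storage():
    # Carve an LVM logical volume out of the hypervisor's volume group
    # for use as a Xen virtual machine disk
    run('lvcreate -L 100G -n vm-shard1-disk vg0')
    # Create and bring up the DRBD metadevice that replicates the
    # volume to a peer physical host
    run('drbdadm create-md r-shard1')
    run('drbdadm up r-shard1')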

We use Puppet extensively for user and configuration management at Evernote. Once our service-specific post-installer completes, a Puppet run applies a catalog containing the list of instructions for each type of machine. We use a conventional file-based node classifier (nodes.pp).

For example, a typical Notestore virtual machine uses a few Puppet modules like this:

node /anexamplenode/ {
  include core::debserver     # standard packages and base configuration
  include notestore_v2::dom0  # service-specific module for this machine type
  include ntp::client         # NTP time synchronization
  include ipmi                # IPMI event logging
}

We keep a standard set of packages and additional configuration in our core module, along with standard services like an NTP client, IPMI event logging, and our monitoring system. Finally, we have a service-specific module containing the files, packages, and services that comprise each service.

We run a Puppet master in each logical environment to provide services to the Puppet agents running on each host. We’ve selected Unicorn behind Nginx for improved performance and scalability, and we run each Puppet master on an appropriately sized virtual machine.

Service Deployment

Once the post-installer execution completes (including the initial Puppet run), we perform validation on each host before promoting it to production. As part of our deployment, validation, and day-to-day operations, we make use of Fabric (https://github.com/fabric/fabric). From the readme: “Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.”

The Operations team maintains a Fabric library containing small functions, called atoms, that are used to perform various tasks. An example would be the deployment of the Evernote web service code to the production systems:

from fabric.api import sudo

def updateAndUpgradeEnweb():
    '''
    Grab the latest package lists & update en-web.
    '''
    sudo('apt-get update')             # refresh package lists
    sudo('apt-get -y install en-web')  # install the latest en-web package

We can use a small atom like this to update a single host (or, just as easily, the entire fleet) and deploy the latest version of the web services code. Fabric allows you to specify and maintain host lists, and you can run each atom sequentially or in parallel (later versions of Fabric include this functionality built in). We wrap Fabric with GNU Parallel for large-scale atoms like service restarts and service updates.
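
To illustrate host lists and parallel execution, here is a minimal sketch using the Fabric 1.x API; the role name and hostnames are hypothetical.

# fabfile.py sketch; the role name and hostnames are hypothetical.
from fabric.api import env, parallel, roles, sudo

env.roledefs = {
    'enweb': ['shard1.example.com', 'shard2.example.com'],
}

@roles('enweb')
@parallel  # remove this decorator to run hosts one at a time
def updateAndUpgradeEnweb():
    '''Grab the latest package lists & update en-web.'''
    sudo('apt-get update')
    sudo('apt-get -y install en-web')

Invoked as “fab updateAndUpgradeEnweb”, the atom then runs against every host in the role concurrently.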

A question that has been asked before is: why use Fabric for service updates when Puppet is quite capable of handling package deployments? We prefer Fabric for deployments because it gives us finer-grained control over timing, providing greater certainty about when actions will start and complete. This helps reduce the duration of our service maintenance window. Furthermore, we usually do not have hours to stage a release ahead of the weekly maintenance window, so Fabric allows us to execute the deployment as soon as we get QA sign-off (versus waiting for all systems to stage the release). Our production shards rely on services provided by our central UserStore database, and a schema update on the UserStore may require restarting all production shard Tomcat instances (or conducting a carefully orchestrated service shutdown / schema update / service update / restart; a rough sketch of that sequence follows). Using a direct SSH-based tool such as Fabric results in less overhead.
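
As a loose illustration, that orchestration might be expressed as a sequence of Fabric atoms like these; the role names, hostnames, and schema-update command are hypothetical placeholders.

from fabric.api import env, roles, sudo

env.roledefs = {
    'shards':    ['shard1.example.com', 'shard2.example.com'],
    'userstore': ['userstore.example.com'],
}

@roles('shards')
def stopTomcat():
    sudo('service tomcat stop')

@roles('userstore')
def applySchemaUpdate():
    # Hypothetical wrapper around the UserStore schema migration
    sudo('/usr/local/bin/apply-schema-update')

@roles('shards')
def upgradeAndStartTomcat():
    sudo('apt-get update')
    sudo('apt-get -y install en-web')
    sudo('service tomcat start')

Running the atoms in order (“fab stopTomcat applySchemaUpdate upgradeAndStartTomcat”) keeps the shutdown, migration, and restart tightly sequenced.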

We’ve wrapped our most common Fabric atoms into a menu (sketched below) which all Operations staff can use to perform service maintenance, helping to improve efficiency and reduce the rate of errors.
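
The wrapper might look something like the following sketch (Python 2, as Fabric 1.x requires); the atom names and menu entries are hypothetical stand-ins for our actual library.

from fabric.api import execute
import fabfile  # the Operations Fabric library defining the atoms

MENU = [
    ('Update en-web across the fleet', fabfile.updateAndUpgradeEnweb),
    ('Restart shard Tomcat instances', fabfile.restartTomcat),  # hypothetical atom
]

def main():
    for i, (label, _) in enumerate(MENU):
        print('%d) %s' % (i + 1, label))
    choice = int(raw_input('Select a task: ')) - 1
    execute(MENU[choice][1])  # run the chosen atom against its configured roles

if __name__ == '__main__':
    main()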

Summary

The combination of a customized post-installer and the use of Puppet and Fabric has enabled us to manage our systems effectively, and we are continually working to improve the efficiency of our environment. Does this kind of work interest you? If so, we are actively looking for Operations Engineers to help provide the best stability and performance possible; deployment and configuration management is one of many areas we are developing in Operations at Evernote.

Comments

  • Jacob Parks

    Are there significant advantages to compartmentalizing the install process instead of using something like OpenStack that has an imaging service? If I’m reading the steps correctly, each machine is built from the ground up instead of using a universal base image.

    Thanks!

    • Phil Jensen

      Hi Jacob,
      Our install process has to encompass both hypervisors and virtual machines. The built-in image service (Glance) used with OpenStack is featureful, but doesn’t address some of our image-build requirements (such as dynamically and automatically deploying high-availability virtual machines with DRBD). With the exception of a single VLAN change (from an Embryo-style VLAN to Production), we are approaching zero-touch for our initial shard deployments, including auto-assembly of DRBD metadevices, deployment of custom base Xen hypervisors, and virtual machine instantiation.
