We recently launched the next generation of our content management platform for MLS; we call it MP7 (the old version has been retrospectively named MP6). The core of the CMS is built with Drupal 7 and is tightly integrated with our MatchCenter and REST-based API, both of which are built on Node.js. We had a long list of improvements for MP7 up and down the entire stack. On the infrastructure level, the two most technically disruptive improvements we tackled were Autoscaling and creating an active-active architecture across multiple regions (data centers) in Amazon Web Services (AWS). Both of these improvements required us to radically deviate from the standard Drupal architecture used for high traffic websites.


Autoscaling

The biggest advantage of moving to cloud-based infrastructure is easier scaling. We wanted to add and remove capacity as our needs change and, where possible, to have this happen without our direct intervention. AWS makes this almost trivially easy to set up, if your application can support it.

Multi Region

Our previous platform, MP6, is primarily hosted on physical servers in a single data center. We have a redundant architecture within the data center, but we have had multiple outages caused by the data center itself losing connectivity, often because another customer was being hit by a DDoS attack.

By hosting our applications in multiple AWS regions, we mitigate the risks associated with a single location (disaster recovery) and give ourselves some new deployment capabilities. We felt strongly that both data centers should be active since there is no better way to ensure your “backup datacenter” works than to actually serve live traffic with it. Plus, in theory, our fans see an improvement in response times through the reduced latency when connecting to a geographically closer server.

Anonymous Traffic

Before we go any further, we should point out that we are lucky: the only traffic we need to scale horizontally (by adding servers, as opposed to scaling vertically with bigger servers) is anonymous traffic. The only users of MP7 that need to log in to the CMS are content creators (writers and editors) at the league and clubs. The number of content creators is fairly constant over time and is small enough that we can easily vertically scale the “admin” servers that are dedicated to handling this traffic.

In a nutshell, many of the solutions we discuss in this post only apply to our public-facing experience and may not translate well (or at all) to sites that need to scale for logged-in Drupal users.

Autoscaling Process

Making it feasible to autoscale Drupal took some serious effort. Autoscaling an application reorients how you view your servers. Instead of lovingly tending to a pool of web servers that each have a name according to some convention (webuw205prod), you let AWS create and summarily terminate servers at will. You may never even know their names. That’s good! We should be treating our servers as “cattle”, not “pets”.

We had to tackle several interdependent issues to get Drupal to accommodate this kind of environment.

Deploying Code

The first issue we encountered was simply how to deploy our application code. In a “traditional” architecture, you push code to each running server, hopefully with some degree of automation such as a deploy script, a continuous integration server like Jenkins, or something similar. In an Autoscaling architecture, we need to handle code deployments for two scenarios: new, anonymous, VMs coming online, and updating old code on the anonymous VMs that are already running.

One way of handling both scenarios is to “bake” the code deployment process into the image used for creating new VMs during an Autoscaling event. This way, a newly created server will grab the code from somewhere (like Amazon S3, Amazon’s durable, performant, and distributed key-value store), run any deployment steps required, and begin serving traffic. When a new build is deployed, all the VMs running the previous build are terminated and Autoscaling is used to create new VMs with the new application code. This is the approach used by Amazon’s own tools, such as Elastic Beanstalk.

We prefer to have tighter control over our deployment process. We often want to deploy one region at a time or run a migration on a single server before pushing the code to the whole pool. Since we were already using SaltStack to manage our infrastructure, it made sense to use it for automating our deployment process.

We leveraged a feature of SaltStack called Syndic to create dedicated master servers that live inside the VPC on the same private network as our Drupal VMs. The Syndics are set to automatically accept any request from a minion to join their group. This setting opens up a potential security hole and should only be turned on in a trusted environment. We use SaltStack’s GitFS backend to store configuration information in GitHub and keep the master of masters and the Syndics all in sync with each other. Each Syndic also keeps its own local mirror of the current States and Pillars in case GitHub has an outage.

Our Amazon Machine Image (AMI)* automatically connects to the local Syndic on boot and self-issues a highstate run. The Syndic sends the correct States and Pillars based on the grains set in the image’s minion configuration.

To deploy code, we first promote a build from our testing environment to S3 using a Jenkins job. We then run a command on the SaltStack master that tells each running VM to fetch the new build from S3 and re-deploy itself. Subsequently, if a new VM is created by an autoscaling event, it will also fetch the new build from S3 when it joins its Syndic collection.

* A slightly off-topic footnote: we build our AMIs as instance store-backed images in order to avoid having to create EBS volumes during Autoscale events. This makes us slightly more resilient to EC2’s historical tendency to have outages related to EBS.

Centralized Caching

The cornerstone of any large scale Drupal deployment is caching. The standard approach is to use some mixture of a content distribution network (CDN), Varnish, and Memcached. The CDN and Varnish both operate above Drupal in the traffic pipeline and don’t require too much in terms of special dispensations in order to work with Drupal. Memcached, however, replaces Drupal’s internal caching system which is normally handled in dedicated MySQL tables.

Did you catch that? The cache is normally stored in the database. This means that a cluster of Drupal servers all expect to have the same exact representation of the current state of the cache. And. Drupal. caches. everything.

So what happens when the title of an article is changed? Drupal intelligently expires the corresponding cache keys for that node. And any corresponding blocks. And any corresponding pages. And any corresponding views. At least in theory.

You should use Memcached to replace the database cache. It is an easy change and it greatly reduces database load. Drupal expects that each server is reading and writing to the same cache, so you need to make sure that each server in the autoscaling group is configured to use the same Memcached instances. We use Amazon’s ElastiCache service for this purpose. One more piece of infrastructure we don’t have to maintain!

In order to make things truly elastic, you need the ability to add new cache nodes on the fly. AWS provides a nifty Auto Discovery client for PHP, and it even works as a drop-in (mostly) replacement for PHP’s PECL Memcache package.

There is just one catch. The AWS ElastiCache client manages its own hash ring via consistent hashing. Drupal does this as well. To use Auto Discovery, you configure Drupal to only see the Auto Discovery cache node. The ElastiCache client then does the rest. This works fine under normal Drupal read operations, as the consistent hashing is transparent to Drupal. ElastiCache simply chooses the correct node.

As soon as Drupal nodes or entities are updated, however, everything falls apart. Drupal tries to flush the necessary cache keys, but the ElastiCache client either hashes the keys differently for the necessary flush commands or has some other incompatibility with Drupal’s cache management process. AWS has not released the ElastiCache client source code, so we are just guessing. Either way, the result is chaos! Or at least obnoxious issues where saved changes to content do not always show up in the CMS without multiple tries.
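To make our guess concrete, here is a minimal sketch of what happens when the client that writes cache entries and the client that flushes them disagree about key placement. It is not the ElastiCache client or Drupal's Memcache integration: the node names, the key format, and the choice of MD5 versus SHA-1 as the two competing hash functions are all assumptions made up for illustration.

```javascript
const crypto = require('crypto');

// Hash a string to an unsigned 32-bit integer using the given digest algorithm.
function hash32(value, algorithm) {
  return crypto.createHash(algorithm).update(value).digest().readUInt32BE(0);
}

// Build a consistent-hash ring: every node owns several points on a 32-bit circle.
function buildRing(nodes, algorithm, pointsPerNode = 50) {
  const ring = [];
  for (const node of nodes) {
    for (let i = 0; i < pointsPerNode; i++) {
      ring.push({ point: hash32(`${node}#${i}`, algorithm), node });
    }
  }
  return ring.sort((a, b) => a.point - b.point);
}

// A key lives on the first ring point at or after its own hash, wrapping around.
function nodeFor(ring, key, algorithm) {
  const h = hash32(key, algorithm);
  return (ring.find((entry) => entry.point >= h) || ring[0]).node;
}

// Hypothetical setup: three cache nodes, two clients that disagree on the hash.
const nodes = ['cache-a', 'cache-b', 'cache-c'];
const writerRing = buildRing(nodes, 'md5');   // the client that stores entries
const flusherRing = buildRing(nodes, 'sha1'); // the client that deletes entries

const stores = new Map(nodes.map((node) => [node, new Set()]));
const keys = Array.from({ length: 200 }, (_, i) => `cache_page:node/${i}`);

let mismatched = 0;
for (const key of keys) {
  const writeNode = nodeFor(writerRing, key, 'md5');
  const flushNode = nodeFor(flusherRing, key, 'sha1');
  stores.get(writeNode).add(key);    // the entry is cached on the writer's node
  stores.get(flushNode).delete(key); // the flush is sent to the flusher's node
  if (writeNode !== flushNode) mismatched++;
}

// Anything still stored is a stale entry that the flush never reached.
const stale = [...stores.values()].reduce((sum, set) => sum + set.size, 0);
console.log(`${stale} of ${keys.length} entries survived the flush`);
```

Reads look healthy in this model because each client is consistent with itself. Only the deletes land on the wrong node, which would match the symptom of edits that need several saves before they appear.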

In order to ship, we had to give up on the elasti in our ElastiCache. For now we just run a single, big, ElastiCache instance. We do use a DNS alias, so we can always update the cache endpoint without redeploying code in case of a failure or the need to resize.

Scaling the Database

The CDN and Memcached handle most of our traffic spikes without ever making a request to the database. There are still scenarios that can cause serious load on the database, such as getting crawled by a less than polite search engine. We also want to make sure we are highly available. In AWS, your application must tolerate Availability Zone (AZ) failures. Amazon themselves don’t even consider it to be an outage unless more than one AZ is having issues:

“Region Unavailable” and “Region Unavailability” mean that more than one Availability Zone in which you are running an instance, within the same Region, is “Unavailable” to you. – Amazon Web Services SLA

We use Amazon’s Relational Database Service (RDS) for our MySQL databases. This has the obvious benefit of relieving our team from having to manage database servers and the corresponding replication configurations. The killer feature for us is RDS cross region read replicas. More on this later.

RDS makes it trivial to set up a master in one AZ and a read replica in another AZ. Drupal will happily accept both a master and a slave database via configuration, but it will then proceed to send very little traffic to the slave. Drupal requires queries to be explicitly marked as “slave safe” in core and contributed modules. Makes sense in theory, but in the real world, we saw less than 10% of queries actually get executed on the replica DB server during load testing. Even worse, if the master fails, Drupal refuses to use the read replica as a failover and instead starts barfing errors.

Fortunately for us, Thomas Gielfeldt has already tackled this issue with his Autoslave module. It is a bit gross (you have to copy extra files into Drupal core), but it works great and the result is a more evenly distributed load across your read replicas, and a proper read-only failover mode when the master is unavailable.
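The behavior we wanted can be sketched in a few lines. This is not Autoslave's code, just an illustration of the routing idea: reads rotate across healthy replicas, writes go to the master, and a failed master leaves the site readable but not writable. The class name, server names, and the `healthy` flag are hypothetical.

```javascript
// Hypothetical router; server names and the health flag are illustrative only.
class ReplicaAwareRouter {
  constructor(master, replicas) {
    this.master = master;     // { name, healthy }
    this.replicas = replicas; // [{ name, healthy }, ...]
    this.next = 0;            // round-robin cursor
  }

  // Treat SELECT and SHOW statements as reads; everything else is a write.
  static isRead(sql) {
    return /^\s*(select|show)\b/i.test(sql);
  }

  route(sql) {
    if (ReplicaAwareRouter.isRead(sql)) {
      const healthy = this.replicas.filter((r) => r.healthy);
      if (healthy.length > 0) {
        const target = healthy[this.next % healthy.length];
        this.next += 1;
        return target.name;
      }
      // No replica available: fall back to the master for reads.
      if (this.master.healthy) return this.master.name;
      throw new Error('no database available');
    }
    // Writes need the master; without it the site runs in read-only mode.
    if (!this.master.healthy) {
      throw new Error('read-only mode: master unavailable');
    }
    return this.master.name;
  }
}

const master = { name: 'master-az1', healthy: true };
const router = new ReplicaAwareRouter(master, [
  { name: 'replica-az2', healthy: true },
  { name: 'replica-az3', healthy: true },
]);

console.log(router.route('SELECT title FROM node WHERE nid = 1')); // replica-az2
console.log(router.route('SELECT title FROM node WHERE nid = 2')); // replica-az3
console.log(router.route("UPDATE node SET title = 'x' WHERE nid = 1")); // master-az1
```

The sketch ignores replication lag. A real driver also has to decide when a read that follows a write must go back to the master.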

One downside is that Autoslave breaks Drush and update.php which makes local development a pain. We run a patched version of Drush on our production servers and use the non-autoslave DB driver for local development to get around these issues.

Shared Static Assets

The last piece of the Drupal Autoscaling puzzle is how to share assets between web servers. Drupal really, really likes to pretend that it is operating on local files. Traditional deployments, like our MP6 platform, usually involve setting up a network attached storage system (NAS) and creating NFS mounts on each web server. This is more complicated in cloud environments, so a solution like GlusterFS is more appropriate. The situation gets tricky when you add in Autoscaling and servers need to discover, add, and remove themselves from the GlusterFS pool. It gets even worse when you start to factor in file synchronization between multiple active data centers.

We elected to bypass this entire web of complexity and instead store all of our static assets in S3. This means that all of our images, CSS, and JavaScript are uploaded directly to S3 and never stored on any individual server. After the initial upload, S3 acts as the origin server for these assets, so requests for them never even reach our web servers, which in turn have fewer requests to answer.

Figure: s3fs dialog, setting the content type to use S3.

Thanks to Drupal 7’s support of PHP stream wrappers, there are several modules that exist to help with this task. We found that Robert Rollins’s S3FS module worked the best. The module sets the upload destination for file fields to use S3. It can also set the system default file system to be S3 as well. We did have to fork the module and make several changes (adding S3 cache-control headers and handling imagecache style derivatives correctly) to make it work. We are in the process of contributing those changes back to the project.

Having absolved ourselves of handling uploaded assets, we next had to figure out how to deal with Drupal’s dynamic CSS and JavaScript aggregation. Drupal 7 dynamically includes the required CSS/JS for a page and creates several aggregated files, which it saves to disk (that is, local storage). It then generates a unique name for each aggregated file and inserts those names into the HTML for the requested page, which it also stores in Drupal’s page cache. As long as all the other web servers in the pool have access to these same files (via NFS or GlusterFS), everything is fine. We did not want to use GlusterFS, so for us, everything was not fine.

Our solution was to minify/uglify the CSS and JavaScript during our build process using Grunt and the cssmin and uglify plugins.

First we added new CSS and JS files to our theme info file:

    stylesheets[all][] = css/css_site.css
    scripts[] = js/js_site.js

Next, we added code to our theme base template to hook into the CSS/JS aggregation pipeline and add the ability to replace the CSS/JS file collection with the new file if the mls_css_prebuilt or mls_js_prebuilt flags are set and the user is not logged in. When our editors and writers are logged in, they need additional styles and scripts, so we instead unload the aggregated files. Developers can toggle the settings flag for their local environments depending on their needs.

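    function mp7_css_alter(&$css) {
      $css['sites/all/themes/mp7/css/css_site.css']['preprocess'] = FALSE;

      if (variable_get('mls_css_prebuilt', FALSE) && !user_is_logged_in()) {
        foreach (array_keys($css) as $stylesheet) {
          if ($stylesheet !== 'sites/all/themes/mp7/css/css_site.css' && $stylesheet !== 0) {
            // Assumed body (lost from this listing): drop every other stylesheet.
            unset($css[$stylesheet]);
          }
        }
      }
      else {
        // Assumed body (lost from this listing): unload the prebuilt file.
        unset($css['sites/all/themes/mp7/css/css_site.css']);
      }
    }

    function mp7_js_alter(&$js) {
      $js['sites/all/themes/mp7/js/js_site.js']['preprocess'] = FALSE;

      if (variable_get('mls_js_prebuilt', FALSE) && !user_is_logged_in()) {
        foreach (array_keys($js) as $javascript) {
          if ($javascript !== 'sites/all/themes/mp7/js/js_site.js'
            && 'http' !== substr($javascript, 0, 4)
            && '//' !== substr($javascript, 0, 2)
            && '.js' === substr($javascript, -3)) {
            // Assumed body (lost from this listing): drop other local scripts.
            unset($js[$javascript]);
          }
        }
      }
      else {
        // Assumed body (lost from this listing): unload the prebuilt file.
        unset($js['sites/all/themes/mp7/js/js_site.js']);
      }
    }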

Finally, we use the following Grunt configuration to create the new aggregated files. Our build server packages the files alongside the rest of the code. This way, we can guarantee that each web server has the full copy of the correct file.

    cssmin: {
      add_banner: {
        options: {
          banner: '/* <%= siteName %> build <%= buildNumber %> */',
          keepSpecialComments: 0
        },
        files: {
          'sites/all/themes/mp7/css/css_site.css': [
            'sites/all/themes/<%= siteName %>/css/<%= siteName %>_main.css'
          ]
        }
      }
    },
    uglify: {
      mp7: {
        options: {
          mangle: true
        },
        files: {
          'sites/all/themes/mp7/js/js_site.js': [
            'sites/all/themes/<%= siteName %>/js/<%= siteName %>_main.js'
          ]
        }
      }
    }


When the page is requested in production, these new aggregated files are pulled through the CDN:

<link type="text/css" rel="stylesheet"
     href="http://www.domain.com/sites/all/themes/mp7/css/css_site.css?n3qbsp" media="all">

At this point, the astute observer might be asking why we threw out Drupal 7’s fancy CSS/JS aggregation engine and basically replaced it with the one-big-file approach of Drupal 6. For the sake of slower mobile connections, we actually prefer to have all the CSS/JS for the site retrieved in one request and then stay cached for the duration of the visit. Every page uses the same file, so the browser won’t need to ask for it again.

Let’s Autoscale Already!

This might be quite a bit to digest, so let’s summarize the overall process to this point:

  1. Use RDS with read replicas for the database.
  2. Use the Autoslave Drupal module to make the read database replicas work.
  3. Use a single ElastiCache node for Memcached.
  4. Create an AMI that automatically deploys the latest code (stored in S3) on boot.
  5. Use SaltStack or a similar configuration management tool to deploy new code to running servers.
  6. Use the S3FS Drupal module to push all uploaded files to S3.
  7. Uglify/Minify CSS and JS during build process to keep them synchronized across web servers.

Do all these things and Drupal will happily scale up and down as part of an AWS Autoscaling group. If this meets your needs, then please avoid the next part of this article as things are about to get needlessly complex.

Multi Region Process

Throughout this post, we have danced around the concept of hosting the same Drupal application in multiple regions. Fortunately, most of the work to make Drupal handle Autoscaling also makes it easier to run in multiple regions. All we have to handle is database replication and remote cache invalidation.

Replicating the Database Across Regions

As discussed above, the ability for RDS to replicate across regions is magical. This single feature saved us from having to connect our two VPCs directly together in any way. Which is great, because connecting VPCs in separate regions, in a highly available configuration, is really hard. There are entire companies selling this service as a product.

Instead, we simply created two read replica RDS instances in our second VPC and let AWS take care of the transport and security of the replication data. These replicas are read-only, of course, so our entire second “active” datacenter can only serve anonymous traffic, which is fine for our use case.

We were then left with the tiny problem of getting Drupal to run on a read-only database (RDS MySQL read replicas run in read only mode). The only way we could find was to make a teeny, tiny, small, insignificant, not bad, totally ok, *ahem* hack *ahem* in core.

Every time you hack core, Dries kills a kitten.
We are so sorry, Dries.

To explain ourselves a bit: as long as you don’t use dblog (use syslog instead), we found that the only real issue with serving anonymous traffic from a Drupal instance backed by a read-only database is that it refuses to stop plastering nasty error messages all over the place. Our little hack lets us configure the Drupal instances in the read-only VPC to hide these errors. The patch is a one-liner:

    diff --git a/includes/database/database.inc b/includes/database/database.inc
    index 604dd4c..9cb3a1b 100644
    --- a/includes/database/database.inc
    +++ b/includes/database/database.inc
    @@ -375,7 +375,7 @@ abstract class DatabaseConnection extends PDO {
           'target' => 'default',
           'fetch' => PDO::FETCH_OBJ,
           'return' => Database::RETURN_STATEMENT,
    -      'throw_exception' => TRUE,
    +      'throw_exception' => !variable_get('read_only_instance', FALSE),
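The effect of that default, as we understand it, is that a failed query hands back nothing instead of raising an error. A minimal sketch of the same idea, with a hypothetical `runQuery` helper and a `readOnlyInstance` flag standing in for the `read_only_instance` variable:

```javascript
// Run a query callback; when throwException is false, failures yield null instead.
function runQuery(execute, { throwException = true } = {}) {
  try {
    return execute();
  } catch (err) {
    if (throwException) throw err;
    return null;
  }
}

// Hypothetical flag standing in for the read_only_instance variable.
const readOnlyInstance = true;

// Simulates a write that a read-only replica rejects.
const failingWrite = () => {
  throw new Error('The MySQL server is running with the --read-only option');
};

const outcome = runQuery(failingWrite, { throwException: !readOnlyInstance });
console.log(outcome); // null
```

The trade-off is that every failed query goes quiet on those instances, not only the writes that the replica rejects, so the flag belongs only on servers that are meant to be read-only.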

Remote Data Center Cache Invalidation

So we have the database replicating into our second region, but how do we coordinate the two caches? As we pointed out earlier, Drupal expects all the servers to have the same view of the cache. This means that a cache invalidation in the primary VPC needs to result in an invalidation in the read-only VPC as well. It is actually even worse than that. The cache invalidation cannot happen until the write that triggered it has completed replicating to the read-only VPC, otherwise the read-only servers would simply re-fetch stale data.

We toyed with the idea of trying to replicate the cache using Couchbase, but that would require us to build the VPC to VPC connection that we are trying to avoid. AWS offers DynamoDB, but it also does not inherently support multiple regions. We eventually arrived at a solution to this problem that resulted in a Drupal module we call Orbital Cache Nuke. We call it OCN for short.

It's the only way to be sure.

The premise of OCN is to use the database replication to execute the cache flushes. This way, we avoid any race conditions between the caches and the RDS replication timing.

It works as follows (primary refers to the admin Drupal instance, read-only refers to any Drupal instance in the read-only VPC):

  1. OCN creates a database table that tracks queued up cache invalidations.
  2. The OCN module code hooks into the cache clear process on the primary server.
  3. When a cache clear command is executed on the primary server, either drupal_flush_all_caches() or cache_clear_all(), OCN writes a row to the queue table in the database.
  4. The queue table updates are replicated to the read-only datacenter by RDS cross region replication.
  5. A cron job on the primary server checks the queue table every minute. If it finds any queued cache clear commands, it posts them to a protected URL on the read-only VPC. This may be received by any web server in the Autoscaling pool.
  6. The read-only server that receives the cache flush command from the primary will check its replicated copy of the cache flush queue table against the array of commands received via HTTP POST.
  7. Commands received via HTTP POST that also exist in the replicated database table are executed in the read-only VPC.
  8. The read-only server responds with the commands that were successfully executed.
  9. The primary server removes any commands that were confirmed. Any non-confirmed commands are retried in the next interval.
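The round trip in steps 5 through 9 can be sketched as follows. This is not the OCN module's code: the in-memory arrays stand in for the queue table on each side of the replication link, a plain function call stands in for the HTTP POST, and the row shape and function names are hypothetical.

```javascript
// In-memory stand-ins for the queue table on each side of the replication link.
const primaryQueue = [
  { id: 1, command: 'cache_clear_all', args: ['node:42', 'cache_page'] },
  { id: 2, command: 'drupal_flush_all_caches', args: [] },
];
// Replication lag: row 2 has not reached the read-only region yet.
const replicatedQueue = [primaryQueue[0]];

const executed = [];

// Read-only side (steps 6 to 8): run only commands that replication has delivered.
function handleFlushPost(postedCommands, replicaRows) {
  const replicatedIds = new Set(replicaRows.map((row) => row.id));
  const confirmed = [];
  for (const cmd of postedCommands) {
    if (replicatedIds.has(cmd.id)) {
      executed.push(cmd.command); // stand-in for actually clearing the cache
      confirmed.push(cmd.id);
    }
  }
  return confirmed;
}

// Primary side (steps 5 and 9): post the queue, then drop confirmed rows.
function runCron(queue, post) {
  if (queue.length === 0) return queue;
  const confirmed = new Set(post(queue));
  return queue.filter((cmd) => !confirmed.has(cmd.id));
}

const remaining = runCron(primaryQueue, (cmds) => handleFlushPost(cmds, replicatedQueue));
console.log(remaining.map((cmd) => cmd.id)); // [ 2 ]
```

Command 2 stays queued because its row has not replicated yet, so the read-only side refuses to run it. The next cron run retries it, and it only executes once the data it depends on has arrived.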

There are a few gotchas involved, so definitely check out the module documentation if you are interested. There is nothing specific about this strategy to AWS. The same approach could be used in any environment where cross-datacenter replication was occurring.

Wrapping It All Up

MP7 architecture diagram

The image above shows the abstract view of the whole setup. This is quite a lot to digest and the solutions discussed here represent a fair amount of added complexity. The complexity trade-off may or may not be worth it depending on your specific needs. Drupal was clearly not designed to be run in an elastic and/or decentralized architecture, so we have had to work around these limitations. In some cases, our answer has been to create bespoke software, such as our MatchCenter and API, that has been designed from the ground up to scale out in this sort of environment. However, in terms of flexibility and the surrounding community, Drupal is unmatched as a CMS. We feel that the investment to make Drupal work for us is well worth it.

There is much more to MP7 than what we discussed here. We plan on taking a deeper dive on specific topics throughout the year. If there is anything specific you would be interested in, be sure to hit us up on Twitter!

Justin Slattery - @jdslatts
