Yesterday, Amazon’s EC2 (Elastic Compute Cloud) service suffered a major outage in the US East region. The outage caused widespread problems for sites such as Foursquare, Reddit, and Hootsuite, among others. The effects ranged from a minor reduction in functionality to complete unavailability.
Here’s the rub, however: it shouldn’t have become such a huge problem, yet it still was for a significant portion of the web. Let’s break it down.
The house that Amazon built
Amazon’s web services infrastructure is broken up into regions. These can be thought of essentially as individual data centers (which they currently are). There are two primary regions in the US: US East in Virginia and US West in California. Each of these is broken down further into zones (currently four in each region). When a client creates an EC2 instance, they decide which region and zone they want to run it in. In most cases, users will choose the region geographically closest to them and/or their customers, which gives a small improvement in latency.
Yesterday’s issue took out two zones within the US East region, making any server instances and attached disks in those zones unavailable. As of this writing, the zones appear to be functioning properly, and Amazon is working to make the affected disks available again. However, customers on the official support forums are not so optimistic, and are airing their frustrations at the lack of response from Amazon Support.
With enough planning, EC2 clients could have weathered this outage and emerged relatively unscathed. Load-balancing between regions would have allowed services to remain available in a moderately reduced capacity (as evidenced by the sites that managed to stay online), and in some cases left the sites completely unaffected. Planning DNS zones would have allowed client requests to be routed to instances in the US West region if there was an issue reaching any instances in US East.
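The failover idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the sites mentioned here actually did it: the region names, the `REGION_ENDPOINTS` table, the placeholder addresses, and the `pick_endpoints` helper are all hypothetical, and a real setup would feed the result into whatever DNS or load-balancer configuration is in use.

```python
import socket
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical endpoints per region. The addresses are documentation-only
# placeholders (RFC 5737), not real servers.
REGION_ENDPOINTS: Dict[str, List[str]] = {
    "us-east": ["192.0.2.10", "192.0.2.11"],
    "us-west": ["198.51.100.10"],
}


def tcp_healthy(host: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_endpoints(
    preferred: str,
    endpoints: Dict[str, List[str]],
    is_healthy: Callable[[str], bool] = tcp_healthy,
) -> Tuple[Optional[str], List[str]]:
    """Return (region, healthy_hosts) for the first region with a healthy host.

    The preferred region is tried first; the remaining regions are tried in
    the order they appear in `endpoints`. Returns (None, []) if every host
    in every region fails its health check.
    """
    order = [preferred] + [r for r in endpoints if r != preferred]
    for region in order:
        healthy = [h for h in endpoints.get(region, []) if is_healthy(h)]
        if healthy:
            return region, healthy
    return None, []


if __name__ == "__main__":
    # Simulate the outage with a fake health check: every US East host is down.
    down = {"192.0.2.10", "192.0.2.11"}
    region, hosts = pick_endpoints(
        "us-east", REGION_ENDPOINTS, is_healthy=lambda h: h not in down
    )
    print(region, hosts)
```

The check itself, and the DNS records it would update, have to live somewhere other than the region being monitored, otherwise the failover logic goes down together with the servers it is supposed to route around.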
Unfortunately, not everyone who uses EC2 (myself included, for a few projects here and there) can afford to build their site and maintain their services in duplicate or triplicate. Additionally, the DNS services for a good number of sites hosted in EC2 are themselves managed from within EC2, creating a cascading effect when services become unavailable. Finally, there are only two regions in the US, making load-balancing an either-or proposition.
The fallout
Looking through the news coverage of the outage, it appears that many of the bigger sites managed to do OK. Reddit went into an emergency read-only mode, significantly reducing their database traffic while working in a reduced capacity. Foursquare managed to maintain services throughout most of the day with only a few sporadic issues. SCVNGR appears to be functioning normally in the wake of their “The sky is falling!” tweet from yesterday. Hootsuite, however, was essentially down and remains down as of this writing, which suggests that almost their entire infrastructure is housed in the zone that is still having problems.
Since I work in enterprise IT, most of my day is spent working with developers and database administrators on how to maintain availability of applications and databases. When I talked with a friend affected by yesterday’s outage, my first response to “We’re completely down” was “Why aren’t you mirrored between regions?” In his case, they could very easily have afforded to maintain duplicate images in the US West region with multiple load-balancers and DNS zones.
On the surface, the ease and affordability of Amazon’s EC2 has allowed some startups to become complacent and focus on features rather than availability and planning. Hopefully, yesterday’s issues are a wake-up call that leads to a culture that takes these needs into account as well.

