What happened to Amazon EC2 and why it shouldn’t have

Nathan Andrews 22 Apr, 2011 at 1:58pm ET Article published in Tech

Yesterday, Amazon’s EC2 (Elastic Compute Cloud) service suffered a major outage in the US East region. The outage caused widespread problems for sites like Foursquare, Reddit, and Hootsuite, among others. The effects ranged from a minor reduction in functionality to complete unavailability.

Here’s the rub, however: it shouldn’t have become such a huge problem, yet it did for a significant portion of the web. Let’s break it down.

The house that Amazon built

Amazon’s web services infrastructure is broken up into regions. These can be thought of essentially as individual data centers (which they currently are). There are two primary regions in the US: US East in Virginia and US West in California. Each of these is broken down further into zones (currently four in each region). When a client creates an EC2 instance, they decide which region and zone they want to run it in. In most cases, users will choose the region geographically closest to them and/or their customers, which provides a small improvement in latency.

Yesterday’s outage took out two zones within the US East region, making any server instances and attached disks within them unavailable. As of this writing, the zones appear to be functioning properly, and Amazon is working hard to make the affected disks available again. However, customers on the official support forums are not so optimistic, and are airing their frustrations at the lack of response from Amazon Support.
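To make the zone layout concrete, here is a minimal sketch of why placement matters when two of a region's four zones fail. The zone labels and the round-robin placement are my own illustration, not Amazon's API or anyone's actual deployment.

```python
from itertools import cycle

# Hypothetical zone labels standing in for the four zones of one region.
ZONES = ["zone-a", "zone-b", "zone-c", "zone-d"]


def place_instances(count, zones):
    """Assign `count` instances to the given zones in round-robin order."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(count)]


def surviving(placement, failed_zones):
    """Return the placements that are not in a failed zone."""
    return [zone for zone in placement if zone not in failed_zones]


if __name__ == "__main__":
    spread = place_instances(8, ZONES)        # two instances per zone
    packed = place_instances(8, ["zone-a"])   # everything in one zone
    failed = {"zone-a", "zone-b"}             # simulate a two-zone outage
    print(len(surviving(spread, failed)), "of 8 spread instances still running")
    print(len(surviving(packed, failed)), "of 8 packed instances still running")
```

With eight instances spread evenly, half of them survive the simulated outage. With all eight packed into one of the failed zones, none do.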

With enough planning, EC2 clients could have weathered this outage and emerged relatively unscathed. Load-balancing between regions would have allowed services to remain available at a moderately reduced capacity (as evidenced by the sites that managed to stay online), and in some cases would have left the sites completely unaffected. Properly planned DNS zones would have allowed client requests to be routed to instances in the US West region whenever instances in US East could not be reached.
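The failover idea can be sketched in a few lines: try the preferred region first, and fall back to the other when a health check fails. This is only an illustration under my own assumptions. The hostnames are placeholders, and the health check is passed in as a plain function rather than tied to any real DNS or load-balancer product.

```python
from typing import Callable, Optional, Sequence

# Hypothetical endpoints in order of preference. These hostnames are
# placeholders, not real services.
ENDPOINTS = ("us-east.example.com", "us-west.example.com")


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Simulate the outage: US East fails its check, US West passes.
    down = {"us-east.example.com"}
    print(pick_endpoint(ENDPOINTS, lambda host: host not in down))
```

In a real deployment the health check and the routing decision would live in the DNS or load-balancing layer, and that layer would itself need to sit outside the region it is protecting.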

Unfortunately, not everyone who uses EC2 (myself included, for a few projects here and there) can afford to build their site and maintain their services in duplicate or triplicate. Additionally, the DNS services for a good number of sites hosted in EC2 are managed from within EC2, creating a cascading effect when services become unavailable. Finally, there are only two regions in the US, making load-balancing an either-or equation.

The fallout

Looking through the news coverage of the outage, it appears that many of the bigger sites managed to do OK. Reddit went into an emergency read-only mode, significantly reducing their database traffic while running at reduced capacity. Foursquare managed to maintain services throughout most of the day with only a few sporadic issues. Hootsuite, however, was essentially down and remains down as of this writing, implying that almost their entire infrastructure is housed in the remaining problematic zone. SCVNGR appears to be functioning normally in the wake of their “The sky is falling!” tweet from yesterday.

Since I work in enterprise IT, most of my day is spent working with developers and database administrators on how to maintain availability of applications and databases. When I talked with a friend affected by the outage, my first response to “We’re completely down” was “Why aren’t you mirrored between regions?” In his case, they could very easily have afforded to maintain duplicate images in the US West region with multiple load-balancers and DNS zones.

The apparent ease and affordability of Amazon’s EC2 has allowed some startups to become complacent and focus on features rather than availability and planning. Hopefully, yesterday’s issues are a wake-up call that helps create a culture that also takes these needs into account.

Comments

22 Apr 2011 ~ 2:09pm BuddyJ Excellent commentary on the situation. My company is currently pricing cloud hosting for a system-as-a-service product and EC2 is one option we've considered. The recent outage certainly didn't improve anyone's opinion here at the office, but I don't think too many people realized the outage so bemoaned on Twitter was really something that could just as easily have been a minor nuisance. Because of past experiences, I suspect we'll likely go to Rackspace Cloud and pay their premium prices in exchange for not having to deal with the setup intricacies of EC2.
22 Apr 2011 ~ 2:13pm AlexDeGruven I think that was really the biggest issue. People don't know how to plan for real availability, and when disaster strikes, they're left wondering wtf just happened.

EC2 almost makes it too easy to deploy an application with no forethought.
22 Apr 2011 ~ 2:14pm primesuspect Can I put in a solid recommendation for Storm on Demand (Icrontic's host)? Not only is LiquidWeb reliable, but their customer service has been stellar. In addition, good ol' Ardichoke is an employee there and we've gotten lots of very personal and awesome service from him and his colleagues.

https://www.stormondemand.com/
22 Apr 2011 ~ 2:15pm primesuspect Put it this way: Has Icrontic gone down since we moved to them?

(PS: The times Lincoln pushed dev code out to the production server DON'T COUNT)
22 Apr 2011 ~ 2:23pm AlexDeGruven Here's the real question, though: Are we load-balanced between or among their datacenters?

That's the problem that hit most of the people who got hammered yesterday. They had all of their resources in one or two zones within the region. When those zones had problems, they were dead until resolution.
22 Apr 2011 ~ 2:28pm BuddyJ I've personally recommended Storm on Demand to many people but based on the scale of what we're doing and, most importantly, previous experiences I don't see the decision makers swaying from what they know to work.
22 Apr 2011 ~ 3:28pm ardichoke Doesn't look to me like it's back up. Amazon Elastic Compute Cloud (N. Virginia) is still reporting as down on the status page. Also, http://www.cad-comic.com/ is still down (the only reason I even noticed that EC2 was having problems, other than reading about it here and whatnot).

Also, thanks for the love Brian, we <4 you too.
22 Apr 2011 ~ 4:16pm primesuspect I love the fact that the cad-comic page is "hosted" by ZeHosting, which is one of hundreds of "hosting companies" that just resell cloud services. Even better, ZeHosting's tagline is.... well:
22 Apr 2011 ~ 4:23pm Tushon I noticed because reddit went down. The things my friends get me involved in (I stayed away from it for a long, long time)
22 Apr 2011 ~ 5:20pm Garg I wasn't able to check in on Foursquare yesterday morning for a couple hours. But, I often have errors on 4SQ on my phone, so it took me awhile to realize it was anything other than business as usual.
23 Apr 2011 ~ 11:09am Linc

AlexDeGruven wrote:

Are we load-balanced between or among their datacenters?

No, that would be a massively unjustifiable cost given our size.

Here are my hosting experiences summarized:

Random Rackspace Tech: "Ohai guyz, I restarted ur Apaches, all done."
Me: "You just took down half our websites by making a blatant configuration mistake. Nice."
(This month they are 0/6 in getting tickets right the first time! What a great game.)
--
LiquidWeb Tech I Recognize: "I made the requested change, could you confirm? Also, are you the guys we did X & Y absurdly complex things for flawlessly for last summer? I recognized your hostname."
Me: "Yes. Yes we are. "
25 Apr 2011 ~ 1:28am primesuspect URL shorteners exist almost solely for one reason: Twitter. actly.com = 9 characters. act.ly=6 characters.
25 Apr 2011 ~ 9:32am Garg There are numerous alternatives to Libyan TLDs. I usually use goo.gl