Reading between the lines of the AWS outage – Why you should be worried Part 2

TL;DR: The real problem is that there isn’t enough spare capacity.  The us-east-1 region is too big, and during an outage there simply aren’t enough free resources left for users to recover their sites.

In part 1 I discussed how the bug rate on AWS doesn’t seem to be getting better.  A few bugs aren’t necessarily a big deal.  For one, they’re expected given the trailblazing that AWS is doing and the incredibly hard & complex problems they’re solving.  This is a recipe for bad things to happen sometimes.  To account for this, the true selling point of AWS has always been “if something goes wrong, just get more instances from somewhere else and keep on running”.

In the early days (pre-EBS), good AWS architecture dictated that you had to be prepared for any instance to disappear at any time for any reason.  Had important data on it?  Better have four instances with copies of that data in different AZs & regions along with offsite backups, because your data could disappear at any time.  Done properly, you simply started a new instance to replace it, copied in your data, and went about your merry way.
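To make that concrete, here is a minimal sketch of the replace-and-restore pattern using boto3, the current AWS SDK for Python (in those days this would have been boto).  The region, AZ, AMI, snapshot ID, device name, and instance type below are all placeholders; the point is simply that recovery is “launch a fresh instance, rebuild the data volume from a backup, attach it, move on”:

    import boto3

    REGION = "us-west-2"          # recover somewhere other than the failed location
    AZ = "us-west-2a"
    AMI = "ami-xxxxxxxx"          # placeholder: your pre-baked application image
    SNAPSHOT = "snap-xxxxxxxx"    # placeholder: latest backup of the lost data

    ec2 = boto3.client("ec2", region_name=REGION)

    # Launch a replacement instance from a known-good image.
    instance_id = ec2.run_instances(
        ImageId=AMI,
        InstanceType="m1.large",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": AZ},
    )["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Re-create the data volume from the most recent snapshot and attach it.
    volume_id = ec2.create_volume(SnapshotId=SNAPSHOT, AvailabilityZone=AZ)["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device="/dev/sdf")

Nothing about that loop is tied to the AZ or region that just failed, which is exactly why the model works, provided there is somewhere else with capacity to launch into.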

What has unfortunately happened is that nearly all customers are centralized onto us-east-1.  This has serious consequences for the architecture model described above.

Traffic Load

A very common thread in all of the us-east-1 outages over the last two years is that any time there is trouble, the API & management console become overloaded.  Every user is trying to move and/or restore their services, all at once.  And the API/console has been shown to be extremely dependent on us-east-1.  PagerDuty went so far as to move to another region to de-correlate us-east-1 failures from their own.
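In practice this means your recovery tooling has to assume the control plane itself is degraded.  Here is a hedged sketch of the kind of defensive client you end up writing; the backoff parameters are arbitrary, the endpoint is deliberately not us-east-1, and the exact error codes worth retrying may vary:

    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    # Talk to a regional endpoint other than us-east-1, as PagerDuty did, so a
    # us-east-1 control-plane problem doesn't take your tooling down with it.
    ec2 = boto3.client("ec2", region_name="us-west-2")

    def call_with_backoff(fn, max_attempts=8, **kwargs):
        """Retry throttled/overloaded API calls with exponential backoff and jitter."""
        for attempt in range(max_attempts):
            try:
                return fn(**kwargs)
            except ClientError as err:
                code = err.response["Error"]["Code"]
                if code not in ("RequestLimitExceeded", "Throttling", "ServiceUnavailable"):
                    raise
                # Everyone else is hammering the API too; back off and add jitter.
                time.sleep(min(60, 2 ** attempt) * random.uniform(0.5, 1.5))
        raise RuntimeError("API still overloaded after %d attempts" % max_attempts)

    # Example: describe instances even while the control plane is struggling.
    reservations = call_with_backoff(ec2.describe_instances)["Reservations"]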

Competition for Resources

Once again, by virtue of us-east-1 being the largest region, whenever there is an outage every customer starts trying to provision new capacity in other AZs.  But there is seldom enough capacity.  Inevitably in each outage there is an entry in the status updates that says “We’re adding more disks to expand EBS capacity”, or “We’re bringing more systems online to make more instances available”, and so forth.  You can’t really blame Amazon for this one: they can’t keep prices where they are while always running below 50% utilization.  But when lots of instances fail, or lots of disks fill up, or lots of IP addresses get allocated, there just aren’t enough left.
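Here is a sketch of what “shopping for capacity” looks like from the API side, with placeholder AMI and instance-type values: try each AZ in turn and fall back when EC2 reports InsufficientInstanceCapacity, which is precisely what starts happening region-wide during these events:

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")
    CANDIDATE_AZS = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

    def launch_wherever_there_is_room(ami="ami-xxxxxxxx", instance_type="m1.large"):
        """Try each AZ in turn; during a large outage several may be sold out."""
        for az in CANDIDATE_AZS:
            try:
                resp = ec2.run_instances(
                    ImageId=ami,
                    InstanceType=instance_type,
                    MinCount=1,
                    MaxCount=1,
                    Placement={"AvailabilityZone": az},
                )
                return resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                if err.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                    raise
                # This AZ is out of this instance type; keep shopping.
        raise RuntimeError("No capacity in any AZ -- exactly the failure mode described above")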

This is a painful side effect of forcing everyone to be centralized into the us-east-1 region.  us-west is split into us-west-1 & us-west-2 because those datacenters are too far apart to maintain the low-latency connections required to put them under the same regional designation.  us-east has a dozen or more datacenters, and because they are so close together, Amazon has been able to call them all ‘us-east-1’ instead of splitting them into ‘us-east-1’ and ‘us-east-2’.

But what happens when a bug affects multiple AZs in a region?  Suddenly, having all the AZs in a single region becomes a liability.  Too many people are affected at once and they have nowhere to go.  And all those organizations that have architected under the assumption that they can “just launch more instances somewhere else” are left with few options.
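If you genuinely want the “launch somewhere else” escape hatch, the somewhere else has to be another region, and that takes preparation: AMIs are region-scoped, so images (and data) must be copied over ahead of time (e.g. with copy_image).  A rough sketch, with hypothetical region and AMI values:

    import boto3
    from botocore.exceptions import ClientError

    # AMIs are region-scoped, so a cross-region escape hatch only works if the
    # image was copied ahead of time. These IDs are placeholders.
    REGIONAL_AMIS = {
        "us-east-1": "ami-aaaaaaaa",
        "us-west-2": "ami-bbbbbbbb",
        "eu-west-1": "ami-cccccccc",
    }

    def launch_in_any_region(instance_type="m1.large"):
        """Walk a prioritized list of regions until one can actually start an instance."""
        for region, ami in REGIONAL_AMIS.items():
            ec2 = boto3.client("ec2", region_name=region)
            try:
                resp = ec2.run_instances(
                    ImageId=ami,
                    InstanceType=instance_type,
                    MinCount=1,
                    MaxCount=1,
                )
                return region, resp["Instances"][0]["InstanceId"]
            except ClientError:
                continue  # control plane down or no capacity here; try the next region
        raise RuntimeError("Could not launch anywhere -- time for the offsite backups")

The code is the easy part; keeping data replicated and DNS re-pointable across regions is the real cost, which is why so few organizations actually do it.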

P.S. I know things are sounding a little negative, but stay tuned.  My goal here is first to identify what are the truly dangerous issues facing AWS, and then to describe the best ways to deal with them as well as why I still think AWS is the absolute best cloud provider available.

Reading between the lines of the AWS outage – Why you should be worried Part 1

TL;DR: There are many very real problems with AWS that are being ignored.  #1 is that they appear to have no plan for dealing with their ongoing software bugs.

There has been a surprising amount of talk about the most recent AWS outage (Oct 22, 2012). In truth, I was busy coding and didn’t even hear about the outage until it was almost over (my company runs partially on AWS, but in another region). From what I read on the Amazon status site, the scope sounded pretty limited; I didn’t see it as a major event. But the talk since then says otherwise.

Amazon has now released their complete post-mortem, and in reading it I was struck by several hidden truths that I think many people will miss.  I was an early closed beta tester of AWS (when you basically needed a personal invite to get in) and did over 30 top-to-bottom builds of complete production app stacks while I was with RoundHouse.  So I hope to provide some additional insight into what makes the last few AWS outages especially interesting.

What you should really worry about Part 1 – The bugs aren’t getting better

Widespread issues occurred because of (by my reading) 6 different software bugs.  That’s a lot.  This fact can be spun as both a positive and a negative.   Here’s what we would assume are the positives:

  • The bugs can (and will) be fixed.  Once they are, they won’t happen again.
  • By virtue of AWS’s advancing age, they have “shaken out” more bugs than any of their lesser competitors.  Similar bugs surely lie dormant in every other cloud provider’s code as well.  No provider is immune and every single one will experience downtime because of it.  In this regard AWS is ahead of the game.

But this line of thinking fails to address the underlying negatives:

  • Any codebase is always growing & evolving, and that means new bugs.  The alarming part is that each outage seems to surface previously unknown bugs.  The rate at which Amazon lets new bugs through seems disconcertingly high.  It does no good to fix the old ones if other areas of the service have quietly added twice as many new ones.  If things were really being “fixed” we would expect new bugs to show up less often, not more often.  After a certain point, we have to start assuming that the demonstrated error rate will continue.
  • Bugs like this often can’t be seen coming.  Their impact and scope can’t be anticipated, but both are typically very large.

So far we have heard of no comprehensive plan to address this issue.  Ironically, AWS is consistently dropping the ball on their ‘root-cause’ analysis by failing the “Five Whys” test.  They’re stopping at about the 2nd or 3rd why without addressing the true root cause.

In the case of AWS, anecdotal evidence suggests they have not yet succeeded in building a development process that is accountable for the bugs it produces.  They’re letting too many through.  Yes, these bugs are very difficult to detect and test for, but they need to hold themselves to a higher standard or outages like this will continue to occur.