Reading between the lines of the AWS outage – Why you should be worried Part 2

TL;DR: The real problem is there isn’t enough spare capacity.  The us-east-1 region is too big, and during an outage there simply aren’t enough resources left over to let users recover their sites.

In part 1 I discussed how the bug rate on AWS doesn’t seem to be getting better.  A few bugs aren’t necessarily a big deal.  For one, they’re expected given the trailblazing that AWS is doing and the incredibly hard & complex problems they’re solving.  This is a recipe for bad things to happen sometimes.  To account for this, the true selling point of AWS has always been “if something goes wrong, just get more instances from somewhere else and keep on running”.

In the early days (pre-EBS), good AWS architecture dictated that you had to be prepared for any instance to disappear at any time for any reason.  Had important data on it?  Better have four instances with copies of that data in different AZs & regions along with offsite backups, because your data could disappear at any time.  Done properly, you simply started a new instance to replace it, copied in your data, and went about your merry way.

What has unfortunately happened is that nearly all customers are centralized onto us-east-1.  This has many consequences for the architecture model described above.

Traffic Load

A very common thread in all of the us-east-1 outages over the last two years is that any time there is trouble, the API & management console become overloaded.  Every user will be trying to move and/or restore their services.  All at once.  And the API/console has been shown to be extremely dependent on us-east-1.  PagerDuty went so far as to move to another region to de-correlate us-east-1 failures from their own failures.

Competition for Resources

Once again, by virtue of us-east-1 being the largest region, whenever there is an outage every customer will start trying to provision new capacity in other AZs.  But there is seldom enough capacity.  Inevitably in each outage there is an entry in the status updates that says “We’re adding more disks to expand EBS capacity”, or “We’re bringing more systems online to make more instances available”, and so forth.  You can’t really blame Amazon for this one: they can’t keep the prices they have and always be running below 50% capacity.  But when lots of instances fail, or lots of disks fill up, or lots of IP addresses get allocated, there just aren’t enough left.

This is a painful side effect of forcing everyone to be centralized into the us-east-1 region.  us-west is split into us-west-1 & us-west-2 because those datacenters are too far apart to maintain a low-latency connection and so can’t share a regional designation.  us-east has a dozen or more datacenters, and because they are so close together, Amazon has been able to call them all ‘us-east-1’ instead of splitting them into ‘us-east-1’ and ‘us-east-2’.

But what happens when a bug affects multiple AZs in a region?  Suddenly, having all the AZs in a single region becomes a liability.  Too many people are affected at once and they have nowhere to go.  And all those organizations that have architected under the assumption that they can “just launch more instances somewhere else” are left with few options.

P.S. I know things are sounding a little negative, but stay tuned.  My goal here is first to identify the truly dangerous issues facing AWS, and then to describe the best ways to deal with them as well as why I still think AWS is the absolute best cloud provider available.


Reading between the lines of the AWS outage – Why you should be worried Part 1

TL;DR: There are many very real problems with AWS that are being ignored.  #1 is that they appear to have no plan for dealing with their ongoing software bugs.

There has been a surprising amount of talk about the most recent AWS outage (Oct 22, 2012). In truth, I was busy coding and didn’t even hear about the outage until it was almost over (my company runs partially on AWS, but in another region). From what I read on the Amazon status site, the scope sounded pretty limited; I didn’t see it as a major event. But the talk since then says otherwise.

Amazon has now released their complete post-mortem, and in reading it I was struck by several hidden truths that I think many people will miss.  I was an early closed beta tester of AWS (when you basically needed a personal invite to get in) and have done over 30 top-to-bottom builds of complete production app stacks while I was with RoundHouse.  So I hope to provide some additional insight into what makes the last few AWS outages especially interesting.

What you should really worry about Part 1 – The bugs aren’t getting better

Widespread issues occurred because of (by my reading) 6 different software bugs.  That’s a lot.  This fact can be spun as both a positive and a negative.   Here’s what we would assume are the positives:

  • The bugs can (and will) be fixed.  Once they are, they won’t happen again.
  • By virtue of AWS’s advancing age, they have “shaken out” more bugs than any of their lesser competitors.  Similar bugs surely lie dormant in every other cloud provider’s code as well.  No provider is immune and every single one will experience downtime because of it.  In this regard AWS is ahead of the game.

But this line of thinking fails to address the underlying negatives:

  • Any codebase is always growing & evolving, and that means new bugs.  The alarming part is that each outage seems to uncover yet more underlying bugs.  The rate at which Amazon lets new bugs through seems disconcertingly high.  It does no good to fix the old ones if other areas of the service have quietly added twice as many new bugs.  If things were really being “fixed” we would expect new bugs to show up less often, not more often.  After a certain point, we have to start assuming that the demonstrated error rate will continue.
  • When bugs like this happen, they often can’t be seen coming.  The impact and scope can’t be anticipated, but it is typically very large.

So far, we have not heard of any sort of comprehensive plan intended to address this issue.  Ironically, AWS is consistently dropping the ball on their ‘root-cause’ analysis by failing the “Five Whys” test.  They’re stopping at about the 2nd or 3rd why without addressing the true root cause.

In the case of AWS, anecdotal evidence suggests they have not yet succeeded in building a development process that is accountable for the bugs it produces.  They’re letting too many through.  Yes, these bugs are very difficult to detect and test for, but they need to hold themselves to a higher standard or outages like this will continue to occur.

Fixing Passenger error: PassengerLoggingAgent doesn’t exist

While doing a new install of Passenger & nginx, I ran into some strange errors:


2012/09/25 20:09:54 [alert] 2593#0: Unable to start the Phusion Passenger watchdog because it encountered the following error during startup: Unable to start the Phusion Passenger logging agent because its executable (/opt/ruby-enterprise-1.8.7-2011.12/lib/ruby/gems/1.8/gems/passenger-3.0.12/agents/PassengerLoggingAgent) doesn't exist. This probably means that your Phusion Passenger installation is broken or incomplete. Please reinstall Phusion Passenger (-1: Unknown error)

Our environment is using Ubuntu 12.04 LTS & Chef with the nginx::source & nginx::passenger_module recipes from the opscode cookbook. It turns out there were two root causes here that needed to be resolved:

  1. Even though the config explicitly stated to use version 3.0.12, the 3.0.17 passenger gem was also getting installed.  Some things were going to one place, some to another.
    • SOLUTION: I figured it’d just be easier to stick with the latest release, so I changed the setting to use 3.0.17 and uninstalled the old version.
  2. The PassengerLoggingAgent was failing to be installed (but it was failing silently).

SOLUTION: It turned out that we were missing some libraries.  Building the passenger package manually showed the details:

root@w2s-web01:/opt/ruby-enterprise-1.8.7-2011.12/lib/ruby/gems/1.8/gems/passenger-3.0.17# rake nginx RELEASE=yes
g++ ext/common/LoggingAgent/Main.cpp -o agents/PassengerLoggingAgent -Iext -Iext/common -Iext/libev -D_REENTRANT -I/usr/local/include -DHASH_NAMESPACE="__gnu_cxx" -DHASH_NAMESPACE="__gnu_cxx" -DHASH_FUN_H="<hash_fun.h>" -DHAS_ALLOCA_H -DHAS_SFENCE -DHAS_LFENCE -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wpointer-arith -Wwrite-strings -Wno-long-long -Wno-missing-field-initializers -g -DPASSENGER_DEBUG -DBOOST_DISABLE_ASSERTS ext/common/libpassenger_common.a ext/common/libboost_oxt.a ext/libev/.libs/libev.a -lz -lpthread -rdynamic
In file included from ext/common/LoggingAgent/LoggingServer.h:46:0,
from ext/common/LoggingAgent/Main.cpp:43:
ext/common/LoggingAgent/RemoteSender.h:31:23: fatal error: curl/curl.h: No such file or directory
compilation terminated.
rake aborted!
Command failed with status (1): [g++ ext/common/LoggingAgent/Main.cpp -o ag...]

So the libcurl development headers were the true issue.

apt-get install libcurl4-openssl-dev

was the solution.  Or in our case it was to add:

package 'libcurl4-openssl-dev'

to the nginx recipe in Chef.  I hope this helps someone else out there!
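
For reference, here is roughly what the combined fix looks like in Chef terms.  This is just a sketch: the attribute path used to pin the Passenger version is an assumption about the opscode cookbook’s layout, so double-check it against your cookbook version.

# Attributes file (or a role/environment): pin Passenger so the gem and the
# compiled nginx module stay on the same version.
# NOTE: the exact attribute path is an assumption; verify it in your cookbook.
default['nginx']['passenger']['version'] = '3.0.17'

# nginx recipe (or a wrapper recipe): install the libcurl development headers
# before Passenger's agents are compiled, so PassengerLoggingAgent gets built
# instead of failing silently.
package 'libcurl4-openssl-dev'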

AWS VPC Error: Client.InvalidParameterCombination

When trying to execute an ec2-run-instances command for a VPC, you must specify both which subnet & which security group you want it to belong to:

ec2-run-instances ami-abc123 \
 --group sg-abc123 \
 --subnet subnet-abc123 \
 --private-ip-address 10.0.1.10 \
 .... your other params

However, doing so generates this error:

Client.InvalidParameterCombination: Network interfaces and an instance-level security groups may not be specified on the same request

I even found one lowly report of someone else with this issue: https://forums.aws.amazon.com/message.jspa?messageID=368030

Luckily, my company has premium AWS support and a quick 10-minute chat got the answer I needed.  You must use the --network-attachment param, which takes the place of --group, --private-ip-address, and --subnet.

The resulting command looks like this:

ec2-run-instances ami-abc123 \
  --network-attachment :0:subnet-abc123::10.0.1.10:sg-abc123:: \
  .... your other params

Good luck, I hope this helps!

Where do you get hosting support?

For quite some time now, I’ve found that the options for good Rails hosting have been significantly lacking.  As a consultant/contractor on a huge range of projects, I’m often asked for advice, guidance, or help in choosing and setting up servers for a client.  Nearly every client or customer wants the same thing:

  1. Stability/reliability
  2. Flexibility/room to grow
  3. Someone to keep things running
  4. Someone to call when they need help

The first two requirements are met by a lot of providers.  Tier IV datacenters, hardware redundancy, and virtualization are a dime-a-dozen nowadays, and building a good Rails stack is just about the same for everyone.

However, the rub is in #3 & #4.  As my colleagues, peers, and I already work full-time writing new applications, there is precious little time for system administration and support of completed projects and old apps.  Even with extensive automation, a small 1-3 person team can write many more Rails apps than they can support long-term.

Inevitably, the client wants to know “Who is going to keep things going once development is done?” and “Who can I call when things stop working?”.  Set them up on physical machines, VPSes, EC2, or anything else and the developer is left with little choice but to help keep that server running long-term.  Including late night phone calls when something goes wrong.  And can you honestly say you are regularly doing all the little extra things that need to be done?  General maintenance?  Security patches?  Tuning?

Want an alternative?  AWS Premium support won’t touch your software stack.  Rackspace won’t support Rails.  Slicehost: no managed option at all.  There’s really only one player: just google ‘rails cloud support’.

So my question is: If you can get easy, scalable, on-demand hosting, why can’t you get easy, scalable, on-demand support? My answer to this issue is to launch a service that lets developers keep developing while someone else takes care of the system administration long-term:  RoundHouse Support.  Please read my public release announcement and then come check us out!

Announcing RoundHouse – Managed support for your host

I’m very happy today to announce a new community service available for Rails shops: RoundHouse – Server Management and Support.

RoundHouse is a cooperative solution for getting managed servers and system administration for your Rails stack, no matter what host you use.  We’re gathering a pool of specialists that you can call upon to get the help you need, whether that’s emergency support when you’re having server problems, regular day-to-day duties, or assistance in configuring a particularly difficult piece of software.

This is a service that provides freelancers, development shops, and companies alike an opportunity to focus on their product instead of on their hosting.  For developers this means freeing up more time to code.  For those running a website it means reliable service from a great group of experts.  For everyone it means having someone available whenever you need it.

Obviously this is a new offering, and system administration (much like your hosting provider) must be utterly reliable.  So we’re beginning to establish a base set of clients to try out our service for free.  This gives us the chance to get established, continue to develop a solid organizational structure, and to expand our brand.  For our customers, it means you’re going to get excellent sysadmin support at no charge while you learn about all the great things we can offer (and then hopefully recommend us to all your friends!).  So if you’ve been needing help setting up or running your Rails app, please contact us to get started.

We’re also looking to add additional members to our team as we continue to grow.  If you have expertise in system administration, elements of the Rails stack, or are just a great all-around DevOps engineer, please e-mail us at jobs@roundhousesupport.com and we can talk more!

Finally, please feel free to read my expanded rationale for how this service fits into the Rails ecosystem.

Presenting gem_cloner

Besides being a Ruby/Rails/Merb developer, I’m also a part-time sysadmin for a number of previous clients.  Usually I’m responsible for maintaining Rails stacks, either for apps that I’ve written or just for another developer that doesn’t have as much Linux experience.

Lately, I’ve had to move a number of Rails installations to completely new/clean servers.  I’ve got lots of scripts for doing initial setups of the stack as they need to be.  But one thing that comes up is that, especially with older apps, the gem dependencies can be very finicky.  Installing the latest versions will almost certainly break something.  Plus, sometimes the system can have quite an extensive list of them.

Yes, I know that the gems should be packaged with the app, but there are a lot of reasons that it doesn’t always happen or doesn’t always work.  To that end, I’ve found the most effective method is just to re-install the exact same set of gems on the new box as the old one.  To automate this process, I present: gem_cloner.

gem_cloner is a very tiny but useful script that will take the text output of `gem list` from one machine and execute the `gem install` command on the new machine.  Usage is very simple:

  1. On the old machine, run `gem list > gems.txt`
  2. Copy gems.txt to the new machine.
  3. Copy the gem_cloner.rb file to the same place.
  4. With sudo or as root, run `ruby gem_cloner.rb`

The script will read that file and install the exact same gem versions.  You’ll definitely want to browse and tweak the script, possibly by adding ‘sudo’ to the command call or adding ‘--no-rdoc --no-ri’ (I personally use a gemrc to eliminate the doc files on production systems).
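
If you just want the gist, the core idea is small enough to sketch here.  This is an illustrative approximation, not the actual gem_cloner.rb source; the gems.txt filename and the doc-skipping flags are simply the example choices from above.

#!/usr/bin/env ruby
# Rough sketch of the gem_cloner idea (not the real gem_cloner.rb source):
# parse the saved `gem list` output and reinstall every listed version.
File.readlines('gems.txt').each do |line|
  # `gem list` lines look like: "rake (0.9.2.2, 0.8.7)"
  next unless line =~ /\A(\S+) \((.+)\)/
  name, versions = $1, $2
  versions.split(', ').each do |version|
    # Prepend 'sudo' or drop the doc flags to suit your environment.
    system("gem install #{name} -v #{version} --no-rdoc --no-ri")
  end
end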

Fork, patch, & praise ad nauseam on GitHub and drop me a line if you like it.