I’ve been a big fan of Amazon Web Services (AWS) because they lower the costs of startup experimentation. I’ve sponsored their events, judged their startup competition, etc. I have friends on the team. I’ve also had frank conversations with them about service level agreements and what it means to be an infrastructure provider in a mashup world. Mashups increase the need for high availability and uptime. If the user experience of a mashup application requires, say, five web services from three separate companies to be available, the overall probability of failure goes up substantially. It’s the weakest-link-in-the-chain argument.

The Net learned this the hard way yesterday when multiple AWS services (S3, EC2, SQS, SimpleDB, etc.) had a multi-hour outage. The problem was exacerbated by the fact that, internally, various AWS services depend on one another, especially on the storage service, S3.

It looks like the cause of the outage was a particular usage pattern of S3:
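The weakest-link math is easy to work out. If every service must be up for the mashup to work, the availabilities multiply, and even "three nines" per service erodes quickly. A quick illustrative calculation (the 99.9% figure is an assumption for the example, not any provider's actual SLA):

```python
# Back-of-the-envelope availability math for a mashup that depends on
# several services, all of which must be up at once. Numbers are
# illustrative, not measured.

def combined_uptime(uptimes):
    """Overall availability when every listed service must be available."""
    result = 1.0
    for u in uptimes:
        result *= u
    return result

# Five services, each promising "three nines" (99.9% uptime):
overall = combined_uptime([0.999] * 5)
print(f"Overall availability: {overall:.4%}")                      # ~99.50%
print(f"Expected downtime/year: {(1 - overall) * 8760:.1f} hours") # ~43.7
```

Five individually reliable services still yield almost two days of expected downtime a year for the mashup as a whole.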
What caused the problem however was a sudden unexpected surge in a particular type of usage (PUT’s and GET’s of private files which require cryptographic credentials, rather than GET’s of public files that require no credentials). As I understand what Kathrin said, the surge was caused by at least one very large customer plus several other customers suddenly and unexpectedly increasing their usage.
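The distinction matters because an authenticated request is more expensive to serve: each one carries an HMAC signature that S3 must verify, while an anonymous GET of a public object skips that work. At the time, S3 used an HMAC-SHA1 scheme (what AWS later labeled Signature Version 2). A minimal sketch of that style of signing, with made-up credentials and object names:

```python
import base64
import hashlib
import hmac

def sign_s3_request(secret_key, verb, resource, date,
                    content_md5="", content_type=""):
    """Compute an S3-style (Signature V2) request signature.

    Every authenticated PUT/GET requires this HMAC to be computed by
    the client and re-computed and checked server-side -- per-request
    crypto work that anonymous GETs of public files avoid entirely.
    """
    string_to_sign = "\n".join(
        [verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# Hypothetical credentials, bucket, and key:
sig = sign_s3_request("EXAMPLE_SECRET_KEY", "GET",
                      "/my-bucket/private-report.pdf",
                      "Tue, 15 Jul 2008 19:20:30 GMT")
print("Authorization: AWS EXAMPLEACCESSKEYID:" + sig)
```

A sudden surge of traffic shifting toward this authenticated path means a surge in per-request signature verification, which is plausibly why this particular usage pattern hurt.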
I would highly recommend that anyone who is building a developer community, providing SaaS infrastructure, or relying on SaaS infrastructure take the time to read the many posts on the AWS forums about the outage. You can hear the real pain and frustration of people whose businesses depend on AWS. The key complaint was not that the service failed (failures do happen) but that Amazon was not prepared to engage with the developer community around the failure.
It’s AmazING the fact of having no info on what’s happening. Absolutely unacceptable. Come on, people on this forum are all tech guys, so we understand that bad things happen from time to time. However, you MUST be transparent with your customers and give them details on what’s going on (yes, we want to know exactly what’s happening and not a standard response like ‘The issue is resolved’). In fact, it is not. So please, scale these complaints to the right person and post the technical explanation of the issue as soon as possible.
Jesse Robbins over at O’Reilly has a good post comparing how Amazon dealt with the situation to how Salesforce responded to its infamous outage a couple of years ago. I’ve also blogged before about how SaaS brings increased responsibilities.

All in all, Amazon worked very hard to get the issue resolved, and the community was thankful for their efforts.
As I said before, you need to be transparent with your customers. No service can provide 100% uptime. It’s a fact. No matter if u have a redundant anycast network or supercalifragilisticexpialidocious elastic clouds. I just want to get notified and know what’s exactly happening. Nothing else. That said, the issue was resolved very fast, so you should be very proud. Hats off to Amazon’s IT staff.