I work for an independent (read: not venture-funded) mobile app developer that does some serious traffic (hundreds of thousands of daily active clients). Since January, our infrastructure has been hosted 100% on Amazon Web Services - more specifically, directly in the afflicted availability zone (us-east-1a). All told, we experienced only about an hour of downtime. Given the commentary going around the web, I thought I'd share some of the decisions we made that enabled us to recover quickly.
1) We paid for a multi-az database. Our database is by far the most expensive component of our infrastructure, and we essentially pay double to make it multi-az. Recently we upgraded to a db.m2.xlarge, and as part of that upgrade it was moved into us-east-1a. Amazon informed us that during the outage, an automatic failover did occur and our database queries were handled by the second copy. I realize multi-az RDS is extremely costly, but most small startups with some degree of traction should hopefully be able to pay for a multi-az small instance and upgrade from there. Anecdotally, it sounds like one of the major problems that hurt Quora was the fact that they had their own DB instances, which were both hosted in 1a.
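The nice thing about multi-az RDS is that the failover happens behind a single DNS endpoint, so the application's only real job is to reconnect when a connection drops. Here's a minimal sketch of that idea - the `connect` callable and endpoint name are stand-ins, not our actual driver or hostnames:

```python
import time

def connect_with_retry(connect, endpoint, attempts=5, delay=0.01):
    """Retry a connect function against a single endpoint.

    During an RDS multi-az failover, the same endpoint starts
    resolving to the standby after a short gap, so retrying the
    connection is enough - no application-side failover logic.
    """
    last_err = None
    for _ in range(attempts):
        try:
            return connect(endpoint)
        except ConnectionError as err:
            last_err = err
            time.sleep(delay)  # brief pause while DNS flips to the standby
    raise last_err
```

In practice your database driver's built-in reconnect handling does this for you; the point is just that multi-az keeps the failover invisible above the connection layer.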
2) We have at least two instances of every core component. As obvious as this sounds, we wasted a lot of time with things like clustered membase and master-slave redis replicas before we finally just stopped and did a minor amount of engineering to vastly improve our recoverability. Here's essentially how it works: we have a memcached machine in us-east-1a and an identical one in us-east-1b. The two machines know nothing of one another. The same is true for redis: there is one in us-east-1a and one in us-east-1b, but there is no master/slave connection. We have between 6 and 8 front-end application servers split between 1a and 1b. On a cache read (from either redis or memcached), the application server figures out its AZ and reads from the redis/memcached machine in that same AZ. Cache writes are done to the machines in both AZs at the application level. The cache machines should be in sync, but if they're not, it's a cache, whatever, it'll expire eventually anyway :P
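The pattern above is simple enough to sketch in a few lines. This is a hedged illustration, not our production code - the AZ names match the post, but the backends here are plain dicts standing in for real memcached/redis clients:

```python
class AZAwareCache:
    """Read from the cache in your own AZ; write to the caches in all AZs.

    Reads stay zone-local, so losing the other AZ never stalls a read.
    Writes fan out so each zone's cache stays warm independently.
    """

    def __init__(self, my_az, backends):
        # backends: dict mapping AZ name -> cache client (dicts here)
        self.my_az = my_az
        self.backends = backends

    def get(self, key):
        # Zone-local read: no cross-AZ dependency on the read path.
        return self.backends[self.my_az].get(key)

    def set(self, key, value):
        # Fan-out write: every AZ's cache gets the value.
        for cache in self.backends.values():
            cache[key] = value
```

Usage looks like: an app server in 1a calls `set`, and an app server in 1b immediately sees the value on `get`, even though the two cache machines never talk to each other.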
3) The remaining single points of failure have been imaged as AMIs and hold no core data. For example, we have a cron server and a graphite graphing server. Neither was hosted in 1a, but had they been, we could have (eventually) spun up new instances and kept going. I realize there was a window of time when new instance creation was failing in all AZs, so this could have taken longer - I guess on this point we got lucky.
The outage affected several of our 1a web frontends and one of our redis machines. Because our machines in us-east-1b still talked to the redis machine in 1a, it took us some time to figure out that the failures of our machines in 1b were caused by a stall at web server startup while attempting to connect to the redis instance in 1a. Once we had identified that, we effectively cut the cord between the machines in 1b and 1a (so redis writes would no longer be attempted against the 1a machine) and removed all the 1a instances from the load balancer. A few other servers living in 1a and 1b that handle other functions were likewise cut off from each other. We spun up another 4 app servers in 1b, which took some time but all completed within about an hour, put the new instances in the LB, and were back up. Our remaining memcached and redis instances in 1b were already capable of handling our load, and the other major single point of failure, the database, didn't go down because it was multi-az.
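The startup stall is worth calling out: a blocking connect to a dead cross-AZ backend took down otherwise-healthy servers. One way to defend against that - a sketch of the general technique, not what we actually shipped - is to connect with a short timeout and treat an unreachable cache as a miss rather than a fatal error:

```python
import socket

def connect_with_timeout(host, port, timeout=0.5):
    """Try to reach a cache backend, but never stall startup on it.

    A hung or unreachable backend (like our 1a redis during the
    outage) returns None quickly instead of blocking the web server.
    """
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return None  # backend unreachable; degrade instead of blocking
```

With this shape, an AZ failure degrades the cache hit rate instead of cascading into the healthy zone's web tier.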
In the aftermath our dashboard showed about half a dozen servers, all in 1a, all terminated. The fact that we didn't have to spend time trying to restore or recover data from these machines, and could instead just wholesale terminate them and redirect our traffic to the healthy AZ, is core to how we were able to survive with a minimal amount of downtime.
Obviously it would be great to have servers in multiple datacenters in multiple regions, but for us the cost in dollars and eng time would be prohibitive. We'd also put the notion of having code to auto-detect and redirect traffic when an AZ goes down in the same bucket - nice to have, but way too complicated for the eng team to build and support right now. Rather than focus on building an all-or-nothing automatic failover architecture, we had an architecture that made it possible - through manual intervention - to cut some cords and recover in the face of an AZ failure. Given how other sites have fared, we feel pretty good about our decisions.
I know there are still a lot of people recovering their data, and RDS still has some unpleasantness to work through, but good luck to every other startup out there in the cloud, especially the small ones.
April 30, 2011 - Update per Amazon's Post-Mortem: While some multi-az instances did become unavailable during this outage, by selecting the multi-az option we reduced the likelihood of an outage from 45% to 5%.
Furthermore, while there was a period of time during which it became impossible to spin up new instances in the healthy AZ, already-running instances there remained healthy. Thus, during the worst-case few hours of the outage, we could only run our site at half capacity (because half of our servers were in the healthy AZ). Once EBS control plane API access had been restored, we were able to spin up new instances in the healthy AZ and run at full capacity once more.
Overall, while there was an element of luck in our experience, we still feel our architectural investment paid off during the outage.