Looks like quite a few (if not all) of the Amazon AWS services are down or significantly degraded this morning. This is the first significant outage since we started using S3 to serve images for WordPress.com. Currently we serve about 1500 image requests per second across WordPress.com; about 80-100 per second go through S3, and the rest are served from our local caches. When the outage occurred, our systems detected the errors and automatically sent the requests normally bound for S3 to the local image servers we keep for backup and failover purposes. The outage is currently going on 2+ hours. I wonder what impact, if any, this will have on AWS. It seems like quite a few folks are using S3 and EC2 as their sole source of computing power and storage; I wonder if they will move to more traditional hosting providers where there are formal SLAs, support, etc.
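The failover described above can be sketched as a simple try-in-order loop. This is a minimal illustration, not our actual code: the origin functions and their error interface here are hypothetical stand-ins for the real S3 and local image tiers.

```python
def fetch_with_failover(path, origins):
    """Try each origin in order; return the first successful response.

    `origins` is a list of callables that take a path and return bytes,
    raising OSError on failure (a hypothetical interface -- the real
    WordPress.com failover logic is not shown here).
    """
    last_err = None
    for fetch in origins:
        try:
            return fetch(path)
        except OSError as err:
            last_err = err  # this origin errored; fall through to the next
    raise RuntimeError(f"all origins failed for {path}") from last_err

# Simulated origins: S3 is "down", the local backup image servers work.
def s3_fetch(path):
    raise OSError("503 Service Unavailable")

def local_fetch(path):
    return b"\x89PNG..."  # stand-in for real image bytes
```

With S3 erroring, `fetch_with_failover("photo.png", [s3_fetch, local_fetch])` transparently returns the copy from the local tier, which is essentially what our systems did automatically when the outage hit.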
UPDATE: Looks like after about 2.5 hours of downtime, things are starting to come back online over at Amazon.
UPDATE: I guess there is an SLA for S3.
Scott Beale reports that many Web 2.0 websites were affected by today’s power outage at 365 Main in San Francisco. While it's unfortunate, as a systems guy I have to assume things like this are going to happen. They shouldn’t happen, but they can and they will. At the data center level, there should be multiple levels of redundancy to minimize the probability of a power outage: multiple power circuits, redundant UPSes, and generators are standard. For a complete power outage to occur, multiple systems should have to fail simultaneously. I looked for a statement from 365 Main as to what the problem was, but couldn’t find one.
The system architecture behind WordPress.com and Akismet is designed to take entire data center failures into account. For WordPress.com, we serve live content in real-time from 3 data centers (33% from each data center) and in the event of a data center failure, traffic is automatically re-routed to the 2 remaining data centers. Syncing content in real-time between multiple data centers has not been easy, but at times like this I am sure that we made the right decision.
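The routing behavior described above can be sketched with a simple hash-based distributor. This is an illustration only: the data center names are hypothetical, and the real traffic distribution happens at the network/DNS layer, not in application code like this. Hashing each request across the set of healthy data centers gives each one roughly a third of the traffic; drop one from the healthy set, and its share automatically spreads over the remaining two.

```python
import hashlib

DATACENTERS = ["dc1", "dc2", "dc3"]  # hypothetical data center names

def pick_datacenter(request_id, healthy):
    """Spread requests evenly across the currently healthy data centers.

    With all three up, each receives ~33% of traffic; if one fails,
    the same hash spreads requests over the remaining two.
    """
    if not healthy:
        raise RuntimeError("no healthy data centers")
    # Hash the request id to a large integer, then map it onto the
    # healthy list. md5 is used only for its even distribution here.
    h = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
    return healthy[h % len(healthy)]
```

One caveat worth noting: plain modulo rehashing like this moves more traffic than strictly necessary when a data center drops out (requests shuffle between the survivors too); consistent hashing avoids that, but the basic idea of automatic rerouting is the same.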