AWS just can’t catch a break

Frederic Lardinois
For the third time this month, AWS today suffered an outage in one of its data centers. This morning, a power outage in its US-EAST-1 region affected services including Slack, Asana and Epic Games.

The issues started around 7:30 a.m. ET and their knock-on effects continue to plague the service as of 1 p.m. ET, with AWS still reporting problems with a number of services in the region, specifically its EC2 compute service and related networking functions. Most recently, the single sign-on service in this region also started seeing increased error rates.

“We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region,” the company explained in an update at 8 a.m. ET. “This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so.”
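For readers wondering what "failing away" from an Availability Zone can look like in practice, here is a minimal sketch using boto3, AWS's Python SDK. The Auto Scaling group name is a hypothetical placeholder, and the sketch assumes a group configured by Availability Zone rather than by subnet (subnet-based groups would need their VPCZoneIdentifier updated instead). Note that AZ IDs like USE1-AZ4 map to different zone names (us-east-1a, us-east-1b and so on) in each AWS account, so the sketch resolves the ID first.

    import boto3

    REGION = "us-east-1"
    AFFECTED_ZONE_ID = "use1-az4"   # AZ ID from the AWS status update
    ASG_NAME = "my-web-asg"         # hypothetical Auto Scaling group name

    ec2 = boto3.client("ec2", region_name=REGION)
    autoscaling = boto3.client("autoscaling", region_name=REGION)

    # AZ IDs map to different zone names in each account, so resolve the
    # ID to this account's zone name before touching the group.
    zones = ec2.describe_availability_zones()["AvailabilityZones"]
    affected_name = next(
        z["ZoneName"] for z in zones if z["ZoneId"] == AFFECTED_ZONE_ID
    )

    # Drop the affected zone from the Auto Scaling group so replacement
    # capacity launches only in the remaining, healthy zones.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]

    healthy_zones = [z for z in group["AvailabilityZones"] if z != affected_name]
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME,
        AvailabilityZones=healthy_zones,
    )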

If this had been the only AWS outage in recent weeks, it would have barely been noteworthy. Given the complexity of modern hyperscale clouds, outages are bound to happen every now and then. But for AWS, outages have become a weekly occurrence. On December 7, the same US-EAST-1 region went down for hours due to a networking issue. Then, on December 17, an outage that affected connectivity between two of its West Coast regions took down services from the likes of Netflix, Slack and Amazon's own Ring. To add insult to injury, all of these outages came shortly after AWS touted the resilience of its cloud at its re:Invent conference earlier this month.

Ideally, of course, none of these outages would happen at all. AWS users can protect themselves to a degree by architecting their systems to fail over to a geographically separate region, but that can add significant cost, so some decide the trade-off between downtime and expense isn't worth it. At the end of the day, it's on AWS to provide a stable platform. And while it's hard to say whether the company is simply having a string of bad luck or whether systematic issues are behind these problems, if I were hosting a service in the US-EAST-1 region right now, I would at least consider moving it elsewhere.
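As a rough illustration of that kind of multi-region setup, the sketch below uses boto3 to create Route 53 DNS failover records that send traffic to a primary endpoint in us-east-1 and, when its health check fails, to a standby in us-west-2. The hosted zone ID, health check ID, domain name and IP addresses are all hypothetical placeholders; a real deployment would more likely point alias records at load balancers, and the cost mentioned above comes from keeping that second-region standby warm and its data replicated.

    import boto3

    route53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z123EXAMPLE"        # hypothetical hosted zone
    PRIMARY_HEALTH_CHECK_ID = "abc-123"   # hypothetical health check on the us-east-1 endpoint

    def failover_record(set_id, failover, ip, health_check_id=None):
        """Build an A record that participates in Route 53 failover routing."""
        record = {
            "Name": "app.example.com.",
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": failover,          # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        return record

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Fail over from us-east-1 to us-west-2 when unhealthy",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": failover_record(
                        "primary-us-east-1", "PRIMARY", "198.51.100.10",
                        PRIMARY_HEALTH_CHECK_ID,
                    ),
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": failover_record(
                        "secondary-us-west-2", "SECONDARY", "203.0.113.20"
                    ),
                },
            ],
        },
    )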