AWS experiences 12+ hour multi-service outage
71 points by adriano
The entire thing, as someone who spends most of their day in the devops/sysadmin world, is absolutely fascinating. First, people have been warned off of being in us-east-1 for as long as I can remember. I've never had a conversation with an AWS employee where, if you mention "oh yeah, we're primarily in us-east-1", they don't make a face and encourage you to "strongly consider multi-region". There's clearly some sort of underlying problem or capacity issue there.
Now, on the plus side, it wasn't that long ago that us-east-1 sneezed and everywhere else caught a cold. AWS made a lot of commitments years ago that they were going to make the regions more independent from us-east-1 and, at least this time, it seems to have worked. That's great, and to AWS's credit they have been telling people to have a multi-region plan for years. Nobody should have been surprised.
That said, AWS has not made it easy enough for companies to get out of us-east-1, and they haven't done the work to encourage people to leave. It's still the default for too many services, and it's still the place where too many customers have their primary control plane. Instead of AI, AWS should be investing its effort in getting customers out of us-east-1 and leaving the capacity there for their own internal services, because clearly there's some sort of serious problem with having it be the default region for a lot of things while also allowing customers to consume capacity on those same services.
"strongly consider multi-region"
But multi-region really sucks! By design, there is no real cross-region control plane (and if there was, it would be your new single point of failure). Going multi-region means you have to do all the cross-region stuff yourself, which is bad enough that you might just consider going multi-provider instead while you're at it.
Heck, multi-region is so bad that a few of AWS' own services apparently had dependencies on us-east-1 services, which is why those services were having problems globally in all regions yesterday.
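To make the "do it all yourself" part concrete, here is a minimal sketch (in Python, with made-up endpoint names - this is not anyone's actual setup) of the kind of client-side failover you end up owning when there is no cross-region control plane: try the primary region's health check, fall back to the secondary if it doesn't answer.

    # Hypothetical client-side failover between two regional endpoints.
    # The region names and URLs are placeholders, not a real deployment.
    import urllib.error
    import urllib.request

    REGION_ENDPOINTS = [
        "https://api.us-east-1.example.com/health",  # primary (hypothetical)
        "https://api.us-west-2.example.com/health",  # secondary (hypothetical)
    ]

    def first_healthy_endpoint(timeout_seconds: float = 2.0) -> str | None:
        """Return the first endpoint whose health check answers, or None."""
        for url in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
                    if resp.status == 200:
                        return url
            except (urllib.error.URLError, OSError):
                continue  # region unreachable or unhealthy; try the next one
        return None

    if __name__ == "__main__":
        endpoint = first_healthy_endpoint()
        print(endpoint or "no region is healthy; fail the request upstream")

And routing requests is the easy part - keeping data replicated across regions and deciding which side is authoritative during a partition is where most of the real work lives.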
This. I am not a master of our setup at work, so I don't know if this could have been avoided, but to move out of us-east-1, on top of actually moving the VMs, we would need:
etc etc etc. This can be done incrementally, but it's a shitload of work. And this is for a small company without much IT infrastructure. (In our defense, our company actually is on/near the US east coast.)
My paranoia says that Amazon makes it deliberately difficult to move or rearrange stuff because that would make it easier to move off of AWS entirely, and making it hard to move between regions is a side effect. But you'd still think they could put some work into making it a little less of a bottleneck.
I guess "multi-region" is a gentle way of saying "migrate your stuff away from us-east-1". It might sound more palatable because a plain migration is a lot of work with no direct benefit (things will work the same, just in another region; the risk of an outage like yesterday's is theoretical and not really tangible until you experience one), whereas going multi-region is growth! architectural improvement! resiliency! redundancy and high availability! And as such it might be easier to get that project approved.
Maybe your view is more global than mine, but in my experience many European companies hardly rely on us-east-1 by their own choice - unless I misunderstood you.
Like, yes - they have stuff there if they need something on the east coast because they service the US market, but it's not us-east-1 by default or anything (I'm talking about a setup where you start in eu-west-1 and then expand to us-east and us-west - but it could be ANY east coast region, I don't have them handy, it's been too long). The main problem is usually the AWS internal dependencies on us-east-1 (like CloudFormation).
We were hardly affected on eu-west-1, but some of our upstream vendors turned out to be very dependent on us-east-1.
There's clearly some sort of underlying problem or capacity issue there.
us-east-1 has long been the "default" and the first option in the list, so there are a ton of eggs in that basket, and it's not good PR when your brand gets associated with "the thing that breaks a quarter of the internet periodically". I also think it's the oldest region they have, so there are probably bad decisions haunting them.
Azure kind of has the same problem with the West Europe region, but they've gotten a lot better at steering people away from it - the default region selection takes your location into account, and if you still try to use West Europe, it will tell you if a service is available cheaper from a different region.
The upside of us-east-1 is that if AWS issues cause you to go down — it's fine, half of the rest of the internet is down too.
If you're running on a different region and there's an AWS outage there, good luck! Your customers will blame you and churn.
Flagging as off-topic, but I’d love to see a submission detailing the root cause(s).
This is now posted:
Between 11:49 PM PDT on October 19 and 2:24 AM PDT on October 20, we experienced increased error rates and latencies for AWS Services in the US-EAST-1 Region. Additionally, services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time. At 12:26 AM on October 20, we identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints. After resolving the DynamoDB DNS issue at 2:24 AM, services began recovering but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB. As we continued to work through EC2 instance launch impairments, Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch. We recovered the Network Load Balancer health checks at 9:38 AM. As part of the recovery effort, we temporarily throttled some operations such as EC2 instance launches, processing of SQS queues via Lambda Event Source Mappings, and asynchronous Lambda invocations. Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered. By 3:01 PM, all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours. We will share a detailed AWS post-event summary.
Surprise: it was DNS.
"0" days since it was DNS... (It's always DNS)
The root cause was DNS.
The reason it always causes such a clusterfuck is AWS's apparent obsession with dogfooding to extreme degrees.
It'd be like if Royal Canin suddenly said "hey look how good our dog food is, we're also selling beef bourguignon now!"
Are there any plans to fix DNS?
No "fixing" of DNS can prevent someone yoloing an update that points your authentication service cname into the void. (To pick a random example.)
You need an ultimate source of truth & if you tell it to return erroneous data then things will fall over. (Insert quote about every distributed system inevitably having a single point of failure here.)
There are certainly some things that would help. Replacing the single record TTL with two TTLs - one for how long to serve a cached record, and one for how long before revalidation should be attempted - would help reduce the cache misses that lead to pretty noticeable latency; right now resolvers can decide to serve expired records, but that's about it. Fundamentally, though, DNS is too widely deployed to be able to do anything at the protocol level beyond adding new record types.
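To make the two-TTL idea concrete, here is a purely hypothetical sketch (no resolver or RFC defines these fields today; the names are invented) of a cache entry that carries both a serve TTL and a revalidation TTL:

    # Hypothetical two-TTL cache entry: serve_ttl bounds how long the record may
    # be answered from cache at all, revalidate_ttl marks when a background
    # refresh should be attempted. Neither field exists in real DNS.
    import time
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class CachedRecord:
        value: str             # e.g. an A record's address
        fetched_at: float      # unix timestamp when the record was fetched
        serve_ttl: float       # seconds the record may still be served from cache
        revalidate_ttl: float  # seconds after which a refresh should be attempted

        def is_servable(self, now: float) -> bool:
            return now - self.fetched_at < self.serve_ttl

        def needs_revalidation(self, now: float) -> bool:
            return now - self.fetched_at >= self.revalidate_ttl

    def answer(record: CachedRecord, refresh: Callable[[CachedRecord], None]) -> str | None:
        """Serve from cache while allowed; trigger a refresh once revalidation is due."""
        now = time.time()
        if record.needs_revalidation(now):
            refresh(record)  # a real resolver would do this asynchronously
        return record.value if record.is_servable(now) else None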
would help reduce the cache misses that lead to pretty noticeable latency
Modern resolvers reduce latency spikes by refreshing cached records if a client requests them shortly before they expire. The client is answered from cache while the resolver kicks off upstream queries in the background. If the records are not popular enough to be requested within the configurable prefetch window, they are allowed to expire.
There isn’t an RFC for this behaviour because it isn’t deemed to be a protocol change. Unlike, for instance, negative answer synthesis from DNSSEC records, which can help caches suppress spam and typo queries.
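Roughly, that prefetch behaviour looks like the sketch below. resolve_upstream() is a hypothetical placeholder for the real upstream query; actual resolvers (e.g. Unbound with its prefetch option) track record popularity and handle concurrency far more carefully.

    # Minimal sketch of resolver-side prefetching ("refresh ahead").
    import threading
    import time

    PREFETCH_WINDOW = 10.0  # seconds before expiry in which a hit triggers a refresh

    cache: dict[str, tuple[str, float]] = {}  # name -> (answer, expiry timestamp)

    def resolve_upstream(name: str) -> tuple[str, float]:
        """Placeholder for the real upstream query; returns (answer, ttl)."""
        raise NotImplementedError

    def refresh(name: str) -> None:
        answer, ttl = resolve_upstream(name)
        cache[name] = (answer, time.time() + ttl)

    def lookup(name: str) -> str:
        now = time.time()
        entry = cache.get(name)
        if entry is None or entry[1] <= now:
            refresh(name)  # cache miss or expired record: this client pays the latency
            entry = cache[name]
        elif entry[1] - now < PREFETCH_WINDOW:
            # Popular record nearing expiry: answer from cache immediately and
            # refresh in the background so the next client never sees a miss.
            threading.Thread(target=refresh, args=(name,), daemon=True).start()
        return entry[0]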
resolvers can decide to serve expired records
Serve-stale is subtly different. It is intended to mitigate outages when a zone’s authoritative servers are unreachable. It doesn’t help with latency spikes.
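For contrast, a serve-stale sketch (in the spirit of RFC 8767, again with a hypothetical resolve_upstream()): the expired answer is only used as a fallback when the upstream lookup fails, which is why it helps with outages but not with latency.

    # Sketch of serve-stale: stale answers are a fallback for upstream failures.
    import time

    STALE_LIMIT = 3600.0  # seconds past expiry an answer may still be used as a fallback
    cache: dict[str, tuple[str, float]] = {}  # name -> (answer, expiry timestamp)

    def resolve_upstream(name: str) -> tuple[str, float]:
        """Placeholder for the real upstream query; returns (answer, ttl)."""
        raise NotImplementedError

    def lookup_serve_stale(name: str) -> str:
        now = time.time()
        entry = cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh cache hit: the normal path
        try:
            answer, ttl = resolve_upstream(name)  # expired or missing: go upstream
            cache[name] = (answer, now + ttl)
            return answer
        except Exception:
            # Upstream unreachable: fall back to the stale answer if it is not too old.
            if entry is not None and now - entry[1] < STALE_LIMIT:
                return entry[0]
            raise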