Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region
31 points by dryya
One of the under-appreciated features of the standard DNS UPDATE protocol is that it is transactional. (AWS Route53 has proprietary APIs for managing DNS data.)
There are big caveats with DNS UPDATE, though: the update is atomic on each authoritative server individually, but the secondaries will apply the atomic update later than the primary; and if the update modifies multiple RRsets, resolvers might cache a mixture of old and new states.
The main way transactional updates are useful is to eliminate any window where records are missing. (You can avoid spurious NXDOMAIN without transactions in many cases by adding new records before deleting old ones, but that doesn’t work if you are changing from A to CNAME or vice versa.) This is nice because transactional updates make it really easy to ensure an update is idempotent: the UPDATE message can simply say "delete all the old records, regardless of what data they contain, and add these new records", and the change happens as an atomic unit.
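For concreteness, here is a minimal sketch of that pattern using dnspython (my choice of library, not anything implied above; the zone, names, addresses, and server are invented, and a real deployment would also want TSIG authentication):

    import dns.query
    import dns.rcode
    import dns.update

    # One UPDATE message for the zone: delete whatever is at the name,
    # regardless of its current contents, and add the replacement records.
    # The server applies the whole message as a unit, so resending it is harmless.
    update = dns.update.Update("example.com")
    update.delete("api")                        # remove all existing RRsets at api.example.com
    update.add("api", 300, "A", "192.0.2.10")   # install the new A records
    update.add("api", 300, "A", "192.0.2.11")

    response = dns.query.tcp(update, "192.0.2.53")   # send to the primary authoritative server
    print(dns.rcode.to_text(response.rcode()))       # NOERROR on success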
The risk with that kind of idempotent update is that last-write-wins is often not what you want. But that can be fixed with a more obscure feature: an UPDATE can include prerequisites. An UPDATE can check that records are present or absent or contain particular data before applying the change, and the prerequisite checks can examine different names or record types from the ones that are to be changed. This can be useful to prevent TOCTOU races.
So, for example, if you have a versioned RRset that might be updated concurrently by multiple clients, you can stuff the version number in (say) an HINFO record next to the A records, and your UPDATE message can have a prerequisite that the HINFO matches the preceding version before replacing the HINFO and A records as an atomic unit.
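A sketch of that compare-and-swap pattern, again with dnspython and invented names and values (the HINFO CPU/OS fields are just being abused as a version tag):

    import dns.query
    import dns.rcode
    import dns.update

    expected = "v41 config"   # HINFO rdata we believe is current
    next_ver = "v42 config"

    update = dns.update.Update("example.com")
    # Prerequisite: the HINFO RRset must still hold the expected version,
    # otherwise the server rejects the whole message.
    update.present("api", "HINFO", expected)
    update.replace("api", 300, "HINFO", next_ver)   # bump the version...
    update.replace("api", 300, "A", "192.0.2.20")   # ...and swap the A records in the same transaction

    response = dns.query.tcp(update, "192.0.2.53")
    if response.rcode() == dns.rcode.NOERROR:
        print("update applied")
    else:
        # Another writer got there first: re-read the current version and retry.
        print("prerequisite failed:", dns.rcode.to_text(response.rcode()))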
The race condition in the service updating DNS records sounds obvious. Surprising that no one caught that one before (or during the design phase, honestly). Also surprising that this didn't happen before.
That was my initial reaction too, but then the thought occurred that maybe it only sounds obvious because they're giving us a highly simplified description of the actual systems, omitting any detail that isn't directly relevant to the failure. So we're left with a streamlined textbook outline of a race condition, rather than a messy system with complex data flows that have a race condition buried somewhere inside.
Or maybe it really was obvious in reality and they just missed it. Hard to say from the outside.
I'm not surprised, but then I've run global systems and seen first-hand how "We designed it so that can't happen" actually turns out to be "That very, very rarely happens," along with "and only under this obscure event or bug," despite lots of smart, experienced people working on it.
Sometimes the solutions at hand distill down to "the optimal version that will rarely fail" and "the safer version that will never fail but is a bit slower".
No one ever regrets the latter.
I tend to agree but you can still miss things.
A real example of how things can fail, from years ago (not AWS). Scenario: we centrally generated configs for a system, specific to each dc/location, and pushed them out to local staging directories under a temporary name, hard-linked to the final name after completion, and unlinked the temp. A process would check that the file was newer than the last, perform some validation (including that it was not empty and had <10% change), copy it into place the same way, rename the old one while keeping three historical copies, and send a signal to the system to pick it up. You'd think this would be hard to get wrong, but failures I recall: fsync result not checked; a SAN that would report success when the data wasn't actually on disk; a duplicate inode number (really! this was a kernel/fs bug); perfect truncation resulting in a valid but incomplete file; an older file arriving out of order; sync taking too long across regions, resulting in different values being served from different places; and partial/intermittent network-partition havoc. (A rough sketch of the write-and-install step follows below.)
Rare things at scale get fun fast 🤣
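For what it's worth, here is a rough sketch of the write-and-install step in that kind of pipeline (paths and details invented, and using an atomic rename where the setup above used a hard link plus unlink), mainly to show the fsync discipline that one of those failures was missing:

    import os

    def install_config(data: bytes, final_path: str) -> None:
        tmp_path = final_path + ".tmp"
        # Write to a temporary name first so readers never see a partial file.
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # raises OSError if the data didn't reach stable storage
        os.rename(tmp_path, final_path)   # atomic replace on POSIX filesystems
        # fsync the containing directory so the rename itself survives a crash.
        dir_fd = os.open(os.path.dirname(final_path) or ".", os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)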
That's true, you can always miss things, or discover that your foundations aren't as solid as you thought.
My surprise with the issue at hand comes from it just looking trivial. It seems obvious that if you have two uncoordinated services potentially removing and overwriting each other's data in a common place, you'll have issues like these.
Additionally, due to the nature of the issue, I feel like it should have happened way sooner, way more often. Probably it did happen, but didn't trigger this cascading reaction due to other remediations like retries and healthchecks preventing a disaster.
As I mentioned in another message, probably the postmortem is a simplification of a way more complex issue and I'm overanalyzing anyway.
I'm surprised when people use the word "surprised" to describe incidents like this. It can't have been obvious, because then the incident wouldn't have happened.