More Than DNS: The 14 hour AWS us-east-1 outage
26 points by davish
It's unfortunate that AWS postmortems still follow the outdated "5 Whys" approach[0][1].
[0] https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.fivewhys.en.html
[1] https://aws.amazon.com/blogs/mt/why-you-should-develop-a-correction-of-error-coe/
what should they do instead?
5 Whys is a form of root cause analysis, which is based on an underlying assumption that there is a single root cause.
A better model is the “Swiss cheese” model, which observes that in a complex system the individual parts can be degraded or have weaknesses while the system as a whole continues to work. A failure occurs when the broken parts (the holes) happen to line up in an unfortunate manner.
This model leads to a broader analysis that looks for contributing factors, like accident investigation reports. It was pioneered by air safety; you can see it in rail and industrial accident reports too.
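To make the "holes lining up" idea concrete, here is a rough sketch. It assumes the layers fail independently (real systems often don't, which is part of why contributing-factor analysis matters), and the function names and probabilities are made up for illustration:

```python
import math
import random

def aligned_failure_probability(hole_probs):
    """Chance that every layer fails at once, ASSUMING independent layers."""
    return math.prod(hole_probs)

def simulate(hole_probs, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = random.Random(seed)
    failures = sum(
        all(rng.random() < p for p in hole_probs)
        for _ in range(trials)
    )
    return failures / trials

# Three imperfect defenses; the probabilities are illustrative only.
layers = [0.1, 0.2, 0.5]
print(aligned_failure_probability(layers))
print(simulate(layers))

# If one layer silently stops working (hole probability 1.0),
# the overall risk rises even though nothing has visibly failed yet.
degraded = [0.1, 0.2, 1.0]
print(aligned_failure_probability(degraded))
```

The point of the toy numbers is that each layer can be fairly leaky on its own while the combined failure stays rare, and that no single layer is "the" root cause when the holes do line up.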
In my (brief) experience doing COEs at Amazon, they encouraged “branching” whys to allow for multiple causes. However, the result tended to be small action items that could be accomplished by the team writing the COE. To me, it wasn’t terribly effective at leading to systemic changes, especially if the changes weren’t technical but organizational. I wasn’t in AWS, so I don’t know if their process works better.
I've always been of the opinion that accidents happen because at least two things went wrong. Now I have a word for that, thanks.
Oh wow, I'd never heard of the Swiss cheese model... and it makes total sense! Thanks for sharing!
You might also be interested in the Theory of Constraints' "Current Reality Tree" approach.
Any complex system is absolutely crawling with latent bugs. The key to reliability isn't to identify and fix all these latent bugs before they manifest (although eliminating as many as possible is obviously a good thing). It's to ensure that the system is resilient to the failure of any one component (even if it operates in a degraded state), and especially to avoid amplifying local failures into global ones. That amplification is what we saw in this incident: a latent race condition that was mitigated in a couple of hours cascaded into a systemic failure that took several times as long to mitigate.
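One common way to keep a local failure from amplifying is a circuit breaker around calls to a dependency. To be clear, this is a generic technique and not something taken from the AWS postmortem. The class name, thresholds, and the `fallback` convention below are all hypothetical; treat it as a minimal sketch:

```python
import time

class CircuitBreaker:
    """Hypothetical minimal circuit breaker; names and defaults are illustrative."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency entirely until the cool-down elapses,
        # so a struggling component is not hammered with more requests.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Half-open: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()  # degraded answer instead of a propagated error
        self.failures = 0
        return result
```

Usage would look something like `breaker.call(fetch_from_dependency, lambda: cached_value)`, where both callables are placeholders. The caller keeps working in a degraded state, and the failing component gets room to recover instead of being buried under retries.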