More than DNS: Learnings from the 14 hour AWS outage

https://thundergolfer.com/blog/aws-us-east-1-outage-oct20

More Than DNS: The 14 hour AWS us-east-1 outage

A thorough review of a major cloud outage.

Jonathon Belotti [thundergolfer]

Good to see an analysis emphasizing the metastable failure mode in EC2, rather than getting bogged down by the DNS/Dynamo issue. The Dynamo issue, from their timeline, looks like it got fixed relatively quickly, unlike EC2, which needed a fairly elaborate SCRAM and recovery process that took many hours to execute.

A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.

I was motivated by your back-and-forth in the original AWS summary to go and write this post :)

It's good, and I love that you brought the Google SRE stuff into it.