Watching all of the random things that people have been saying about AWS's #outage yesterday reminded me of a discussion that I had with one of my teams while working at Google.
We were doing longer-term planning, and people were proposing multi-year goals. One (moderately senior) teammate wanted a goal of roughly "MTTR decreases X% per quarter".
So, that sounds nice in theory, but it's not how mature services really work. As you fix the easy bugs, you get fewer and fewer trivial outages. "Admin typed dumb thing" mostly goes away with better checking and deployment policies. "Partial backend failure caused cascading failure" is mostly handled by avoiding patterns that cause cascading failures, and then handling partial failures as well as they can be handled. "Trivially bad software release broke things" gets handled by improved testing and canarying over time.
Unfortunately, once you get rid of the easy outages, you're left with *weird* stuff. I/O patterns that trigger latent firmware bugs in SSDs, causing accelerated failure fleet-wide, with a multi-year lead time on replacements. Datacenter fires. Natural disasters. CPU bugs. ROMs that get overwritten by excessive reading. Software bugs that cut across four or more services and somehow manage to find decade-old fatal flaws. Overloading some resource *that no one knew existed* (per-second-level-domain HTTP cookie jar size, undocumented hardware state limits in supposedly stateless routers). Or (one of my favorites) BGP stops converging correctly because several racks were too heavy and their plastic wheels had cracked (yes, really!).
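These are the sorts of things that you *can't* fix fast, because no one even has a good model for what is happening, and none of the usual quick fixes (roll back, drain, loadshed, etc.) are helpful. So as the quick-to-fix outages disappear, the average recovery time of the outages that remain goes *up*, even while the service as a whole gets more reliable.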
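To make the arithmetic concrete, here is a rough sketch in Python. Every number in it is made up for illustration (these are not real incident data from any service), and the names `EASY`, `WEIRD`, and `mttr` are hypothetical. It only shows that removing the quick outages can raise the mean time to recovery while lowering both the outage count and the total downtime.

```python
from statistics import mean

# Hypothetical recovery times in minutes. Invented for illustration only.
EASY = [10, 15, 20, 12, 18, 25]   # rollbacks, drains, config fixes
WEIRD = [600, 1440]               # firmware bugs, cracked rack wheels, ...


def mttr(recovery_minutes):
    """Mean time to recovery: the plain average of per-outage recovery times."""
    return mean(recovery_minutes)


# Early in the service's life: lots of easy outages plus the occasional weird one.
early_outages = EASY + WEIRD

# Later: assume better checks and canarying eliminated all but one easy outage,
# while the weird outages are unchanged.
mature_outages = EASY[:1] + WEIRD

early_mttr = mttr(early_outages)
mature_mttr = mttr(mature_outages)

if __name__ == "__main__":
    print(f"early:  {len(early_outages)} outages, "
          f"{sum(early_outages)} min down, MTTR {early_mttr:.1f} min")
    print(f"mature: {len(mature_outages)} outages, "
          f"{sum(mature_outages)} min down, MTTR {mature_mttr:.1f} min")
```

In this toy example the mature service has fewer outages and less total downtime, yet its MTTR is higher, which is why "MTTR decreases X% per quarter" makes a poor long-term goal.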
In this specific service's case, we weren't *quite* at the maturity level where I expected MTTR to start rising, but we were getting close. And, frankly, we didn't track MTTR very closely anyway.
Takes on AWS's outage like "when we had our own datacenters, we never had long outages without any ETA for recovery" mostly just mean that you never had any of the really *fun* problems.
Remember, the reward for a job well done is a new, harder job.