Last talk of this conference is "The Power of Stories" by Lorin Hochstein, aka @norootcause
What kind of muppet are you?
When something negative, surprising, or unexpected (like an incident) happens, the human response is "How could this bad thing have happened?!"
Stories are important because they are a tool that humans have developed to make sense of senseless events.
The incident prevention grand challenge is "How do you get the right information into the heads of the right people at the exact moment when they need it?"
"Stories" stick in your head, which is helpful for getting information into the heads of people who need it.
"Only a fool learns from his own mistakes, a wise man learns from the mistakes of others" -- unknown
We've also talked a lot about vicarious learning, i.e., learning by watching or listening to someone else.
Example: Shoulder-surfing while someone is looking up dashboards.
"Stories" are good when you can't shoulder-surf.
We're comparing "nursing" to "SRE" -- for example, a patient is deteriorating but it isn't showing up in their vitals yet. In engineering, this is like a system that's in bad shape before any alerts have fired.
Claim: good stories have two properties to be "useful" for some nebulous definition of "useful"
1. The story needs to be anomalous -- something in the story needs to disrupt your mental model of the world. There needs to be a disconnect between your belief and reality. Example: "This should never happen!" which means "I never expected that to happen!"
2. The story needs to be immutable (kind of a weird term to use here): important details are preserved as the story gets passed on.
We're discussing how this applies to the Therac-25 story.
The simple story is that "it's about race conditions", but the real story is a lot more complicated.
One example: the machine frequently threw error messages that were usually harmless, so operators learned to ignore them. But in this case the error meant "the patient's dose is too high" -- and the operators didn't have that info.
When it comes to incidents, there are different kinds or styles of stories you can tell:
1. "The horror story": we failed over and the problem followed us to the new region! 😱
2. "The morality tale": the engineer ignored the failing test and the bad code made it to production 😡
The details of a story depend a lot on the perspective of the storyteller.
Example: the Challenger disaster. Feynman wrote an appendix to the "official" report on the accident. He said this was a story about "management underestimating risks"
Another story, told by Edward Tufte (of The Visual Display of Quantitative Information fame), is that this was about "poor information presentation and bad visualization".
Not surprising that an info-vis guy said it was an info-vis problem 🤣
A third story, told by Diane Vaughan, is that Challenger was about the normalization of deviance.
If you've ever tuned a noisy alert, that's normalization of deviance: the alert fires but the system is healthy, so we make the alert less noisy. We do this all the time!
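The alert-tuning move above might look something like this in a Prometheus-style alerting rule (a hypothetical sketch -- the metric name, threshold, and duration are made up for illustration). Raising the threshold or adding a `for:` delay to quiet a flapping alert is a small, locally reasonable change, which is exactly why deviance gets normalized:

```yaml
# Hypothetical Prometheus alerting rule; names and numbers are illustrative.
groups:
  - name: latency
    rules:
      - alert: HighRequestLatency
        # Originally: request_latency_seconds > 0.5, with no "for:" clause.
        # After "tuning" the noisy alert, we raised the threshold and now
        # require the condition to hold for 10 minutes before paging.
        expr: request_latency_seconds > 1.0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Request latency is high"
```

Each individual edit like this looks like sensible hygiene; the drift only shows up in aggregate.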
"So what, Lorin, you told us all this stuff, what am I supposed to do with it?"
One takeaway: when you do your incident writeups, tell them as a story.
(Is this a good time to plug ACRL's one public postmortem? 🤣 🤣 🤣 🤣 🤣)
https://blog.appliedcomputing.io/p/postmortem-intermittent-failure-in