“when catastrophe turns the corner, and visits our system, who is frequently involved? Again, someone taking actions which make sense to them at the time: pedestrians jaywalk, and engineers hit return. To learn and adapt, we need to understand not just what went wrong, but why we believed it would go right.”
From: How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered — https://www.infoq.com/presentations/incidents-investigation/ #sre #devops #sreweekly #reliability

Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.

InfoQ

The #SREWeekly newsletter is always solid stuff. A few carefully chosen links and a couple of paragraphs; it's great. https://sreweekly.com/sre-weekly-issue-389/

The #DevOpsWeekly newsletter is also great. Often a bit busier, it's thrown up some great tools over the years. https://www.devopsweekly.com/

I also miss Cron.Weekly, which was full of good stuff.

SRE Weekly Issue #389 – SRE WEEKLY

Etsy Engineering | Scaling Etsy Payments with Vitess: Part 1 – The Data Model

At the end of 2020, Etsy’s Payments databases were in urgent need of scaling. Specifically, two of our databases were no longer...

Etsy

"We take you step by step as we uncover a three-phase cycle of Redis latency spikes involving two distinct saturation points and discover a simple fix to break that cycle" – Matt Smiley via @SREWeekly

https://about.gitlab.com/blog/2022/11/28/how-we-diagnosed-and-resolved-redis-latency-spikes/

#gitlab #redis #sre #sreweekly

How we diagnosed and resolved Redis latency spikes with BPF and other tools

How we uncovered a three-phase cycle involving two distinct saturation points and a simple fix to break that cycle.