But he's an SRE, not a historian. His job is to help people make their services more reliable, and help people understand what reliable actually means. Reliability is very important. #devopsdaysNYC
Use of the phrase has been dramatically on the rise since the mid-1980s according to the Google Books Ngram project [ed: one of my favorite things to have been an SRE for!].
Why? Because everything is a Service. IaaS, PaaS, DBaaS, etc. etc. #devopsdaysNYC
We need a new language around service reliability. What does our stack of reliability primitives look like?
First, we have service level indicators (SLIs): metrics that define how well a service is operating (e.g. ratio of good events to total events). #devopsdaysNYC
Important to measure from your user's perspective.
Next, the SLO sets a threshold on the SLIs. Nothing is ever 100% reliable, so SLOs let us pick a more reasonable number.
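[ed: a minimal sketch of the SLI and SLO primitives as described so far. The function names, the event counts, and the 99.9% target are my own illustrative assumptions, not something from the talk.]

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: ratio of good events to total events."""
    if total_events == 0:
        # Assumption: with no traffic, treat the service as fully reliable.
        return 1.0
    return good_events / total_events


def meets_slo(good_events: int, total_events: int, slo_target: float) -> bool:
    """The SLO is a threshold on the SLI; slo_target is a fraction such as 0.999."""
    return sli(good_events, total_events) >= slo_target


# Hypothetical numbers: 100,000 user requests, 50 of them bad.
print(meets_slo(99_950, 100_000, slo_target=0.999))
```

The counts should come from events as users see them, not from internal health checks.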
Then the error budget tracks how we have performed against our SLO over time. #devopsdaysNYC
e.g. "you can have 43 bad minutes per 30 days" rather than thinking in terms of nines.
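[ed: the arithmetic behind a figure like that, as a rough sketch. I'm assuming the 43 minutes corresponds to a 99.9% target over a 30-day window; the function name is hypothetical.]

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unreliability in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1.0 - slo_target) * total_minutes


# Assumed target of 99.9%: about 43.2 bad minutes per 30 days.
print(f"{error_budget_minutes(0.999):.1f} bad minutes per 30 days")
```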
Finally, the SLA implies that there's some kind of contract or compensation involved. Less important to us as SREs than SLOs. #devopsdaysNYC
And hopefully your engineers will be happier too because they'll stop getting paged about things that are not end user experience problems. #devopsdaysNYC
And hopefully your product teams and stakeholders will be happier too.
Your service has one job: to be dependable. Your users define your reliability and dependability, not your own internal metrics. #devopsdaysNYC
How does this actually look? We don't just want to verify the service is running; we have to make sure that it's available to users, performant enough, and returning correct results. #devopsdaysNYC
There are a lot of things to care about. How can we measure only a few things and still cover everything? If you start from the most complex thing (correctness), you automatically get availability etc.
You still have to measure responsiveness separately. #devopsdaysNYC
If you do this, users will have their experiences measured better, engineers will get paged less, and product teams will have better metrics for their products. #devopsdaysNYC
Take an example shopping website service: if we just check whether we can login, that doesn't validate whether people can add items to the cart. You need to look at the *entire* user journey. #devopsdaysNYC
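[ed: one way to picture a whole-journey check, as a sketch only. The step names and the stubbed checks are hypothetical; a real probe would drive the site the way a user does.]

```python
from typing import Callable, List, Tuple

# A journey is an ordered list of (step name, check) pairs.
Step = Tuple[str, Callable[[], bool]]


def run_journey(steps: List[Step]) -> Tuple[bool, str]:
    """Run each step in order; return (success, name of first failing step)."""
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            # A step that raises counts as a failure from the user's point of view.
            ok = False
        if not ok:
            return False, name
    return True, ""


# Stubbed checks for illustration: login works, but adding to the cart is broken.
journey: List[Step] = [
    ("login", lambda: True),
    ("add_to_cart", lambda: False),
    ("checkout", lambda: True),
]
print(run_journey(journey))  # (False, 'add_to_cart')
```

A login-only check would have reported this service as healthy; the journey check counts it as a bad event.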
It absolutely is more work to measure black-box from users' perspectives [ed: or RUM], but you can measure your service the way users experience it. It's worth it. #devopsdaysNYC
The things people generally measure today like API uptime, error rates, or database query latency don't tell you anything about whether users can log in, buy things, or search for items. Take a step aside and think from your users' perspective. [fin] #devopsdaysNYC