Mastodawn

todsacerdoti Mar 31

GitHub's Historic Uptime

https://damrnelson.github.io/github-historical-uptime/

Historical GitHub Uptime Charts

View GitHub's monthly uptime between 2016 and 2026.

mholt Mar 31

Even better IMO is this status page: https://mrshu.github.io/github-statuses/

"The Missing GitHub Status Page" with overall aggregate percentages. Currently at 90.84% over the last 90 days. It was at 90.00% a couple days ago.

The Missing GitHub Status Page

Historical GitHub uptime reconstructed from archived status data.

montroser

It has been pretty rough. Their own numbers report just a single `9` for Actions in Feb 2026 with 98% uptime. But that said -- I don't get the 90% number.

Anecdotally, it seems believable that Actions barfed 1 in 50 times (2%) in Feb. That is not very nice, but it wasn't 1 in 10 times (10%).

verdverm Mar 31

It looks like the aggregate stats are more of a Venn diagram than an average: if any one of the N services is down, the aggregate is considered down. I don't think this is an accurate way to calculate it. It should be weighted, or in some way show partial outages. This belief is derived from the Google SRE book, in particular chapters 3 (Embracing Risk) and 4 (Service Level Objectives):

https://sre.google/sre-book/embracing-risk/

https://sre.google/sre-book/service-level-objectives/

Google SRE - Embracing risk and reliability engineering book

Discover the concept of embracing risk in the context of service reliability and how to effectively utilize error budgets for a more resilient system.
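
To make the two ways of counting concrete, here is a minimal Python sketch. It is not how the status page actually computes its number; the service names, outage intervals, weights, and the 1000-minute window are all made up for illustration. `union_uptime` treats the platform as down whenever any service is down (the "Venn diagram" reading), while `weighted_uptime` averages per-service uptime using hypothetical importance weights.

```python
from typing import Dict, List, Tuple

# (start_minute, end_minute) of an outage, end exclusive
Interval = Tuple[int, int]


def merged_downtime(intervals: List[Interval]) -> int:
    """Minutes covered by at least one interval; overlaps are counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def union_uptime(outages: Dict[str, List[Interval]], window: int) -> float:
    """Uptime if the platform counts as down whenever ANY service is down."""
    everything = [iv for ivs in outages.values() for iv in ivs]
    return 1 - merged_downtime(everything) / window


def weighted_uptime(
    outages: Dict[str, List[Interval]], weights: Dict[str, float], window: int
) -> float:
    """Weighted average of per-service uptime."""
    total_weight = sum(weights[name] for name in outages)
    return sum(
        weights[name] * (1 - merged_downtime(ivs) / window)
        for name, ivs in outages.items()
    ) / total_weight


# Hypothetical data: a 1000-minute window and three invented services.
WINDOW = 1000
outages = {
    "actions": [(0, 20)],
    "pages": [(10, 60)],
    "git": [(500, 530)],
}
# Hypothetical weights; choosing them is exactly the contested part.
weights = {"actions": 2.0, "pages": 1.0, "git": 2.0}

print(f"union uptime:    {union_uptime(outages, WINDOW):.2%}")
print(f"weighted uptime: {weighted_uptime(outages, weights, WINDOW):.2%}")
```

With this invented data every individual service is at 95% or better, yet the union figure comes out lower than any of them because non-overlapping outages add up. The weighted figure depends entirely on the chosen weights, which is why the two camps in this thread can look at the same outages and report different numbers.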

mort96 Mar 31

I mean I think it's useful. It answers the question, "what percentage of the time can I rely on every part of GitHub to work correctly?". The answer seems to be roughly 90% of the time.

naniwaduni Mar 31

Nobody cares about every part of GitHub working correctly. I mean, ok, their SREs are supposed to, but tabling the question of whether that's true: if tomorrow they announced a distributed no-op service with 100% downtime, you should not have the intuition that the overall availability of the platform is now worse.

verdverm Mar 31

I don't use half of the services, so the answer is not straightforward.

https://mrshu.github.io/github-statuses/

ablob Mar 31

If you're using all services, then any partial outage is essentially a full outage.
Of course, you can massage the numbers to make them look nicer in the way you described, but the conservative approach is better for the customers.
If you insist, one could create this metric for selected services only to "better reflect users".

That being said, even when looking at the split uptimes, you'd have to do a very skewed weighting to achieve a number with more than one 9.

verdverm Mar 31

> That being said, even when looking at the split uptimes, you'd have to do a very skewed weighting to achieve a number with more than one 9.

It's definitely bad no matter how you slice the pie.

If GH pages is not serving content, my work is not blocked. (I don't use GH pages for anything personally)

marcosdumay Mar 31

That's how you count uptime. Your system is not up if it keeps failing when the user does something.

The problem here is the specification of what the system is. It's a bit unfair to call GH a single service, but it's how Microsoft sells it.

verdverm Mar 31

> That's how you count uptime.

It's not how I and many others calculate uptime. There is no uniformity, especially when you look at contracts.

formerly_proven Mar 31

In a nutshell, why would the consumer care (for the SLO) about how the vendor sliced the solution into microservices?

verdverm Mar 31

It will depend on the contract.

When I was at IBM, they didn't meet their SLOs for Watson, and customers got a refund for that portion of their spend.