Mastodawn

todsacerdoti

GitHub's Historic Uptime

https://damrnelson.github.io/github-historical-uptime/

Historical GitHub Uptime Charts

View GitHub's monthly uptime between 2016 and 2026.

phillipcarter Mar 31

FWIW if people are looking for a reason why, here's why I think it's happening: https://thenewstack.io/github-will-prioritize-migrating-to-a...

GitHub Will Prioritize Migrating to Azure Over Feature Development

GitHub is working on migrating all of its infrastructure to Azure, even though this means it'll have to delay some feature development.

The New Stack

nmaleki Mar 31

You'd think they'd do all the testing elsewhere and use a much shorter window of time to implement Azure after testing. I don't think this fully explains over 6 years of poor uptime.

hadlock Mar 31

The fact that even they struggle with github actions is a real testament to the fact that nobody wants to host their own CD workers.

esseph Mar 31

> The fact that even they struggle with github actions is a real testament to the fact that nobody wants to host their own CD workers.

What a weird takeaway

phillipcarter Mar 31

It certainly explains the issues _now_, IMO.

llama052 Mar 31

It's absolutely this. Our Azure outages correlate heavily with Github outages. It's almost a meme for us at this point.

mholt Mar 31

Even better IMO is this status page: https://mrshu.github.io/github-statuses/

"The Missing GitHub Status Page" with overall aggregate percentages. Currently at 90.84% over the last 90 days. It was at 90.00% a couple days ago.

The Missing GitHub Status Page

Historical GitHub uptime reconstructed from archived status data.

skipants Mar 31

These are two pages telling two different things, albeit with the same stats. The information is presented by OP in a way to show the results of the Microsoft acquisition.

montroser Mar 31

It has been pretty rough. Their own numbers report just a single `9` for Actions in Feb 2026 with 98% uptime. But that said -- I don't get the 90% number.

Anecdotally, it seems believable that Actions barfed 1 in 50 times (2%) in Feb. That's not very nice, but it wasn't 1 in 10 times (10%).

verdverm Mar 31

It looks like the aggregate stats are more of a Venn diagram than an average: if any one of the N services is down, the aggregate is considered down. I don't think this is an accurate way to calculate it. It should be weighted, or in some way show partial outages. This belief is derived from the Google SRE book, in particular chapters 3 (embracing risk) and 4 (service level objectives).

https://sre.google/sre-book/embracing-risk/

https://sre.google/sre-book/service-level-objectives/

Google SRE - Embracing risk and reliability engineering book

Discover the concept of embracing risk in the context of service reliability and how to effectively utilize error budgets for a more resilient system.
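
A minimal sketch of the distinction being drawn here, using made-up data rather than anything from either status page. The service names, the ten time slots, and the weights are all assumptions chosen for illustration; the point is only that "every service up" and "weighted average of services" can give very different numbers from the same inputs.

```python
from typing import Dict, List

# Hypothetical per-slot status: True means the service was up in that slot.
Status = Dict[str, List[bool]]


def per_service_uptime(status: Status) -> Dict[str, float]:
    """Fraction of slots in which each service was up."""
    return {name: sum(slots) / len(slots) for name, slots in status.items()}


def all_up_uptime(status: Status) -> float:
    """Fraction of slots in which every service was up at the same time."""
    slot_results = [all(column) for column in zip(*status.values())]
    return sum(slot_results) / len(slot_results)


def weighted_uptime(status: Status, weights: Dict[str, float]) -> float:
    """Weighted mean of per-service uptime (weights are an assumption,
    for example each service's share of your own usage)."""
    uptimes = per_service_uptime(status)
    total_weight = sum(weights[name] for name in uptimes)
    return sum(uptimes[name] * weights[name] for name in uptimes) / total_weight


# Made-up data: ten slots, three services, outages that never overlap.
example: Status = {
    "git": [True] * 9 + [False],
    "actions": [False, False] + [True] * 8,
    "pages": [True] * 4 + [False] + [True] * 5,
}

if __name__ == "__main__":
    print(per_service_uptime(example))
    print(all_up_uptime(example))
    print(weighted_uptime(example, {"git": 0.6, "actions": 0.3, "pages": 0.1}))
```

In this made-up data every service is up at least 80% of the time, yet all three are up together in only 6 of the 10 slots, because the outages do not overlap.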

mort96 Mar 31

I mean I think it's useful. It answers the question, "what percentage of the time can I rely on every part of GitHub to work correctly?". The answer seems to be roughly 90% of the time.

naniwaduni Mar 31

Nobody cares about every part of GitHub working correctly. I mean, ok, their SREs are supposed to, but tabling the question of whether that's true: if tomorrow they announced a distributed no-op service with 100% downtime, you should not have the intuition that the overall availability of the platform is now worse.

verdverm Mar 31

I don't use half of the services, so the answer is not straightforward.

https://mrshu.github.io/github-statuses/

ablob Mar 31

If you're using all services, then any partial outage is essentially a full outage.
Of course, you can massage the numbers to make them look nicer in the way you described, but the conservative approach is better for the customers.
If you insist, one could create this metric for selected services only to "better reflect users".

That being said, even when looking at the split uptimes, you'd have to do a very skewed weighting to achieve a number with more than one 9.

verdverm Mar 31

> That being said, even when looking at the split uptimes, you'd have to do a very skewed weighting to achieve a number with more than one 9.

It's definitely bad no matter how you slice the pie.

If GH pages is not serving content, my work is not blocked. (I don't use GH pages for anything personally)

marcosdumay Mar 31

That's how you count uptime. Your system is not up if it keeps failing when the user does something.

The problem here is the specification of what the system is. It's a bit unfair to call GH a single service, but it's how Microsoft sells it.

verdverm Mar 31

> That's how you count uptime.

It's not how I and many others calculate uptime. There is no uniformity, especially when you look at contracts.

formerly_proven Mar 31

In a nutshell, why would the consumer care (for the SLO) about how the vendor sliced the solution into microservices?

verdverm Mar 31

It will depend on the contract.

When I was at IBM, they didn't meet their SLOs for Watson, and customers got a refund for that portion of their spend.

fontain Mar 31

An aggregate number like that doesn’t seem to be a reasonable measure. Should OpenAI models being unavailable in Copilot because OpenAI has an outage be considered GitHub “downtime”?

fwip Mar 31

I think reasonable people can disagree on this.

From the point of view of an individual developer, it may be "fraction of tasks affected by downtime" - which would lie between the average and the aggregate, as many tasks use multiple (but not all) features.

But if you take the point of view of a customer, it might not matter as much 'which' part is broken. To use a bad analogy, if my car is in the shop 10% of the time, it's not much comfort if each individual component is only broken 0.1% of the time.
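
To put rough numbers on the car analogy, here is a small sketch. It assumes component failures are independent, which is an assumption of this example and not something claimed in the thread; correlated outages would change the result. The function names are made up for illustration.

```python
import math


def combined_downtime(per_component_downtime: float, components: int) -> float:
    """Probability that at least one of `components` independent parts is down."""
    return 1.0 - (1.0 - per_component_downtime) ** components


def components_needed(per_component_downtime: float, target_downtime: float) -> int:
    """Smallest number of independent parts whose combined downtime
    reaches `target_downtime`."""
    ratio = math.log(1.0 - target_downtime) / math.log(1.0 - per_component_downtime)
    return math.ceil(ratio)


if __name__ == "__main__":
    # Parts that are each broken 0.1% of the time, car in the shop 10% of the time.
    print(components_needed(0.001, 0.10))
    print(combined_downtime(0.001, 100))
```

Under that independence assumption it takes on the order of a hundred components at 0.1% downtime each before the whole is down 10% of the time, so the gap between per-component and aggregate numbers grows with how finely the product is sliced.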

remus Mar 31

> But if you take the point of view of a customer, it might not matter as much 'which' part is broken. To use a bad analogy, if my car is in the shop 10% of the time, it's not much comfort if each individual component is only broken 0.1% of the time.

Not to go too far out of my way to defend GH's uptime, because it's obviously pretty patchy, but I think this is a bad analogy. Most customers won't have a hard reliance on every user-facing GH feature. Or to put it another way, there's only going to be a tiny fraction of users who actually experienced something like the 90% uptime reported by the site. Most people in practice are probably experiencing something like 97-98%.

fwip Mar 31

Sorry, by 'customer' I meant to say something like a large corporate customer - you're buying the whole package, and across your org, you're likely to be a little affected by even minor outages of niche services.

But yeah, totally agree that at the individual level, the observed reliability is between 90% and 99%, and probably toward the upper end of that range.

mememememememo Mar 31

Or if your kettle is not working the house is considered not working?

Polizeiposaune Mar 31

I've been on a flight that was late leaving the gate because the coffeemaker wasn't working.

wang_li Mar 31

A better analogy is if one bulb in the right rear brake light group is burnt out. Technically the car is broken. But realistically you will be able to do all the things you want to do unless the thing you want to do is measure that all the bulbs in your brake lights are working.

Dylan16807 Mar 31

That's an awful analogy because "realistically you will be able to do all the things you want to do". If a random GitHub service goes down there's a significant chance it breaks your workflow. It's not always but it's far from zero.

One bulb in the cluster going out is like a single server at GitHub going down, not a whole service.

mort96 Mar 31

As long as they brand it as a part of GitHub by calling it "GitHub Copilot" and integrate it into the GitHub UI, I think it's fair game.

mememememememo Mar 31

What is Google's uptime (including every single little thing with Google in the name)?

mort96 Mar 31

I don't think that's a fair comparison. Google Maps, Google Calendar, Google Drive, Google Search, Google Chrome, Google Ads, etc. are all clearly completely different products which have very little to do with each other; they're just made by the same company called Google.

GitHub is a different situation. There's one "thing" users interact with, github.com, and it does a bunch of related things. Git operations, web hooks, the GitHub API (and thus their CLI tool), issues, pull requests, Actions; it's all part of the one product users think of as "GitHub", even if they happen to be implemented as different services which can fail separately.

EDIT: To illustrate the analogy: Google Code, Google Search and Google Drive are to Google what Microsoft GitHub, Microsoft Bing and Microsoft SharePoint are to Microsoft.

Kaliboy Mar 31

Completely agree. It makes it worse, actually, as Github's secondary functions, so to speak, are things we implicitly rely on.

When I merge to master I expect a deploy to follow. This goes through git, webhooks and actions. Especially the latter two can fail silently if you haven't invested time in observation tools.

If maps is down I notice it and immediately can pivot. No such option with Github.

dogma1138 Mar 31

It depends. For example, I would consider Google Drive uptime as part of, say, Google Docs’ overall uptime. If I can’t access my stored documents or save a document I’ve been working on for the past 3 hours because Drive is down, I would be very pissed and wouldn’t care whether it’s Drive or Docs that is the problem underneath: I still can’t use Google Docs as a service at that point.

shrinks99 Mar 31

I got Claude to make me the exact same graph a few weeks ago! I had hypothesized that we'd see a sharp drop off, instead what I found (as this project also shows) is a rather messy average trend of outages that has been going on for some time.

The graph being all nice before the Microsoft acquisition is a fun narrative, until you realize that some products (like actions, announced on October 16th, 2018) didn't exist and therefore had no outages. Easy to correct for by setting up start dates, but not done here. For the rest that did exist (API requests, Git ops, pages, etc) I figured they could just as easily be explained with GitHub improving their observability.
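
A rough sketch of the start-date correction mentioned above, under stated assumptions. The monthly figures are made up, the function names are hypothetical, and nothing here reflects how either linked project is actually implemented. The only real date used is the October 16th, 2018 announcement date cited in this comment.

```python
from datetime import date
from typing import List, Optional, Tuple

# (first day of month, uptime fraction for that month)
MonthlyUptime = List[Tuple[date, float]]


def naive_uptime(samples: MonthlyUptime) -> float:
    """Mean over every month, including months before the service existed."""
    return sum(uptime for _, uptime in samples) / len(samples)


def uptime_since_launch(samples: MonthlyUptime, launch: date) -> Optional[float]:
    """Mean over months on or after the launch month only.

    Returns None when no month qualifies, rather than a misleading 100%.
    """
    first_month = launch.replace(day=1)
    in_range = [uptime for month, uptime in samples if month >= first_month]
    if not in_range:
        return None
    return sum(in_range) / len(in_range)


# Made-up numbers: two pre-launch months recorded as 100% because no data existed.
actions_samples: MonthlyUptime = [
    (date(2018, 8, 1), 1.0),
    (date(2018, 9, 1), 1.0),
    (date(2018, 10, 1), 0.98),
    (date(2018, 11, 1), 0.96),
]

if __name__ == "__main__":
    print(naive_uptime(actions_samples))
    print(uptime_since_launch(actions_samples, date(2018, 10, 16)))
```

Counting the launch month as a full month is itself a simplification; a stricter version would prorate it or start from the following month.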

irishcoffee Mar 31

Github actions needs to go away. Git, in the linux mantra, is a tool written to do one job very well. Productizing it, bolting shit onto the sides of it, and making it more than it should be was/is a giant mistake.

The whole "just because we could doesn't mean we should" quote applies here.

psini Mar 31

But GitHub actions is not Git?

lcnPylGDnU4H9OF Mar 31

The same philosophy would suggest that running some other command immediately following a particular (successful) git command is fine; it is composing relatively simple programs into a greater system. Other than the common security pitfalls of the former, said philosophy has no issue with using (for example) Jenkins instead of Actions.

irishcoffee Mar 31

Did you click the link?

padjo Mar 31

It feels like they launched actions and it quickly turned out to be an operations and availability nightmare. Since then, they've been firefighting, and now the problems have spread to previously stable things like issues and PRs.

deepsun Mar 31

They rushed to launch Actions because GitLab launched them before.

BTW, GitLab called it "CI/CD" just as a navigation section on their dashboard, and that name spread outside as well, despite being weird. Weird names are easier to remember and associate with specific meaning, instead of generic characterless "Actions".

fishtoaster Mar 31

Is the pre-2018 data actually accurate? There seem to have been a number of outages before then: https://hn.algolia.com/?dateEnd=1545696000&dateRange=custom&...

Maybe that's just the date when they started tracking uptime using this system?

OlivOnTech Mar 31

Data comes from the official status page. It may be more a marketing/communication page than an observability page (especially before selling)

pikzel Mar 31

The status page was often down when GH was down, back in the day.

xiaoyu2006 Mar 31

Aha, we need a status page for the status page.

hk__2 Mar 31

It’s biased to show this without the dates at which features were introduced. A lot of the downtimes in the breakdown are GitHub Actions, which launched in August 2019; so yeah, what a surprise, there was no Actions downtime before because Actions didn’t exist.

cuu508 Mar 31

You can click on "Breakdown" and then on "Actions" to hide it.

mbauman Mar 31

Even worse, those features show "100% uptime" pre-existence on the breakdowns page too.

siruwastaken Mar 31

This is the really questionable part of the graphic. It seems that missing data before 2018 was just treated as 100% uptime (which is hardly historically accurate).

voxic11 Mar 31

Check the breakdown page. Yes, the magnitude is obviously reduced for individual services, but they all show the same trend.