In defense of GitHub's poor uptime
26 points by untitaker
My experience is simply that on a daily basis, something doesn't work, be that GitHub Actions, PRs, Issues, Git operations, Discussions, the API, etc. I use almost every part of GitHub, and I use it near constantly in some capacity (if I'm not browsing, a CI job is running, etc.). I use it enough that I am victim to some outage every single day.
I'm not being hyperbolic here. I mean it quite literally: something fails every single day in such a way that it is disruptive to my work ("oh, I guess I have to do something else.") GitHub is supposed to be a place to do work, and realistically, it isn't today.
If you're curious what it was today: we had a ~15 minute GitHub Actions outage right in the middle of prime time, while a group of maintainers and I were doing PR review, and it forced us to delay merging a set of PRs for most of the day (because by the time it came back up, some of us had to go).
For those saying, "why don't you just leave?" Yes, I'm working on it, but both things can be true: their uptime can be unacceptably bad AND customers should leave.
If GitHub services could be considered independent events, then sure. But it's more complicated than that. In a typical git workflow, I might use multiple services:
- git fetch (Git Operations)
- gh command line (API service) to check the state of a run (Actions)

When one of those things doesn't work, the whole workflow falls flat. I can't work on other tasks very easily while any step of that workflow is broken. So the reliability is an "AND" of all of those services, not an "OR". My interface with GitHub is not fragmented across services, and I honestly don't want to interact with GitHub in such an isolated way either.
So in the worst case, consider someone using all 10 of the services listed on their status page, which Microsoft should want as the ultimate vendor lock-in scenario. That's how you end up at zero 9s of reliability.
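To make the compounding concrete, here is a minimal sketch of how per-service availability multiplies when a workflow needs every service at once. The 99.9% per-service figure is an assumption for illustration, not GitHub's actual SLA, and it optimistically treats failures as independent:

```python
def combined_availability(per_service: float, n_services: int) -> float:
    """Availability of a workflow that requires ALL n services to be up,
    assuming (optimistically) independent failures."""
    return per_service ** n_services

# One service at "three 9s":
print(f"{combined_availability(0.999, 1):.4%}")   # 99.9000%
# A workflow spanning all 10 status-page services:
print(f"{combined_availability(0.999, 10):.4%}")  # ~99.0045%
```

Ten services at three 9s each already drop a "use everything" workflow to roughly two 9s, and correlated or longer outages erode it further.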
I'd also add that the effect of service downtime is not simple binary availability. Some services can go down without much effect, due to redundancy, backups, or just because it's not important. Other services are critical. "Riskiness" of a service going down can be thought of as a product of probability it goes down and severity of the impact when it does go down.
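The probability-times-severity framing above can be sketched as a toy expected-loss calculation; the services and numbers below are hypothetical:

```python
def risk(p_down: float, hours_lost_per_outage: float) -> float:
    """Expected impact: outage probability times severity when it happens."""
    return p_down * hours_lost_per_outage

services = {
    # (daily outage probability, hours of work lost per outage) -- made up
    "Actions (blocks merges)": (0.05, 4.0),
    "Discussions (can wait)":  (0.05, 0.2),
}

for name, (p, sev) in services.items():
    print(f"{name}: expected loss {risk(p, sev):.2f} h/day")
```

Two services with identical downtime probability can carry very different risk once severity is factored in, which is why raw uptime percentages alone don't tell you much.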
Software Engineering as an industry could benefit from a better understanding of risk management, bayesian statistics, and appropriate mitigations. I took a stab at explaining this in a developer-friendly way: https://ashwinsundar.com/posts/risk-analysis-software/
If those were largely independent services, I think this argument would be stronger. But using GitHub frequently means using many of its components at once. That GitHub provides many different features is some consolation, I guess, but failures in many of those components end up having more or less the same negative effect on me.
Also, people only harp on this because of how Microsoft is cramming stuff down our throats, and it looks like this is happening at the same time that one of their services, used by most of us, is being quite annoying.
The truth is, I remember being annoyed by GitHub instability as early as 2017, perhaps earlier. At work, we switched from an internally managed Bitbucket with virtually 100% uptime to seeing unicorns constantly. GitHub has always had bad streaks of reliability.
I believe it's bad to be mad at something for the wrong reasons even if it is bad for other reasons. So yeah, this defense is right. Hate GitHub because they have become an entrenched monopoly with terrible network effects; secondarily because at some point they had a quite good real-time UI that is slowly getting worse and worse (while they seem to be adopting web development practices I dislike).
I wonder how much LLMs have increased the load on GitHub, both from the direct Copilot integration and from the sheer volume of increased activity by LLM agents. Imagine how many more issues have been created or commits made by LLMs, or repositories created by people who can now, with LLM assistance, put something online, themselves generating further LLM activity. We've seen plenty of other sites have to drastically scale their operations and/or introduce proxies such as Anubis to manage load, and that's "only" for scraping. This part amusingly looks sort of self-inflicted.
The reality is somewhere in the middle. Different tasks require different subsets of GitHub's features. If any feature from a subset you actively need is down, you will be unable to complete your whole task.
Doesn't do much to defend the drop from 99.99%+ uptime to whatever it is now, shortly after the MS acquisition. What changed? Reporting? Practices? Something else?
That's the useful question to ask.
This articulates a nuance I hadn't thought of, thanks so much, @evanhahn!
I appreciate the nuance as well as this not being full GitHub propaganda.
Because of the muddled writing, I can't really tell which definition of uptime the author is using, but that's fine; in the wider industry, too, very few people can productively think and communicate about these things.
The way some Datadog SLOs calculate this: for any time slice, if any of the entities underneath are down, that time slice is lost. I think that's fair, and then it mostly depends on which entities are grouped together.
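That time-slice rule can be sketched in a few lines; the entity names and slice data below are made up for illustration:

```python
def slo(slices: list[dict[str, bool]]) -> float:
    """Fraction of time slices in which ALL grouped entities were up.
    A slice with any entity down counts as fully lost."""
    good = sum(1 for s in slices if all(s.values()))
    return good / len(slices)

history = [
    {"git": True,  "api": True,  "actions": True},
    {"git": True,  "api": True,  "actions": False},  # one entity down: slice lost
    {"git": True,  "api": False, "actions": False},
    {"git": True,  "api": True,  "actions": True},
]
print(f"SLO: {slo(history):.0%}")  # → 50%
```

How harsh the number looks depends entirely on how many entities are grouped into one SLO, which is the grouping point made above.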
To this day I can't comprehend why the home of all major open source projects is proprietary; many of these issues could already be solved or nonexistent. To me, Tangled is very close to what I'd like GitHub to be, if you ignore the "social coding" aspect: you own your own profile data, self-host Git repos and CI runners that you can easily move between instances, or spin up your own frontend in case of failure. That said, I will wait for it to mature enough and drop its mostly hard dependency on Nix(OS).