Git's HTTP server side design does not scale
31 points by gmem
Serving git fetches is fundamentally quite expensive because the server has to generate a packfile. Concurrent fetches and big repos need lots of RAM. (Full clones should be easier, since the server ought to be able to sendfile a pregenerated packfile.) A multithreaded server might be able to reduce the cost of the preliminary faff of checking the request for plausibility, but once you get to the point of generating a packfile, CGI overhead is not a large percentage of the cost.
I regularly wonder why the dumb protocol isn't more popular, and why it has even been deprecated.
Seems that for at least some common use cases, it should be good enough and scaling it reduces to scaling static file fetches.
With some smart organization of the packfiles, e.g. daily or weekly packfiles (maybe similar to a log-structured merge-tree), it should scale nicely.
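For the read-only case, keeping a repo servable over dumb HTTP boils down to two commands plus any static file server. A minimal sketch (the paths are made up):

    # Run after every push, e.g. from the stock post-update hook, so the
    # dumb protocol's plain GETs can find everything:
    cd /srv/git/myrepo.git       # bare repo, path is just an example
    git repack -a -d             # consolidate loose objects into packfiles
    git update-server-info       # regenerate info/refs and objects/info/packs

    # Any static file server can then serve /srv/git read-only, and clients
    # clone with: git clone https://example.com/myrepo.git

Scaling then really does reduce to scaling static file fetches, at the cost of clients downloading whole packfiles instead of one tailored to their request.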
the fetch process is a series of HTTP GET requests, where the client can assume the layout of the Git repository on the server.
this sounds pretty miserable for server traffic in the probably overwhelmingly common case of "someone does a git clone with no arguments, leading to a full clone"
The fixed layout is not too bothersome but the client-side discovery process leads to a ton of requests, and high network utilisation especially for subsequent fetches.
When doing a clone one might expect the packfile to be cached, since every client is getting the same thing (until an update to the repo invalidates the cached pack). I'm pretty sure GitHub does something like that; I remember seeing different enumeration/compression behaviour (speed) between a straight clone and a more complicated clone request on big repos.
It depends on factors such as the set of objects that the client already has.
You can cache the packfile. You can also rely on the advertised bundle clone to offload the majority of the downloads to a CDN.
The problem isn't with git per se. It's that people aren't aware of the latest features in git and the best practices for scaling a git server. GitLab's Gitaly does a really good job at this while staying open source. GitHub scales just as well, but they are closed source, so it's hard to speculate about how they do things.
Is that “can” in theory, or are there real implementations of git-upload-pack that cache packfiles? I can’t find any documentation for how a git server can transparently redirect a client to use a bundle instead of a normal fetch. Do you have any pointers?
Yeah, look inside Gitaly's packfile cache. It was implemented in GitLab a long time ago to help with scaling CI fetches.
Do you have a link to the relevant documentation? I’m interested to know how I could deploy it on my own basic git server.
https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/design_pack_objects_cache.md?ref_type=heads
I would just run Gitaly / GitLab CE (I was a Gitaly contributor years ago)
Edit: and this is a blog post explaining the second feature i mentioned: bundle clone. https://blog.gitbutler.com/going-down-the-rabbit-hole-of-gits-new-bundle-uri
Again, GitLab and GitHub are able to scale up their infrastructure to handle thousands of CI workers fetching the same repo all at once. I think most folks just aren't aware of the latest features in git and all the hooks which power these customizations.
My basic git server is basic because it doesn’t run anything like GitLab :-) I understand gitaly is only useful as part of GitLab, not for normal git servers?
I saw the Gitbutler blog post and I looked at the git bundle URI design draft, but as far as I can tell the implementation is incomplete: as I wrote before, I can’t find the documentation that explains how to configure git’s server-side components to advertise bundle URIs so that clients use bundles automatically.
git config says:
The bundle.* keys may appear in a bundle list file found via the git clone --bundle-uri option. These keys currently have no effect if placed in a repository config file, though this will change in the future.
So I guess what will be added in the future is a setting to configure a repo’s bundle list file or URI to be advertised by the server, or it’ll extract a bundle list from the config.
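For reference, a bundle list is just a config-format file; here's a sketch of one (the hostnames and bundle names are made up), along with the explicit client-side invocation that works today:

    # bundle-list, fetched by the client via --bundle-uri
    [bundle]
        version = 1
        mode = all
    [bundle "base"]
        uri = https://cdn.example.com/myrepo/base.bundle
    [bundle "recent"]
        uri = https://cdn.example.com/myrepo/recent.bundle

    # Client side, no server advertisement needed:
    git clone --bundle-uri=https://cdn.example.com/myrepo/bundle-list \
        https://git.example.com/myrepo.git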
I am not entirely sure what to suggest to users of Anubis that serve git repositories with git-http-backend. My SRE instinct is that the entire model of using fork/exec with CGI is fundamentally broken and the git-http-backend service needs to be its own HTTP server that can run persistently and concurrently handle requests, but that is not something that can be slapped together instantly.
This is the trivially cheap part of serving git. Computing the pack file can take seconds to minutes.
Computing the pack file can take seconds to minutes
Not true. Given a large repo and a client with a minimal delta, a packfile can easily take 40-50 minutes to compute and eat up a ton of RAM and CPU in the process. We are talking about tens of GBs of compressed data for a perfectly average monorepo at a tech company less than 15 years old.
There are techniques to speed that up, but you are way underestimating the cost here.
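For what it's worth, one of the standard mitigations is reachability bitmaps, which mostly cut the object-enumeration phase rather than the delta search. A server-side sketch, on a bare repo:

    # Keep bitmaps up to date on future repacks:
    git config repack.writeBitmaps true
    # Repack everything once and write a bitmap index now:
    git repack -a -d --write-bitmap-index
    # Let delta compression use more threads (0 would mean auto-detect):
    git config pack.threads 4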
I don't fully understand the story's question.
I thought with Anubis, only real users hit the web server, so I guess the question is about scalability with respect to "real" users?
I think there are a couple of approaches; I don't think there's a lot documented about their effective performance, but:
You can still use the git daemon or the ssh protocols, which might be more efficient.
They are basically the same as or worse than CGI: fork per connection, then run git-upload-pack or whatever the client asks for. An SSH connection is much more expensive than a CGI request.
Oh, dang, I thought the CGI required multiple processes per clone, but it seems it doesn't.
Then, yes, scratch those. Bundles or dumb HTTP :-p
Even when pushing/fetching via SSH, the current git server implementation runs a bunch of separate processes. Git's codebase is pretty ugly, using lots of global variables and calling exit on error instead of doing proper memory cleanup, so subcommands like receive-pack and upload-pack have to launch lower-level subcommands like unpack-objects and pack-objects as separate processes. It'd definitely be nicer if these were just function calls, but as others mentioned here, this probably isn't the main reason they were having a rough time serving all those git clones.
Back in the day, CGI “scaled” with some version of preforking (reusable!) processes and using something like FastCGI or SCGI to turn environment variables into framed requests. This is very similar to the way Functions as a Service scale in practice, which I find amusing.
Additionally, for plain ole CGI, you’d set Apache up to fork at most N processes for CGI requests, and queue (with timeout!) until they could be started.
As in most cases for scaling Internet services, the way to start is to limit the amount of expensive parallel work being attempted, and fill the waiting time with concurrent operations. If git clone is expensive, because fork is the only way, aggressively rate limit that operation.
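As a concrete sketch of that kind of limiting (nginx syntax; the numbers and zone names are arbitrary):

    # In the http{} block: track per-IP concurrency and request rate.
    limit_conn_zone $binary_remote_addr zone=perip:10m;
    limit_req_zone  $binary_remote_addr zone=gitpack:10m rate=2r/m;

    # In the server{} block: only the expensive pack-generating endpoint
    # gets throttled; the ref advertisement stays cheap and unthrottled.
    location ~ /git-upload-pack$ {
        limit_conn perip 2;              # at most 2 in-flight per client IP
        limit_req  zone=gitpack burst=3; # slow refill, small queue on top
        # ... pass to the git backend as usual ...
    }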
I still serve my private git repositories today with this technique, using nginx to terminate SSL and proxying to FastCGI, which is configured with enough pre-spawned processes as needed.
As said by other commenters, any slowness comes from the various pack operations, which have to happen regardless of how many pre-forked git-http processes are there.
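For reference, the relevant nginx bits look roughly like this, assuming fcgiwrap as the pre-spawned FastCGI worker pool (the socket location and paths vary by distro and are assumptions here):

    location ~ ^/git(/.*)$ {
        include       fastcgi_params;
        fastcgi_pass  unix:/run/fcgiwrap.socket;   # fcgiwrap's worker pool
        fastcgi_param SCRIPT_FILENAME     /usr/lib/git-core/git-http-backend;
        fastcgi_param GIT_PROJECT_ROOT    /srv/git;
        fastcgi_param GIT_HTTP_EXPORT_ALL "1";     # export repos without git-daemon-export-ok
        fastcgi_param PATH_INFO           $1;
    }

fcgiwrap (or spawn-fcgi, etc.) is started separately with a fixed number of workers, which is what puts the upper bound on concurrent git-http-backend processes.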
any slowness comes from the various pack operations, which have to happen regardless of how many pre-forked git-http processes are there.
Sure. But my point is that you can’t expect an unbounded number of processes to spawn to handle this, and expect to survive.
As someone who has experienced AI scrapers attempting to download every file of every commit of various Git repos via the forge's single-file web view capability (which would add up to many millions of requests), I would actually be very happy if those lazy bastards would change their approach and just clone each repo instead.
IMO, git should implement a server side option to force clients to use git bundles for clone. You can then serve a snapshot of the repository and let clients top up. This will make it more resource friendly for git forges.
There is already some support for that: https://blog.gitbutler.com/going-down-the-rabbit-hole-of-gits-new-bundle-uri
Currently it needs a client-side option enabled to use it (so it's more a suggestion than a force), and the standard git server only implements enough to read a bundle URI from the config (GitLab has some more support behind an experimental option).
But yes, this is particularly great for CI where it should be possible to arrange pre-caching of bundles or similar.
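A sketch of that pre-caching pattern (repo names and paths are made up), plus what I believe is the client-side opt-in for server-advertised bundle URIs:

    # On a cron or post-receive job: snapshot the repo into a bundle that a
    # CDN or cache can serve as a plain file:
    git -C /srv/git/myrepo.git bundle create /var/cache/myrepo.bundle --all

    # In the CI job: seed the clone from the cached bundle, then top up the
    # last few commits from the real remote:
    git clone /var/cache/myrepo.bundle myrepo
    git -C myrepo remote set-url origin https://git.example.com/myrepo.git
    git -C myrepo fetch origin

    # Client-side opt-in so clones will use bundle URIs a server advertises:
    git config --global transfer.bundleURI true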
Other users already mentioned that CGI and process forking aren't really the culprit here, but I really like this comment on CGI performance: https://lobste.rs/c/lty5zb
The way I remember it, git's HTTP protocol was meant to be a last-resort for hostile firewall situations, and a sop to svn users ("look, migration is easy, you can drop the CGI into the same Apache you're already using to run mod_dav"). The expectation was that you would be serving anonymous read-only access over git:// and authenticated access over SSH. So it didn't matter that it wasn't using the fanciest high-performance tricks. CGI was perfect because it was drop-in.
But that was before "supply chain security" was really on anyone's mind, and git:// is trivially MITMed if you're in a position to pull off things like that. Using HTTPS instead at least gives you a modicum of assurance that you're talking to someone named github.com or whatever. So now no one respectable even listens on the git:// port.