Package managers keep using git as a database, it never works out
141 points by calvin
This seems kind of reasonable to me? "Do stuff that doesn't scale" is common advice for things just starting out. If all these projects had started out by building a system that scales to where they are now before release, A) would it be good? and B) would they have gotten off the ground?
Yeah, tbh "watch out, you might have the problems Rust has" seems like a great situation for any upcoming language to aspire to.
Sure, but for something like Nixpkgs it seems rather naive to start out with git. Hell, even CVS was somewhat painful with pkgsrc, and there you're not even checking out the entire history of everything that ever happened.
Well, it did not start out with git.
It started with SVN, before git existed. Indeed, narrow checkouts (only a subdirectory) were nice and a loss when moving to git. But Nixpkgs was already super useful when it was more than a literal decimal order of magnitude less comprehensive than it is now, and there was a stage where vendoring bootstrap binaries into the repo seemed to make sense.
And of course development needs of Nixpkgs explicitly override non-developer usage of Nixpkgs in case of conflicting considerations…
I've contributed to nixpkgs (once or twice), and as a "developer" I feel it's a damn pain to clone and update the repo. svn was way nicer for that kind of stuff too. But of course you get no PR workflow, local branches etc
You don't get local branches unless you use either SVK or git-svn, you want to say.
I think the most critical complaint against SVN was being janky with merge conflict resolutions in workflows like «merge master into staging periodically, merge staging into master less often but also periodically». Of course, this has nothing to do with narrow checkouts for PRs; maybe Git will even get this eventually…
the most critical complaint against SVN
I had a few complaints about svn and as a result I mostly went straight from cvs to git for my own uses :-)
In the early years, svn’s repo storage was unstable, both in the sense that it was unreliable and corrupted itself, and in the sense that there were several incompatible changes while they fixed the bugs.
For me svn was much slower than cvs. My primary experience of svn was the Apache and FreeBSD repos which are an ocean and a continent away from me; svn has a very chatty protocol without enough pipelining so it handles latency very badly.
Cheap branches are not much use without easy merges, and svn didn’t learn how to merge until years after git came along (iirc, 2005 vs 2008) by which time git was becoming usable by mere mortals.
On the other hand, svn’s selling points were not compelling to me. I had tooling around cvs – partly FreeBSD’s cvs admin scripts, partly my cvslog script – which emulated atomic commits, so svn’s main selling point fell flat. In particular my cvslog script had been formatting logs in a similar manner to git since about 1998 :-)
I think it was around 2007 when I evaluated what I should use as the VCS. I clearly wanted a DVCS by then, and after evaluating the quality of the «world model» of the various VCSes it became absolutely clear that Monotone had the best mental-model design, and Git kind of tries to avoid having any… I think I used Bazaar for Windows-heavy tasks.
It is unfortunate that Monotone fell out of use and thus seems to be updated only for new Botan versions (not that any vulnerabilities touch such an old part of Botan), with little prospect of migrating off SHA-1… although for an unpopular DVCS that is no longer much of a risk, as an unknown source of Monotone commits without any kind of track record would be suspicious.
Yeah, to me it's less "don't use git" and more "look at all these successful projects that benefited from the power that git provides and then hit a scaling point where they needed to start building layers over that".
This article reads like a litany of victories more than an indictment really.
Name me a better database that could've gotten these projects to this point.
I'm sometimes tempted about keeping an SQLite database as an SQL dump inside a git repository.
It would allow versioning while also making it possible to fetch only the most recent dump.
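To make that concrete, here is a minimal Python sketch of the "SQL dump in a git repo" idea; the file names and repo layout are hypothetical:

```python
# Hypothetical sketch: version an SQLite database in git as a text dump.
import sqlite3
import subprocess

DB_PATH = "packages.db"     # working database, not committed
DUMP_PATH = "packages.sql"  # text dump that actually lives in the repo

def dump_and_commit(message: str) -> None:
    con = sqlite3.connect(DB_PATH)
    with open(DUMP_PATH, "w") as f:
        for statement in con.iterdump():  # schema + data as SQL statements
            f.write(statement + "\n")
    con.close()
    subprocess.run(["git", "add", DUMP_PATH], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

def restore_latest() -> sqlite3.Connection:
    # Consumers only need the current dump (e.g. from a shallow clone),
    # not the full history.
    con = sqlite3.connect(":memory:")
    with open(DUMP_PATH) as f:
        con.executescript(f.read())
    return con
```

A shallow clone (git clone --depth 1) then gives consumers just the latest dump, while maintainers keep the full history.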
Sure, but there's nothing wrong with using a more efficient package database from the beginning, if it completely avoids all the growing pains discussed in OP. No database is perfect of course, but for example an SQLite metadata store with Litestream VFS to sync updates directly from blob storage: https://litestream.io/guides/vfs/
This sounds like a distribution solution, you often still need something in addition to it as a development solution… and if so, why not have what you need anyway (development DVCS), and solve scaling when it ends up mattering?
Git can still be used to push new releases for ingestion into the database. We don't have to give up one or the other in the first place. Why not solve both from the beginning?
Because you have problems you actually will have in the next five years to spend effort on, instead of a completely orthogonal deployment problem to become relevant a decade later? Because there are things that are sufficiently in flux that you prefer to do the first few iterations on them with a single layout to change, not multiple?
If that's the attitude, then this is a solved problem: just pick an existing, efficient package and repository format, e.g. rpm and DNF, which uses libsolv for efficient dependency resolution. Solve the original engineering problems you have now instead of orthogonal deployment problems that will be relevant later.
First, dependency resolution logic is part of the problem being solved, unlike mass deployment.
Second, integrating with RPM is another thing to solve, and adding a new compatibility problem is much worse for initial work than adding an efficiency problem.
dependency resolution logic is a part of the problem being solved
Why though? People have already solved it. Why reinvent the wheel? What are the chances that we will do it better than this? https://github.com/openSUSE/libsolv
integrating with RPM is another thing to solve
How is that? It's just using the existing tools that are already standard in Linux: rpm or apt or apk etc. These are not novel problems, package repositories and package management tooling is one of the fundamentals of Linux.
What are the chances that we will do it better than this
Approximately 100%, if we were going to do any work on any of the constraints under which it operates. Like reducing the effort needed to have multiple versions usable by different parts of the system as long as they are not in the same address space.
It's just using the existing tools that are already standard in Linux: rpm or apt or apk etc.
Except you now need to generate things in a way those tools like, and wait until you hit one of their assumptions (like the assumption of managing /).
Is it hard to build a package manager without using Git?
Can you think of another database that 100% of your users already know how to use, has an existing UI for submitting changes, and that Microsoft will pay the hosting bill for?
I just find it bizarre that Cargo needed to download the "index" for the whole registry, and not only for the packages I actually use. Seems obviously very inefficient? As for the bill argument, I think you could probably host packages, or any other data, on Cloudflare for very cheap.
Cloudflare don't host non-web content for free, and I'd be surprised if your system lasted particularly long.
Unless...
What if the canonical format was HTML over the wire? I always appreciated (and still do) the way debian repositories can be investigated with a web browser.
They do sponsor OSS projects, so it wouldn't hurt to ask. Eg: https://blog.cloudflare.com/supporting-the-future-of-the-open-web/
"Cloudflare's support of Omarchy has ensured we have the fastest ISO and package delivery from wherever you are in the world. Without a need to manually configure mirrors or deal with torrents. The combo of a super CDN, great R2 storage, and the best DDoS shield in the business has been a huge help for the project."
– David Heinemeier Hansson, Creator of Omarchy and Ruby on Rails
Not really but it kinda depends on your goals.
For instance, Ruby only pulls the dependency graph (from an API) and it's like a dozen MBs. That's because package installation is very standardized and for most packages it's just "unzip to this directory".
On the other hand you have Gentoo Portage. Their main repo is like a GB in size and packages contain not only dependencies but also metadata (licenses, homepage URL, etc.), build instructions, patches and additional files like distribution-specific configs, and portage-specific metadata (ebuild and source package hashes). Gentoo uses git for the repo. It used to use rsync but it was still basically the same: the delta is resolved and a partial update is downloaded.
So in general, in my mind, the main aspect is how much your package manager does locally. If it's mostly dependency resolution (and dep features are simple) you can get by with a truncated index fetch (like Ruby). But if the things your pm does are complex you probably want the full repo.
And the size is the deciding factor. I'm sure it won't be much of an issue to fetch the Ruby index from git. But it might be for large checkouts. Though Gentoo switched from rsync because git does a great job at minimizing transfer size and ensuring checkout integrity, especially with a smart server. Rsync wasn't particularly fast either.
I’ve spent a fair amount of time working professionally on package manager tooling, and no, I don’t think git makes it easier.
What would you recommend instead?
A CDN with known paths, holding compressed metadata.
$CDN/packages.bin.xz
$CDN/packages/some-package.bin.xz
$CDN/packages/some-other-package.bin.xz

I'm guessing it gets expensive managing the infrastructure when you have tens of thousands of CI runs pulling the world from your packages every day
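For illustration, a client for the known-paths layout sketched a couple of comments up could be this small; the base URL is a placeholder and the xz-compressed binary format is assumed from the example paths:

```python
# Sketch of a client for a "CDN with known paths" metadata layout.
import lzma
import urllib.request

CDN = "https://cdn.example.org"  # placeholder base URL

def fetch_package_metadata(name: str) -> bytes:
    # One small compressed file per package.
    with urllib.request.urlopen(f"{CDN}/packages/{name}.bin.xz") as resp:
        return lzma.decompress(resp.read())

def fetch_full_index() -> bytes:
    # A single compressed file for clients that want the whole index anyway.
    with urllib.request.urlopen(f"{CDN}/packages.bin.xz") as resp:
        return lzma.decompress(resp.read())
```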
I think the Nixpkgs comparison is a bit misplaced. Unlike the other examples, users of Nix and Nixpkgs do not clone the repository, they download tarballs of particular snapshots. There is a large and very active Git repo where people collaborate on the package definitions, and it is stressing limits, but just because it is a large software project.
I... do clone the repository. I use git+https:// in my flakes instead of github: so I don't redownload the full 50MB of nixpkgs every time I update.
I wouldn't clone the repository on a deployed server box or in continuous integration scripts but for like daily use on my machine I always have a local clone because that's the actually reliable way to look up what's going on, experiment with changes, etc. Probably to a first approximation every "serious" Nix/NixOS user ends up doing this but it's not the default nor strictly necessary. I'd guess most of the Nixpkgs traffic is downloading archive tarballs.
Alternatively: Downloading the entire state of all packages when you care about just one isn't a great idea.
It's not a good idea, but it could be much faster and scale beyond git if you used a simpler protocol with heavier compression instead, especially for the initial download.
"Do we move the computation to the data, or the data to the computation" is a classic question across a great many subfields of software development.
If you only want one record out of a large database, fetching the entire database isn't usually a great way to find it.
If you have complex queries to run, requiring some central server to run them becomes a different kind of challenge.
In HTTP/2, parallel requests are cheap and quick. This is what ultimately won in the case of Cargo's registry protocol. Header compression and pipelining make requests take only several bytes and hide all the latency, so you don't really need a custom protocol to avoid making lots of requests. Cargo can make 200 requests, and it works just fine!
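As a rough illustration of that per-package, many-small-requests approach (the URL scheme below is simplified and hypothetical, not Cargo's exact sparse-index layout):

```python
# Fetch per-package index entries concurrently, in the spirit of a sparse
# registry protocol: one small file per package, many overlapping requests.
from concurrent.futures import ThreadPoolExecutor
import json
import urllib.request

INDEX = "https://index.example.org"  # placeholder sparse-index base URL

def fetch_entry(name: str) -> list[dict]:
    # Each line of the per-package file is assumed to be a JSON record
    # describing one published version.
    with urllib.request.urlopen(f"{INDEX}/{name}") as resp:
        return [json.loads(line) for line in resp.read().splitlines() if line]

def fetch_many(names: list[str]) -> dict[str, list[dict]]:
    # A thread pool stands in for HTTP/2 multiplexing here; either way,
    # a few hundred tiny requests finish quickly because they overlap.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return dict(zip(names, pool.map(fetch_entry, names)))
```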
The article doesn't really get into it (apart from a brief reference to ArgoCD) but this is all great ammunition for my "GitOps is terrible and a huge step backwards for deployment engineering" crusade.
Mind saying a bit more? What are your preferred alternatives to GitOps?
I'll do my best - I think I have at least a full talk in me on this topic, but I will try to be brief. I apologise if this post skims too quickly over key ideas.
I'm going to start by defining terms, because one of the big problems with GitOps is that it means different things to different people. So, for the purposes of this discussion, GitOps is the practice of deploying infrastructure code by polling source control systems for changes.
Now, in the world of web apps, we've known what good release cycles look like for a long time: build an artifact once, release it to each environment together with that environment's config, then run it, with those stages kept separate.
But for some reason, when it comes to infrastructure code, we insist on forgetting these lessons every few years. GitOps is just the latest iteration of "YOLO just apply the infra source directly". The major design flaws in Terraform stem from the same problem. As a result, we consistently fall into the same traps:
- Duplicated per-environment codebases (infrastructure/dev, infrastructure/prod), instead of a single infrastructure codebase + environment-specific config

For what it's worth, I think that between some of the improvements OpenTofu has made to workspaces (via dynamic backend blocks) and projects like Stategraph, the Terraform ecosystem is getting close to "proper" build/release cycles for IaC - but it has taken many years of work, because it cuts against the grain of Terraform's state model.
So in summary, GitOps is a step backwards because it bundles build/release/run into a single "poll VCS and apply" step. This leads to fragile IaC, slow feedback, and code duplication.
I'm just speculating based on my own experiences:
To me, GitOps works great for situations where you want to configure and deploy a lot of stuff that's developed and packaged by third parties. In this case the config is your code.
My problem is with how it's presented as a natural part of a continuous deployment flow. I see the benefit of how you get history/audit built-in, but I also think it's fair to ask if Git is the right database when all your commits are machine-authored and your history is linear. And isn't it really the cluster's job to keep that history in the first place?
And then there's the issue of adding another asynchronous layer to the deployment pipelines. It slows deploys down and complicates providing feedback on deployment progress.
Is the alternative a bespoke deployment operations console, over ssh or web?
Or is the concern that gitops doesn't give enough customization ability to operators over the actions being run?
In build2 we use git as a backing store for our package repository: while a package submission is a git commit with all the benefits the article describes, package consumers fetch post-processed, signed metadata and package archives over plain HTTPS.
This problem is a hot topic in the Nix world, but the generally discussed solution isn't to move away from git, but to break up Nixpkgs. AFAIK, that was the primary motivation for the ever controversial Nix flakes feature, in an attempt to bring independent repositories with Nix expressions in them up to parity with Nixpkgs in terms of ergonomics.
Though there is an argument to be made that if the Nix community actually manages to break up Nixpkgs and turns it into a large collection of related repositories they would have moved away from git at the root of the package database, since in that world you'd need a meta layer on top of all those former Nixpkgs pieces for discoverability and coordinated releases.
I feel like on the download side Nix mostly gets around this. Both when downloading the main Nixpkgs repository and when building individual packages, you basically always download just the repository tarball (this is what Flakes and Channels do, and what the fetchFromGitHub and fetchFromGitLab functions do. fetchgit is really advised against in Nixpkgs!), which is considerably smaller and faster. The whole Nixpkgs repo is 2.5GB, but just the release tarball is ~35MB, which is a lot more reasonable.
I think most issues with Nixpkgs come from the CI side, because unfortunately, since the package manager is effectively code, you can’t easily download just what’s changed. Though nix caches can store the source code as well, so raw git fetches should be really rare with Nix. I find that in contrast to other package managers, using Git for Nix works really well because of the code nature of the whole thing. I hope that we manage to never move away from that.
How does that help if you now have to download the same megabytes across 10 repositories?
I guess the idea is that Nixpkgs would be chopped up into meaningful chunks so that you typically depend on, and download, only a fraction of it. Like, why do you even need to download a Nixified Haskell package set snapshot if all you're trying to get out of Nixpkgs is pkgs.postgresql-18?
Though there's an argument to be made that flakes may not be a realistic or efficient way to address the client side of the problem (i.e. the megabytes you have to download). For instance, lazy trees seem like a much better direction. We should be able to go even further, downloading individual files as the Nix eval requires them (perhaps backed by something like a Merkle tree in order to preserve determinism, though if you're reading from a git repo, git's commit metadata would probably suffice too), but that would probably require some Nixpkgs refactoring to make sure the eval ↺ download roundtrips are minimized.
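A minimal sketch of what per-file, on-demand fetching against a pinned revision could look like; the cache layout is made up, and this glosses over the eval ↺ download roundtrip problem mentioned above:

```python
# Fetch individual files from a pinned nixpkgs revision on demand instead of
# downloading a whole snapshot up front. Uses GitHub's raw-file endpoint;
# the local cache layout is hypothetical.
import pathlib
import urllib.request

CACHE = pathlib.Path.home() / ".cache" / "lazy-nixpkgs-sketch"

def fetch_file(rev: str, path: str) -> str:
    # Pinning by commit hash keeps this deterministic: the same (rev, path)
    # pair always resolves to the same contents.
    cached = CACHE / rev / path
    if not cached.exists():
        url = f"https://raw.githubusercontent.com/NixOS/nixpkgs/{rev}/{path}"
        cached.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as resp:
            cached.write_bytes(resp.read())
    return cached.read_text()

# e.g. fetch_file("<pinned commit hash>", "pkgs/top-level/all-packages.nix")
```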
chopped up in meaningful chunks
It seems more meaningful to chop things up in historical slices rather than arbitrarily cut out entire sets of packages, or am I missing something? I very rarely need the old versions of anything for day-to-day use.
A current snapshot of Nixpkgs is already pretty big.
ETA: tens of megabytes with compression, hundreds without, just for a current snapshot of the expressions.
And then splitting will amplify the CI issues.
Actually it's a good question how large the «expression closures» of a typical system, and of the largest CI jobs, are. The notion is not supported by current tooling, which arguably also makes some of the evaluation-overhead issues harder to tackle…
Great overview! I think we need something like "versioned git-like namespace + projection to relational DB with indices"
It's probably more of a graph DB semantically, although it might make sense to then project it to relational for implementation reasons.
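A toy sketch of such a projection, with an invented schema, could be as simple as a couple of indexed SQLite tables keyed by the revision they were generated from:

```python
# Hypothetical projection of a versioned package namespace into a relational
# store with indices; the schema is invented for illustration only.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS packages (
    name     TEXT NOT NULL,
    version  TEXT NOT NULL,
    revision TEXT NOT NULL,   -- the VCS commit this row was projected from
    metadata TEXT,            -- JSON blob: license, homepage, etc.
    PRIMARY KEY (name, version, revision)
);
CREATE TABLE IF NOT EXISTS dependencies (
    name      TEXT NOT NULL,
    version   TEXT NOT NULL,
    dep_name  TEXT NOT NULL,
    dep_range TEXT NOT NULL   -- version constraint, e.g. ">=1.2"
);
-- The indices are the point: name lookups and reverse-dependency queries
-- stay fast without walking a repository checkout.
CREATE INDEX IF NOT EXISTS idx_pkg_name ON packages (name);
CREATE INDEX IF NOT EXISTS idx_dep_reverse ON dependencies (dep_name);
"""

con = sqlite3.connect("index.db")
con.executescript(SCHEMA)
```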
I think Gentoo's Portage is a good counter-example, as far as I can tell (as a user) the git syncing works very well. For the typical update "Received objects" is a couple MB and the whole process completes in less than 5 seconds.
From the post, the Homebrew, Nixpkgs and CocoaPods examples are the most comparable. I am not very familiar with these, but based on the descriptions in the post, here is what Portage seems to be doing differently:
- git fetch origin --depth 1 for updates

Portage currently has about 20K packages, the size of the checkout is about 680MB with 120K files, and my .git is 114MB. I believe the Git infrastructure is self-hosted.
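The shallow sync described above boils down to roughly the following; the repository URL and target directory are placeholders, and this is not exactly what emerge --sync runs:

```python
# Rough sketch of a depth-1 git sync, invoked via subprocess.
import pathlib
import subprocess

REPO_URL = "https://example.org/gentoo-mirror.git"       # placeholder
REPO_DIR = pathlib.Path("/var/db/repos/gentoo-example")  # placeholder

def sync() -> None:
    if not REPO_DIR.exists():
        # Initial sync: a history-free clone keeps the download small.
        subprocess.run(
            ["git", "clone", "--depth", "1", REPO_URL, str(REPO_DIR)],
            check=True,
        )
    else:
        # Updates: fetch only the newest commit and jump the checkout to it.
        subprocess.run(
            ["git", "-C", str(REPO_DIR), "fetch", "origin", "--depth", "1"],
            check=True,
        )
        subprocess.run(
            ["git", "-C", str(REPO_DIR), "reset", "--hard", "FETCH_HEAD"],
            check=True,
        )
```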
I thought the author meant pulling packages from git forges, but I guess that is fine.
BTW, another package manager using git as a database is vcpkg
Already mentioned in the article.
Author here, I added that in after seeing a number of mentions of it here and on mastodon, so I looked into it and updated the post.
Thanks for clarifying! You might consider summarizing such changes at the top of an article. I do appreciate that post pages facilitate this by linking to their source on GitHub.
The alternative is some sort of centralized service you have to set up and maintain, possibly with its own protocol. For a lot of (most?) projects the cost (financially, physically and mentally) of such a setup is too great.
For example, the package manager for Inko also uses Git repositories for packages (though there's no central index other than this page). Not because it's the best solution, but because it's the least costly and least invasive for me as a maintainer and will likely be more than suitable for at least the next decade.
An alternative is to download archives of Git repositories, reducing the amount of data that needs to be downloaded (as these archives don't include the full Git history). The problem with this approach is that every forge has its own URL structure and file structure within the archive, making it a bit more tedious to support different forges. This also means it's not "forward compatible": you can't just use some random new Git forge, instead support might need to be added explicitly.
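To illustrate the forge-specific part, a package manager ends up carrying a table like the one below; the templates reflect my understanding of GitHub's and GitLab's archive endpoints, and every additional forge needs a new entry (and possibly new handling of the directory layout inside the archive):

```python
# Per-forge archive URL templates; new forges need explicit support.
ARCHIVE_TEMPLATES = {
    "github.com": "https://github.com/{owner}/{repo}/archive/{rev}.tar.gz",
    "gitlab.com": "https://gitlab.com/{owner}/{repo}/-/archive/{rev}/{repo}-{rev}.tar.gz",
}

def archive_url(forge: str, owner: str, repo: str, rev: str) -> str:
    try:
        template = ARCHIVE_TEMPLATES[forge]
    except KeyError:
        # The "not forward compatible" case: an unknown forge can't be used
        # until someone adds its URL scheme here.
        raise ValueError(f"no archive support for forge {forge!r}")
    return template.format(owner=owner, repo=repo, rev=rev)
```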
Interestingly, Gentoo started distributing its tree via rsync, then switched to git for performance reasons. Users can use a slightly different repo than developers. Third-party repos are common. Gentoo's flexibility makes it too hard to compile a simple dependency index like most other PMs have.
I think the proper view is «DVCS are used for development, at some deployment scale you start offering something pre-computed and lighter for least demanding / involved users — downstream of development in DVCS».
Nixpkgs has been doing this at a deeper level for a long time: Nix evaluates how to build packages, but then checks whether an already-built cache entry for exactly those build instructions is available. I guess at some point there will be «if you do not do overrides and only evaluate direct attributes in Nixpkgs within what the CI rebuild counter counts, here, have an evaluation cache of those too».