How I protect my Forgejo instance from AI web crawlers
42 points by yogsototh
I just require a login. Public code can be mirrored out to Codeberg, GitLab, GitHub, etc.
The trade-off is a loss of collaboration. OP mentioned federation and probably wants public bug tracking and PRs.
For friends/family it's no big deal: they get a login and they can do stuff.
Public code, where mass collaboration is encouraged, doesn't belong on my self-hosted instance; it belongs on Codeberg, GitHub, etc. Nobody is going to sign up just to file an issue on some random self-hosted website, even if they did manage to find it.
About two dozen people have signed up for accounts on my forge in the past couple of years. Not many, but that is almost two dozen more than ever contributed to any of my projects that were on GitHub.
I have accounts on a few dozen self-hosted forges too.
A lot of us absolutely will sign up.
IIRC Forgejo will accept logins from other sites via OAuth, so this doesn't prevent collaboration.
However, people may be reluctant to log in just to check whether a bug has already been reported.
At this point I think we need to escalate: whenever a web crawler comes by, run the content through a scrambler so that it potatoes the LLM training data.
I hope somebody makes a nonsense-generator for git forges that generates shitty code and PRs.
I wrote this! https://git.sr.ht/~technomancy/shoulder-devil
I just haven't wired it up yet.
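For what it's worth, the crudest possible version of the scrambling idea upthread (this is not what shoulder-devil does, just a toy illustration) is to shuffle the tokens of whatever source a detected crawler asks for, so the response still looks like code but carries nothing coherent to train on:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// scramble shuffles the tokens of a source snippet so the output keeps the
// superficial shape of code while carrying no usable meaning.
func scramble(source string) string {
	words := strings.Fields(source)
	rand.Shuffle(len(words), func(i, j int) {
		words[i], words[j] = words[j], words[i]
	})
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(scramble("func add(a, b int) int { return a + b }"))
}
```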
There's probably a never-ending list of rules like this to allow well-behaved automated access. For example, on mine I allow anything that sends Basic auth, anything that sends a Git-Protocol header or requests a path ending in /git-receive-pack or ?service=git-receive-pack, or anything whose user agent contains node, npm, yarn, curl, etc.
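A rough sketch of what that kind of filter looks like, written as a standalone Go HTTP middleware rather than whatever the commenter actually runs (the headers and rules are the ones listed above; the port, handler, and 403 response are placeholders):

```go
package main

import (
	"net/http"
	"strings"
)

// allowAutomated reproduces the allow-list described above: requests that
// look like legitimate tooling skip the challenge.
func allowAutomated(r *http.Request) bool {
	// Anything that sends Basic auth.
	if _, _, ok := r.BasicAuth(); ok {
		return true
	}
	// Anything that sends a Git-Protocol header.
	if r.Header.Get("Git-Protocol") != "" {
		return true
	}
	// Paths ending in /git-receive-pack or ?service=git-receive-pack.
	if strings.HasSuffix(r.URL.Path, "/git-receive-pack") ||
		r.URL.Query().Get("service") == "git-receive-pack" {
		return true
	}
	// User agents of common tooling (package managers, curl, ...).
	ua := strings.ToLower(r.Header.Get("User-Agent"))
	for _, tool := range []string{"node", "npm", "yarn", "curl"} {
		if strings.Contains(ua, tool) {
			return true
		}
	}
	return false
}

// challenge stands in for whatever the real setup does with everything else
// (a login page, an Anubis-style proof-of-work challenge, etc.).
func challenge(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if allowAutomated(r) {
			next.ServeHTTP(w, r)
			return
		}
		http.Error(w, "challenge required", http.StatusForbidden)
	})
}

func main() {
	forge := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("forge content\n"))
	})
	http.ListenAndServe(":8080", challenge(forge))
}
```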
Honestly, this feels like a good-ish reason to use Anubis with a very low difficulty challenge - the project should end up as a repository collecting a lot of the user agents, etc., that need to be able to bypass it.
Are there any caveats to this method? If I git clone over HTTPS, does it work?
EDIT: I see now that git user agents get a pass.
If you decouple the frontend from the backend and build a SPA that talks only the Git protocol to any Git server, that would fix it.
Case in point: https://gitworkshop.dev/
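To make "talks only the Git protocol" concrete: a Git-aware frontend can read a repository through the same smart-HTTP endpoints git itself uses. A minimal sketch in Go (the repository URL is a placeholder, and a real client would parse the pkt-line payload rather than just measure it):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Smart HTTP ref advertisement: the same endpoint `git clone` hits first.
	url := "https://example.org/alice/project.git/info/refs?service=git-upload-pack"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// A Git server answers with application/x-git-upload-pack-advertisement,
	// a pkt-line encoded list of branches and tags a client could render.
	fmt.Println("content-type:", resp.Header.Get("Content-Type"))
	fmt.Println("bytes of ref data:", len(body))
}
```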
The infamous "cat and mouse" game, as if the AI vacuum hoovers wouldn't cheerfully implement git-remote-vector and avoid having to deal with all that pesky HTML that surrounds the actual bytes they truly care about.
I think a lot of admins would be substantially happier if the bloody scrapers would use dedicated retrieval APIs and protocols such as git rather than just repeatedly requesting unchanged HTML renderings of the underlying data - it would generally be more efficient and put less load on the servers :( One of the most infuriating things about this AI bubble era is how literally stupid a lot of it is :(
> I think a lot of admins would be substantially happier if the bloody scrapers would use dedicated retrieval APIs and protocols such as git
To be honest, I am very happy they don't do that; it's easier to block them while they pretend to be browsers. It would be considerably harder if they were using git+https.
I have no desire to let them anywhere near my code, no matter how they try to access it. I hope they keep up the current practice, because while it is wasteful, it is also something I can block, and to me, that matters more.
If they did that, it would mean they'd put a lot of effort and attention into their stuff, which would also mean they probably wouldn't be scraping the same page hundreds of times per day anymore, so I'd be fine with it.