How I protect my Forgejo instance from AI web crawlers
42 points by yogsototh
I just require a login. Public code can be mirrored out to Codeberg, GitLab, GitHub, etc.
The trade-off is a loss of collaboration. OP mentioned federation and probably wants public bug tracking and PRs.
For friends/family it's no big deal: they get a login and they can do stuff.
Public code, where mass collaboration is encouraged, doesn't belong on my self-hosted instance; it belongs on Codeberg, GitHub, etc. Nobody is going to sign up just to file an issue on some random self-hosted website, even if they did manage to find it.
About two dozen people have signed up for accounts on my forge in the past couple of years. Not many, but that is almost two dozen more than ever contributed to any of my projects that were on GitHub.
I have accounts on a few dozen self-hosted forges too.
A lot of us absolutely will sign up.
IIRC Forgejo will accept logins from other sites via OAuth, so this doesn't prevent collaboration.
However, people may be reluctant to log in just to check whether a bug has already been reported.
At this point I think we need to escalate: whenever a web crawler comes by, run the content through a scrambler so that it potatoes the LLM training data.
I hope somebody makes a nonsense-generator for git forges that generates shitty code and PRs.
I wrote this! https://git.sr.ht/~technomancy/shoulder-devil
I just haven't wired it up yet.
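For what it's worth, the crudest possible version of the scrambling idea upthread (this is not what shoulder-devil does, just a toy illustration) is to shuffle the tokens of whatever source a detected crawler asks for, so the response still looks like code but carries nothing coherent to train on:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// scramble shuffles the tokens of a source snippet so the output keeps the
// superficial shape of code while carrying no usable meaning.
func scramble(source string) string {
	words := strings.Fields(source)
	rand.Shuffle(len(words), func(i, j int) {
		words[i], words[j] = words[j], words[i]
	})
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(scramble("func add(a, b int) int { return a + b }"))
}
```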
There's probably a never-ending list of rules like this to allow well-behaved automated access. For example, on mine I allow anything that sends Basic auth, anything that sends a Git-Protocol header or requests a path ending in /git-receive-pack or ?service=git-receive-pack, or anything whose user agent contains node, npm, yarn, curl, etc.
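A rough sketch of what that kind of filter looks like, written as a standalone Go HTTP middleware rather than whatever the commenter actually runs (the headers and rules are the ones listed above; the port, handler, and 403 response are placeholders):

```go
package main

import (
	"net/http"
	"strings"
)

// allowAutomated reproduces the allow-list described above: requests that
// look like legitimate tooling skip the challenge.
func allowAutomated(r *http.Request) bool {
	// Anything that sends Basic auth.
	if _, _, ok := r.BasicAuth(); ok {
		return true
	}
	// Anything that sends a Git-Protocol header.
	if r.Header.Get("Git-Protocol") != "" {
		return true
	}
	// Paths ending in /git-receive-pack or ?service=git-receive-pack.
	if strings.HasSuffix(r.URL.Path, "/git-receive-pack") ||
		r.URL.Query().Get("service") == "git-receive-pack" {
		return true
	}
	// User agents of common tooling (package managers, curl, ...).
	ua := strings.ToLower(r.Header.Get("User-Agent"))
	for _, tool := range []string{"node", "npm", "yarn", "curl"} {
		if strings.Contains(ua, tool) {
			return true
		}
	}
	return false
}

// challenge stands in for whatever the real setup does with everything else
// (a login page, an Anubis-style proof-of-work challenge, etc.).
func challenge(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if allowAutomated(r) {
			next.ServeHTTP(w, r)
			return
		}
		http.Error(w, "challenge required", http.StatusForbidden)
	})
}

func main() {
	forge := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("forge content\n"))
	})
	http.ListenAndServe(":8080", challenge(forge))
}
```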
Honestly, this feels like a good-ish reason to use Anubis with a very low difficulty challenge - the project should end up as a repository collecting a lot of the user agents, etc., that need to be able to bypass it.
Are there any caveats to this method? If I git clone over HTTPS, does it work?
EDIT: I see now that git user agents get a pass.
If you decouple the frontend from the backend and build a SPA that talks only the Git protocol to any Git server, that would fix it.
Case in point: https://gitworkshop.dev/
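To make "talks only the Git protocol" concrete: a Git-aware frontend can read a repository through the same smart-HTTP endpoints git itself uses. A minimal sketch in Go (the repository URL is a placeholder, and a real client would parse the pkt-line payload rather than just measure it):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Smart HTTP ref advertisement: the same endpoint `git clone` hits first.
	url := "https://example.org/alice/project.git/info/refs?service=git-upload-pack"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// A Git server answers with application/x-git-upload-pack-advertisement,
	// a pkt-line encoded list of branches and tags a client could render.
	fmt.Println("content-type:", resp.Header.Get("Content-Type"))
	fmt.Println("bytes of ref data:", len(body))
}
```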
The infamous "cat and mouse" game, as if the AI vacuum hoovers wouldn't cheerfully implement git-remote-vector and avoid having to deal with all that pesky HTML that surrounds the actual bytes they truly care about.
I think a lot of admins would be substantially happier if the bloody scrapers would use dedicated retrieval APIs and protocols such as git rather than just repeatedly requesting unchanged HTML renderings of the underlying data - it would generally be more efficient and put less load on the servers :( One of the most infuriating things about this AI bubble era is how literally stupid a lot of it is :(
> I think a lot of admins would be substantially happier if the bloody scrapers would use dedicated retrieval APIs and protocols such as git
To be honest, I am very happy they don't do that; it's easier to block them while they pretend to be browsers. It would be considerably harder if they were using git+https.
I have no desire to let them anywhere near my code, no matter how they try to access it. I hope they keep up the current practice, because while it is wasteful, it is also something I can block, and to me, that matters more.
If they did that, it would mean they'd put a lot of effort and attention into their stuff, which would also mean they probably wouldn't be scraping the same page hundreds of times per day anymore, so I'd be fine with it.