Guarding My Git Forge Against AI Scrapers
62 points by technetium
Worryingly, VNPT and Bunny Communications are home/mobile ISPs. I can't say for certain that their IPs belong to domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.
Yeah, there's a widespread malware infestation: adware bundled into bottom-of-the-barrel apps feeding attack-cloaking residential-proxy services such as Brightdata. I have extremely cynical opinions about why and how they are allowed to do such evil things out in the open.
https://lobste.rs/search?q=brightdata&what=comments&order=newest
Ah yes, the Internet as it was meant to work.
I'm about to try running a major code forge, and I am not in the least bit excited to face the core UX problem of the modern web: feeding scrapers their alternative slop content at just the right place to keep them gobbling and horfing, so that 20% of your resources are still available for people.
So... are we bringing back mail order shareware?
I had what I think is a great idea in the comments over on the orange site. The idea: just serve untrusted users pages without hyperlinks. The text of the page and of the links can still be there, but instead of being blue, the links are greyed out, and clicking one brings up a modal that says something like "log in to use hyperlinks".
It's not a strategy for the general web, since the general web badly wants SEO, but a private/personal code forge has almost no SEO needs. You don't have to break all links, either. Keep active links to a small set of pages that can be easily cached (like READMEs), sure, but break the links that would send bots into the code tree and the project history. The great thing about it is that nobody is blocked from accessing anything they know about, but scrapers that just follow every link (the most damaging kind) are neutralized.
I like it because you can use this strategy while being fully transparent about what you're doing and why. You can even expose the trust mechanism through a UI element: a stoplight status indicator. Green: full trust. You're logged in, or on a trusted subnet, or something; links are clickable. Yellow: semi-trust. E.g. you could have a sharing mechanism by which logged-in users generate links with sharing tokens; a single share token might let the recipient see 50 pages with full hyperlinking before they need to either make an account or get a new token to see more links. Red: untrusted. You get the greyed-out links without any href URL.
I like this because if you know a URL, you're always free to see its public content. With trust comes progressive enhancement of that experience. It won't completely stop scraping, but I don't think it would be possible to completely stop a determined scraper. What it at least does is ensure that the most naive scrapers -- the ones which effectively run DDoS attacks on anyone hosting the Linux kernel repo -- would be flummoxed.
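The greyed-out rendering could be sketched like this (a hypothetical helper, assuming server-side HTML rendering in Python; the class and title text are illustrative):

```python
from html import escape

def render_link(url: str, text: str, trusted: bool) -> str:
    """Render a hyperlink only for trusted visitors; untrusted
    visitors see the link text greyed out with no href at all,
    so the URL never reaches the page (or the scraper)."""
    if trusted:
        return f'<a href="{escape(url)}">{escape(text)}</a>'
    # No href anywhere in the markup: naive link-following
    # scrapers have nothing to crawl.
    return (f'<span class="link-disabled" '
            f'title="log in to use hyperlinks">{escape(text)}</span>')
```

The key design choice is doing this on the backend: the URL is simply absent from the HTML, rather than hidden by CSS or JavaScript that a bot could ignore.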
Thoughts?
This is an excellent strategy. The other is to simply have an IP-level blocklist covering all of Amazon, Azure, Alibaba, etc., since normal users very rarely come from there. There are all kinds of tools that'll dig out the AS numbers of Amazon, Microsoft, and whoever, and just block them at an IP level. It doesn't stop everything, but it certainly reduces traffic greatly in addition to what you're doing.
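A minimal sketch of that IP-level check (the CIDR ranges here are documentation placeholders, not real cloud prefixes; in practice you'd load the current published ranges for each provider's AS numbers):

```python
import ipaddress

# Placeholder prefixes from the documentation ranges; a real
# deployment would load the providers' current published prefixes.
DATACENTER_NETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # stand-in for "cloud A"
    ipaddress.ip_network("198.51.100.0/24"),  # stand-in for "cloud B"
]

def is_datacenter(ip: str) -> bool:
    """True if the client address falls inside a known datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_NETS)
```

In an nginx deployment the same set of prefixes would typically go into a `geo` block instead, so the check happens before the request reaches the application.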
Many scrapers come from residential IPs.
Yeah, I think I prefer breaking links for a couple reasons.
I don't want to play a never-ending game of whack-a-mole trying to keep blocklists up to date. And if the blocklisting approach becomes popular, I think it will mostly just incentivize behavior like routing through residential-IP VPNs. If that happened we'd be back to square one, except we'd have made things more complicated and opaque along the way.
The bigger problem for me is that I've invested a lot of engineering hours into designing a new code forge experience from the ground up, and to recover that investment I'm surely going to need to be able to compete with GitHub.
Many of the people I really want interacting with the forge experience on a daily basis (even if it's just to read the source code of their favorite OSS projects) are going to be engineers inside companies like Microsoft and Amazon. I can't win over the hearts and minds of those people if I've IP-blocked them!
What I separately like about this (the version where URLs are shown but not clickable) is that in Firefox, being untrusted is the most minor of speedbumps (for my text-mode browsing setup it is no speedbump at all, but that's personal).
You mean because you'd just hand-edit the URL to navigate deeper into the file tree?
It goes further, in Firefox you can select a URL, right-click, and say «Open Link» or «Open Link in New Tab». My terminal setup simply always shows link targets and has a Vim binding for opening the URL under the cursor.
Yeah, the way I would do this, I'd build the link hiding on the backend, not the frontend, so it would indeed break every workflow, including yours. That's kind of the point. It's not a gimmick the bots will wise up to in a few weeks; it's a full roadblock.
Yes, there would still be a link, and you could still right-click to open it in a new tab, of course. It's just that the content of the new tab would be a "why are my links broken" page explaining why public networks of hyperlinks into a deep graph of data are no longer cost-effective to operate in the AI era.
This only makes sense because there's absolutely no need to scrape the data. Cloning the repo is the same thing, but 100x more efficient! I'm not putting up a roadblock to getting the data. Once I'm not the one paying for exchange bandwidth, anyone can knock themselves out scraping or training on a set of pages they host.
Horrifyingly, I won't be surprised if bots learn to deduce URLs from listings (for popular «starting point» backends) before they learn to use git.
Yeah I thought about that, but I don't think it's actually that bad. If you're smart enough to truly understand the internal URL structure you're probably smart enough to not scrape the same resource again and again, especially if you can see that the URL has a hash in it such that its content will not ever change.
It depends on how messed up their DDoS architecture is: if the left hand doesn't know what the right hand has in the previously-scraped dataset…
The change I fear is a local, stateless change in the URL-list extraction; any more reasonable change would affect the whole architecture.
I took more or less the same approach to block scrapers I deemed abusive from my Gitea instance (Forgejo was forked from Gitea), also via similar nginx features fronting it. I just returned a simple 403. What I defined as abusive: anything making my instance noticeably slower, causing it to OOM, or burning enough CPU that I noticed it while eyeballing graphs. No comments on OP's site, so I'll share my list here:
map $http_user_agent $badagent {
    default 0;
    ~*bingbot 1;
    ~*FriendlyCrawler 1;
    ~*ImagesiftBot 1;
    ~*Amazonbot 1;
    ~*Bytespider 1;
    ~*AhrefsBot 1;
    ~*MJ12bot 1;
    ~*PetalBot 1;
    ~*SemrushBot 1;
    ~*YandexBot 1;
    ~*DataForSeoBot 1;
    ~*BLEXBot 1;
    ~*opensiteexplorer 1;
    ~*AcademicBotRTU 1;
    ~*Neevabot 1;
    ~*ThinkChaos 1;
    ~*paloaltonetworks 1;
    ~*Barkrowler 1;
    ~*ClaudeBot 1;
    ~*Applebot 1;
    ~*facebookexternalhit 1;
    ~*Googlebot 1;
    ~*ZoominfoBot 1;
    ~*meta-externalagent 1;
}
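For completeness: the map only sets the `$badagent` variable; a snippet along these lines in the server block (a sketch, not the commenter's exact config) is what actually returns the 403:

```nginx
server {
    # ... listen / server_name / proxy config as usual ...
    if ($badagent) {
        return 403;
    }
}
```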
I sent you a DM cause I was very surprised to see my username in that list!!
From a bit more digging, it seems someone else used that name for a scraper.
I found a couple sample user agents in various DBs, mostly looking like: Mozilla/5.0 (compatible; ThinkChaos/0.3.0; +In_the_test_phase,_if_the_ThinkChaos_brings_you_trouble,_please_add_disallow_to_the_robots.txt._Thank_you.).
Then going from there, I found that the devilish ThinkBot LLM company used a bunch of very similar UAs: Mozilla/5.0 (compatible; ThinkBot/0.3.0; +In_the_test_phase,_if_the_spider_brings_you_trouble,_please_add_our_IP_to_the_blacklist._Thank_you.) (src).
So my best guess is the ThinkChaos one is a variant of that crawler, nothing I can do about it :(
Ha! What a coincidence. Yep, scraping-like activity from that user agent. Checking logs, it hasn't been around in a while, so if it makes you feel better, I've removed it.
I found this a very high quality write-up. I've been meaning to set up iocaine in front of my various web servers for quite some time now. My first thought reading this was about how I'm going to translate it all to Apache configuration, but thinking about it again, I may put it on the VPS that fronts for my home server over wireguard, rather than just forwarding ports 80 and 443 like I do now. If I do that, I could either use nginx and follow this guide exactly, or do whatever the standard setup in iocaine is.
Recently I got a wave of (presumably AI) scrapers using (Mozilla/) browser-ish user agents on my git forge.
Since then I have deployed a couple of short nginx rules that present a JavaScript+cookie challenge to those. I didn't want the extra challenge of operating an Iocaine/Anubis/go-away or whatever if I could avoid it.
To get the challenge, you need to have a Mozilla/ user agent and no 'http' in your user-agent (as e.g. UptimeRobot presents a 'Mozilla compatible' user agent but also self-identifies with a URL in its user agent).
I don't mind friendly bots that self-identify and aren't trying to play funny games, so if you don't pretend to be a browser I'm happy. I use IP-based rate limiting (again, nginx provides this).
'It'd be easy for them to work around this', I hear people say. Which is true, but they don't: they seem to be low-effort scrapers, at least the ones that attacked my forge and burnt all my CPU/elec. As for the 'stealth' scrapers that pretend to be browsers and rotate their IPs frequently: I see no reason to believe they would preserve cookies, because a preserved cookie would make it easy to trip them up into identifying themselves persistently and to use that as a rate-limiting token.
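My reading of that user-agent test, as an nginx sketch (the `/challenge` location and the cookie check are hypothetical and elided; this is not the commenter's actual config):

```nginx
# http-level map: 1 = browser-ish UA with no self-identifying URL.
# nginx picks the first matching regex, so the "http" exemption
# must come before the Mozilla/ catch-all.
map $http_user_agent $needs_challenge {
    default 0;
    ~*http 0;        # self-identifying bots include a URL; leave them be
    ~^Mozilla/ 1;    # everything else claiming to be a browser
}

server {
    # ...
    if ($needs_challenge) {
        # hand out the JavaScript+cookie challenge unless a valid
        # cookie is already present (cookie check elided here)
        return 302 /challenge;
    }
}
```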
I didn't want the extra challenge of operating an Iocaine/Anubis/go-away
Iocaine 2 was a serious pain to install and configure, but version 3 is basically just "add this apt repo and 5 lines of Caddyfile" and you're done; less work than what you've described doing. (Not that there's anything wrong with what you're doing; I just don't want someone else to come along and get the wrong idea.)
I like the idea behind Iocaine, but it seems to misclassify me: all pages on come-from.mad-scientist.club and iocaine.madhouse-project.org linked in the post render as gibberish for me.
Great writeup. These nginx snippets will be especially time-saving when I improve my own iocaine setup.
The solutions I see all involve adding a new server-side proxy between the reverse proxy and the web service. Any idea if there are solutions based on HTTP subrequests (like nginx's auth_request module) and redirections? I'm not sure that would even be less performant than such a proxy.
EDIT: Actually seeing posts related to Anubis doing this, when I checked their docs it was using a proxy like with Iocaine.
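For reference, the auth_request pattern mentioned above looks roughly like this (a sketch; the `/check` location, the verdict service on port 9000, and the `forge` upstream are hypothetical names):

```nginx
location / {
    auth_request /check;           # subrequest runs before serving content
    error_page 401 = /challenge;   # bounce failures to a challenge page
    proxy_pass http://forge;
}

location = /check {
    internal;
    # Small verdict service: 2xx allows the request, 401/403 denies it.
    proxy_pass http://127.0.0.1:9000;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}
```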
I made my own proof-of-work bot deterrent, and I've been using it for years. I've always been kind of conflicted about the "proof of work is bad praxis" argument.
On one hand, I can totally see how blocking older computers or nonstandard browsers is bad; on the other hand, it seems to me like using something other than proof of work means either giving up on publishing, going private to some extent, or signing up for an eternal game of high-stakes whack-a-mole.
Everyone has different goals and I think that's why we handle this problem differently. I just want to be able to publish. And for me, showing up on search results is a part of that.
So in order to reach that goal, I configured mine to always allow home pages, lists of repositories, repository home pages and readmes, etc. But once you start going into individual source code files or commits, it will demand a PoW, then set a cookie good for another 20 page views.
I tried to make the proof of work as unintrusive to the user as possible while keeping it as painful for the scraper as possible: a memory-hard scrypt hash function with a large memory parameter but low PoW difficulty. That means it should never take longer than a couple of seconds, even on an older phone. On a modern phone or computer, it flashes past before you can even read what the text says. But in an endgame situation where the crawlers are actually solving the proof-of-work challenges, they would not be able to effectively farm it out to ASICs, potentially not even to GPUs.
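The memory-hard-but-low-difficulty combination can be sketched with Python's `hashlib.scrypt` (the cost parameters and difficulty here are illustrative, not the commenter's actual values; a browser implementation would do the `solve` side in WebAssembly):

```python
import hashlib

# scrypt cost parameters: memory use is roughly 128 * N * R bytes,
# so this sketch needs ~8 MiB per hash. Illustrative values only.
N, R, P = 2**13, 8, 1
DIFFICULTY_BITS = 6   # low difficulty: ~64 hashes expected per solve

def _hash(challenge: bytes, nonce: int) -> bytes:
    return hashlib.scrypt(nonce.to_bytes(8, "big"), salt=challenge,
                          n=N, r=R, p=P,
                          maxmem=64 * 1024 * 1024, dklen=32)

def solve(challenge: bytes) -> int:
    """Client side: find a nonce whose scrypt digest has the
    required number of leading zero bits."""
    nonce = 0
    while True:
        digest = _hash(challenge, nonce)
        if int.from_bytes(digest[:2], "big") >> (16 - DIFFICULTY_BITS) == 0:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    """Server side: one scrypt call checks the submitted nonce."""
    digest = _hash(challenge, nonce)
    return int.from_bytes(digest[:2], "big") >> (16 - DIFFICULTY_BITS) == 0
```

The point of the large memory parameter is that each of the ~64 expected attempts costs megabytes of RAM, which is cheap for one visitor but hostile to ASIC or GPU farming, while the low difficulty keeps the visitor's total wait short.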
The main problem I had with this system is that it requires WebAssembly, and GrapheneOS has that disabled by default. I was able to solve that fairly easily by displaying an explanation of what happened and three clicks' worth of steps to enable it for whatever domain the bot deterrent is hosted on.
To be honest, I've kind of fallen in love with this thing, and I use it as a captcha as well. (It started as a replacement for reCAPTCHA and was bolted onto the scraper-blocking use case later.) I think these kinds of solutions have a lot of applications beyond just stopping AI scrapers; being able to stop bots and automated systems without depending on big-tech solutions opens up a huge amount of convenience and safety for small-scale / home-hosted web apps, IMO.
But in an endgame situation where the crawlers are actually solving the proof-of-work challenges, they would not be able to effectively farm it out to ASICs, potentially not even to GPUs.
Spammers have been using compromised computers for over 20 years. It’s safest to model an endgame by assuming that attackers have unbounded free compute capacity. If a PoW scheme becomes popular enough that spammers decide it’s worth defeating, it’ll soon become useless. This implies that what makes proof-of-work effective is not the work, it’s the weirdness.
If they have unbounded compute power, why don't they use a full web browser with Selenium or something? If they did, they would be able to crash my Forgejo and post ads on my blog comments again without even having to look at my code. Not to mention, they would be able to bypass everyone's Anubis/iocaine and the many other protection mechanisms folks have come up with.
You are probably right that the obscurity (having to look at the code in order to figure out how to get the bot to solve the challenge) is the main factor here, considering that I'm a tiny fish in a large ocean.
But I am not entirely convinced that it's the only factor, and I guess that's important to me. It's like the concept of "defense in depth", not only am I camouflaged, I'm also spiky and poisonous, with a proof that there's no antidote. If I start getting eaten, I still have some headroom to crank up the poison without having to make any code changes.
Considering that scrapers aren't using real browsers right now, that leads me to believe that enough memory allocation will actually deter them at some point; there are easier targets out there.
if they have unbounded compute power, why don't they use a full web browser with Selenium or something?
Because it's slow, and they can gobble up a whole lot of stuff without resorting to that. Some of them do use Chrome.
Not to mention, they would be able to bypass everyone's Anubis / iocaine and many other protection mechanisms that folks have come up with.
In theory, yes. In practice, they're not getting past iocaine, for the simple reason that the crawlers that piggyback on real Chromes use URLs collected by the simple bots. Since the simple bots are caught by iocaine and get served poisoned URLs, the Chromes hit the poisoned URLs and end up in the maze too.
As long as they continue this practice, poisoned URLs work great against them. If they ever stop, I have a few other tricks up my sleeve to identify them.
I'm not sure you can assume unbounded capacity, but I think you're mostly correct. Also, I want you to be correct because I would love to see us converge on solutions with fewer Anubis-style challenge pages.
That said, I fear the optimal amount of PoW is nonzero. Because sloppy scrapers try to disguise their traffic by spreading it across many devices, they end up maximizing the number of distinct (client IP, web server) pairs, something real users don't do. If each PoW cookie is tied to one of those pairs (e.g., by containing a MAC of the client IP), it costs each legitimate user of a website a single PoW unit of work, whereas a sloppy scraper has to spend resources proportional to the number of pages on each website. This goes back to assuming unbounded capacity: right now that's a safe assumption because so few websites use PoW and there's a surplus of devices to rent, but if PoW is many times more expensive for scrapers than for people and the whole ecosystem adopts it, I can imagine scenarios where the supply is exhausted.
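The (client IP, web server) binding could be as simple as an HMAC over the pair, issued once the PoW is solved (a sketch with simplified key handling; names are hypothetical):

```python
import hashlib
import hmac
import os

# Per-server secret; in practice you'd persist and rotate this.
SERVER_KEY = os.urandom(32)

def issue_cookie(client_ip: str, server_name: str) -> str:
    """MAC the (client IP, server) pair so a solved-PoW cookie is
    only honoured from the address that earned it."""
    msg = f"{client_ip}|{server_name}".encode()
    return hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()

def check_cookie(cookie: str, client_ip: str, server_name: str) -> bool:
    expected = issue_cookie(client_ip, server_name)
    return hmac.compare_digest(cookie, expected)
```

A scraper rotating through residential IPs then has to redo the work for every (IP, server) pair it burns, while a real visitor pays once.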
My git repos are running on https://pyramid.fiatjaf.com now and they don't have associated built-in webpages, so they can't be scraped unless the scrapers know the http-git protocol.
You can still browse them from a GRASP client like https://gitworkshop.dev/fiatjaf.com which does speak the git protocol directly.