Aggressive AI scrapers are making it kinda suck to run wikis
63 points by jmillikin
This will ruffle some feathers, but.. If you use these commercial LLMs in any capacity, you're (somewhat) complicit in this.
Having read the article, I don't think it is that binary, given that the article explicitly mentions the big ones appear to actually respect robots.txt and still use custom user agents.
Even if it is that binary, I think we can agree that the AI market largely hasn't been driven by how users use these products.
I think it would be naive to assume the big ones don't also buy datasets from these less scrupulous scrapers, even doing so because they know sites block their crawlers.
Certainly a possibility; the article also mentions not being able to tell if they are double-dipping. But without actually knowing, it is still a possibility, not a fact. It is easy enough to argue that it is unlikely by simply looking at the training data cutoff for current frontier models. Google specifically seems comfortable releasing models based on fairly old training data. For example Gemini Flash 3.5 has a knowledge cutoff of January 2025. Given that Google of all companies likely already has more data than others and that they are comfortable not incorporating recent data, I'd think you can also argue that these rogue crawlers don't belong to Google and that their data isn't being bought by Google. I can't make the same argument to the same degree for Anthropic or OpenAI, as their models contain more recent data and they arguably have less data available than Google, so I won't. But either way, if we are calling something naive, I think it is making definitive claims about things we have no real data available for.
Regardless, I also made a second point. If you work in IT, which I figure a lot of people on this website do, it has become increasingly difficult to avoid using LLMs. I am not even talking about code-related tasks, but Copilot and Gemini are practically sprinkled in everywhere if your company uses Microsoft or Google for communication. As a consumer it is also next to impossible to avoid them to some degree, certainly as a regular consumer who doesn't know about alternative search engines (or even metasearch like SearXNG). The majority of LLM usage is not driven by consumer demand but by VC and other investor money. That is one thing I am actually comfortable making a definitive statement about. As an individual, using or not using commercial LLMs barely moves the needle, if at all. So making a grand statement that people are complicit in this, even more so as the only content of a comment, simply smells like a red herring to me. More specifically, it shifts responsibility.
This scraping is dumb and could be improved by orders of magnitude without changing what the companies provide. Blaming users for the current market allowing stupid inefficiencies is a bit silly. It's like if we blamed you for internet ads if you're using anything Chromium based.
Advertising on a website does not directly attack the host's infrastructure however. The manner in which these AI textorment nexus companies are scraping can essentially be taken as a direct attack against the open internet.
It's like if we blamed you for internet ads if you're using anything Chromium based.
I see nothing wrong with this. I do ask people, sincerely, to stop using Google products if they are against ad-culture. And when sometimes, the less tech-savvy folks would outline their bottlenecks in response, I delightfully guide them to alternatives.
It cannot be improved. The reason the LLM providers want every edit on a wiki page is the same reason they want every pull request from the Git forges: the edit semantics.
For example, if you paste some text into a GPT and say "Make it more concise" then you get the blended output of a million wiki edits and Git commit diffs where the previous text matches yours and the edit descriptions or comments were something like "Made it more concise".
So yes, since the crawling at that granularity is necessary to deliver the product, you participate in it by using it.
What also works quite well is to restrict certain special pages to clients that have logged in once and have the specific cookie set, and otherwise deny access. Most of the crawlers target special pages to crawl through the wiki, and those can be restricted to logged-in users. (The wiki does not allow user creation in this setup.)
something like this:
<If "%{REQUEST_URI} =~ m#^/wiki/index\.php(?:/.*)?$# && %{HTTP_COOKIE} !~ /[-a-zA-Z_]+UserID=/ && ( %{REQUEST_URI} =~ m#^/wiki/index\.php/Special(?::|%3A)(MobileDiff|History|Contributions|CreateAccount|ExportTranslations|MessageGroupStats|LanguageStats|Translate|RecentChanges|Log|RecentChangesLinked|WhatLinksHere)(?:/.*)?$#i || %{QUERY_STRING} =~ /(Special(%3A|:)(MobileDiff|History|Contributions|CreateAccount|ExportTranslations|MessageGroupStats|LanguageStats|Translate|RecentChanges|Log|RecentChangesLinked|WhatLinksHere)|action=(edit|history|info|pagevalues|purge|formedit)|oldid=)/i )">
ErrorDocument 403 "Access denied, please login."
Require all denied
</If>
This reduced the load on our system quite heavily. Before, we often had peaks that rendered the wiki unusable due to heavy crawling of the special pages.
other than that:
Funny enough, they noticed this. They used the Diff view to crawl through pages until we restricted it, then they attempted the same using the MobileDiff view. It needed some round trips, but for a few months now it's been smooth sailing this way.
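For anyone who wants to check which URLs a rule like the one above would catch before deploying it, here is a rough Python sketch of the same matching logic. The regexes are copied from the config; the function name `should_deny` and the sample cookie name are made up for illustration, and it assumes the path and query string are passed separately, the way Apache splits REQUEST_URI and QUERY_STRING.

```python
import re

# Special pages listed in the Apache rule above.
SPECIAL = (
    "MobileDiff|History|Contributions|CreateAccount|ExportTranslations|"
    "MessageGroupStats|LanguageStats|Translate|RecentChanges|Log|"
    "RecentChangesLinked|WhatLinksHere"
)

WIKI_PATH = re.compile(r"^/wiki/index\.php(?:/.*)?$")
LOGIN_COOKIE = re.compile(r"[-a-zA-Z_]+UserID=")
SPECIAL_PATH = re.compile(
    r"^/wiki/index\.php/Special(?::|%3A)(" + SPECIAL + r")(?:/.*)?$",
    re.IGNORECASE,
)
EXPENSIVE_QUERY = re.compile(
    r"(Special(%3A|:)(" + SPECIAL + r")"
    r"|action=(edit|history|info|pagevalues|purge|formedit)"
    r"|oldid=)",
    re.IGNORECASE,
)


def should_deny(path, query="", cookie=""):
    """Return True if the request would get the 403 (hypothetical helper)."""
    if not WIKI_PATH.search(path):
        return False  # rule only applies under /wiki/index.php
    if LOGIN_COOKIE.search(cookie):
        return False  # client has logged in at least once
    return bool(SPECIAL_PATH.search(path) or EXPENSIVE_QUERY.search(query))
```

Calling `should_deny("/wiki/index.php", "title=Foo&action=history")` is a quick way to confirm that history views are blocked for anonymous clients while plain article views still go through.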
Since the scrapers are finding their way onto these pages by recursively following ordinary <a href=...> links, I wonder if they could be gently steered away from doing that by making all the expensive uncached pages like the history diffs be reachable only by submitting a <form method="POST" action=...>? Wouldn't involve blocking anything, (it's essentially in the scrapers' own interests anyway since it helps them avoid recursively ingesting effectively-infinite redundant information), and normal users might barely notice the difference. Might be a good hassle vs effectiveness tradeoff. I think it might be nicer for anonymous users than Anubis.
This hinges on the assumption that scrapers don't submit HTTP forms with method="POST". I don't know if that's true. If it weren't at least mostly true, I would expect to have seen headlines by now about AI scrapers mass-vandalising Wikipedia by submitting anonymous edits that replace the contents with random garbage.
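To make the idea concrete, here is a minimal sketch using only the Python standard library, not anything a real wiki engine does today. The `/history` path, the placeholder page bodies, and the `call` helper are all hypothetical; the point is just that the expensive view answers only to POST, while the article page offers a form button instead of a link.

```python
from wsgiref.util import setup_testing_defaults

# The article page carries a form button instead of an <a href> to the history.
FORM = (
    b'<form method="POST" action="/history">'
    b'<button type="submit">View history</button></form>'
)


def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    method = environ.get("REQUEST_METHOD", "GET")
    if path == "/history":
        if method != "POST":
            # A crawler following plain links lands here and gets nothing costly.
            start_response(
                "405 Method Not Allowed",
                [("Allow", "POST"), ("Content-Type", "text/plain")],
            )
            return [b"History is only available via the form on the article page.\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"(the expensive history view would be rendered here)\n"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<p>Article text</p>" + FORM]


def call(method, path):
    """Invoke the app directly, without a server (helper for trying it out)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["REQUEST_METHOD"] = method
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Whether this actually deters a given scraper still depends on the assumption above, namely that it does not submit forms.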
That would also make the responses uncacheable, which might be problematic if you're relying on a CDN for that.
ETA: wondering if bots would hit a <form method="GET">? That should work well with the cache and require extra logic from the crawlers.
So, it has been years and somehow nobody sat and sifted through the IPs and found one from a small ISP they could contact and drive to? Nobody drove to that user and politely asked if they can inspect their computer? Nobody figured out what software is actually doing the crawling?
If site operators cannot even do this, I just do not care anymore. They are literally bending over backwards to avoid actual, messy human contact. So they get bots.
Also, obviously bots running from a residential botnet will have enough compute to occasionally get through captchas or Anubis. And a permanent server-side win is impossible here, because the user of that computer also generates legitimate traffic.
So unless you desire remote attestation, drive to those IPs!
You seem to underestimate the scale of residential botnets.
I am not. Now please tell me, which particular botnet is doing the crawling and where are the decompiles?
Do we know the operators yet?
Or is this a common offering?
Another thing that can be done, and that I've already written about here: operators could actually pool their info about crawling IPs and act cooperatively instead.
E.g. display a message that follows the user across the web: "Your computer is infected and it keeps connecting to half the Internet, causing us trouble. We will start blocking you in {} days, please contact this local organization that will help you clean your device."
Then after a week actually intermittently block them. The IP space is so trivially small this can be done and it forces local ISPs with CGN towards IPv6 and monitoring their own network.
Especially if you provide each ISP with honeypot hits from their network and maintain separate honeypots for each. Or low-overlap, to prevent tipping scrapers off.
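A rough sketch of what the pooling step could look like, assuming each operator contributes a plain list of IPv4 addresses that hit its honeypot. The function names, the two-operator threshold, and the use of a /24 as a stand-in for "one ISP" are all assumptions for illustration; a real setup would map addresses to networks via ASN or routing data.

```python
import ipaddress
from collections import defaultdict


def shared_offenders(reports, min_operators=2):
    """IPs that hit honeypots at >= min_operators distinct operators."""
    seen_by = defaultdict(set)
    for operator, ips in reports.items():
        for ip in ips:
            seen_by[ipaddress.ip_address(ip)].add(operator)
    return {ip for ip, ops in seen_by.items() if len(ops) >= min_operators}


def group_by_network(ips, prefix_len=24):
    """Bucket IPs by network prefix (crude proxy for the owning ISP)."""
    groups = defaultdict(list)
    for ip in ips:
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[net].append(ip)
    return {net: sorted(members) for net, members in groups.items()}


# Example with documentation-only address ranges and made-up operator names.
reports = {
    "wiki_a": ["192.0.2.10", "198.51.100.7"],
    "wiki_b": ["192.0.2.10", "192.0.2.99"],
    "wiki_c": ["192.0.2.99", "203.0.113.5"],
}
offenders = shared_offenders(reports)
per_network = group_by_network(offenders)
```

Requiring several independent operators to agree before an address is flagged is one way to keep a single misconfigured honeypot from getting a legitimate user warned or blocked.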
The point is, stop trying to solve this server-side! This needs client-side work. And if we do not do it the messy way, tech megacorps are going to just ask hardware megacorps for attestation and force it down everyone's throats.
Look for "Bright Data". They sell residential proxies. To get them, they use mobile app SDKs (the app owner gets paid if they include the SDK, which is sometimes disguised as an analytics solution), compromised or fake browser extensions, and compromised IoT devices. I don't know why they are not investigated by law enforcement.
https://en.wikipedia.org/wiki/Bright_Data says both Facebook and Twitter have already sued them, with the judges arguing scraping is legal. So there you have it.
So the only remaining vector is to make it a problem of the end user running the "free" app that harasses the commons.
If there is actual evidence of compromised IoT devices in, e.g., the EU, that would be a criminal offense and can be trivially reported to the police.
https://zanestjohn.com/blog/reing-with-claude-code is an example of a device (projector) sold with pre-installed malware from a residential proxy provider.
I put that product's name ("Magcubic HY300 Pro+") into amazon.de and it came up with dozens of hits from that manufacturer, such as https://www.amazon.de/-/en/dp/B0DPHMRS9V which has 10,544 reviews.
It's possible that the Magcubic devices sold on AliExpress are factory-compromised and the ones sold on Amazon are clean but I wouldn't bet money on it.
So, it has been years and somehow nobody sat and sifted through the IPs and found one from a small ISP they could contact and drive to? Nobody drove to that user and politely asked if they can inspect their computer? Nobody figured out what software is actually doing the crawling?
Amazing. Here it is, a truly terrible take.
The concept that, oh yes, you can just waste an entire weekend driving to visit somewhere specifically to ensure that a handful of bots are a little nicer to sysadmins... especially when most of them are from big and often foreign-owned ISPs, it's absolutely laughable. I want whatever you've been smoking.
Nobody figured out what software is actually doing the crawling?
These days? It's someone telling a chatbot "make something to scrape this" and each scraper is individually made.