Mitigating SourceHut's partial outage caused by aggressive crawlers
135 points by fratti
I run a far smaller git service, just for me, my wife, and some friends. It’s nothing special, and yet the scrapers have set their sights on me too. I don’t know what to do about them except block entire swaths of IP addresses once every few days, when I just start getting hit from a different CIDR range.
I have no idea how frustrating it must be to deal with it at this scale, where there are thousands if not millions of links to your service shared all around the public web, and where paying users would likely be a lot less happy than my friends are when I tell them I’m disabling web access for a few hours to get the scrapers to leave temporarily again.
This shit is exhausting. Good luck y’all. 💕
Yup - I forked QMK into a personal git forge (running on a piddly 1 core VPS) and the Meta/Anthropic crawlers decided to crawl every file in every commit. I ended up just blocking their UA by hand.
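If your forge sits behind nginx, the hand-rolled block is only a few lines. A sketch, assuming the commonly published crawler User-Agent tokens; check your own logs for what is actually hitting you:

    # Goes in the http{} context: mark requests whose User-Agent matches
    # known AI-crawler substrings (examples only; adjust to your logs).
    map $http_user_agent $block_ua {
        default                 0;
        ~*GPTBot                1;
        ~*ClaudeBot             1;
        ~*meta-externalagent    1;
        ~*Bytespider            1;
    }

    server {
        # ... existing listen/root/location config ...
        if ($block_ua) {
            return 403;
        }
    }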
It sucks, but this is what Cloudflare is for. We had to pay a lot extra for the Advanced Bot services so we could be more fine grained (our customers want to be crawled by search engines, but not AI bots nor scrapers)
Another approach: if it’s just for a few people, either block everything and allowlist the few addresses you’re actually using, OR put it behind a VPN (Tailscale, Netbird, etc.).
this is what Cloudflare is for
I am sick of solving hCAPTCHAs for Cloudflare or Google to train their own AI models just for living in the East; so much legitimate traffic is blocked by these proxy services, not to mention a centralized internet where Cloudflare can read everything in plaintext (and since they’re US-based, the NSA, CIA, and FBI can just ask for that data).
They recently published statistics about every login password that goes over their proxy service.
Likewise. I am technically disgusted by blocking /16s at the firewall level and yet there doesn’t seem to be a scalable way to deal with it otherwise.
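Concretely, the ugly-but-effective version is an ipset plus a single iptables rule. A sketch; the ranges below are documentation placeholders, substitute whatever your logs and whois lookups point at:

    # Keep offending ranges in an ipset so adding the next CIDR is one command.
    ipset create scrapers hash:net
    ipset add scrapers 203.0.113.0/24    # placeholder range from your logs
    ipset add scrapers 198.51.100.0/24   # placeholder range
    iptables -I INPUT -m set --match-set scrapers src -j DROP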
for https://git.cyberia.club, forest installed https://github.com/sequentialread/pow-bot-deterrent & it seems to work well. we still see periodic spikes of high cpu from the scrapers - but at least we make them work for it.
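the underlying idea, independent of that particular project, is hashcash-style proof of work: the browser burns cpu finding a nonce whose hash clears a difficulty target before it gets a session. a rough sketch of the server-side check (not the actual pow-bot-deterrent code, just the shape of it):

    import hashlib
    import os

    DIFFICULTY_BITS = 20  # client must try ~2^20 hashes on average

    def new_challenge() -> str:
        return os.urandom(16).hex()

    def verify(challenge: str, nonce: str) -> bool:
        # Accept only if sha256(challenge + nonce) starts with
        # DIFFICULTY_BITS zero bits: cheap to check, expensive to find.
        digest = hashlib.sha256((challenge + nonce).encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0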
A couple of days ago I couldn’t do work for a while because of this bullshit. It’s affecting people’s livelihoods.
This shit is (one of multiple reasons) why I despise LLMs and hold not insignificant disdain for the people who advocate for using them.
The way LLMs operate is fundamentally unethical, and most of their operators do even more unethical things on top of that, such as destructive, aggressive crawling and stealing and plagiarizing on an industrial scale (see Meta torrenting hundreds of books — not for personal use, like a normal individual would, but to enable a large-scale plagiarism machine).
This is not even getting into the unfathomable amounts of electrical power and computer hardware required to train them at scale.
Let me be clear, there is no way to use LLMs that is both:
- Responsible and ethical
- Worth the time, money, and energy invested
Yes, you can technically run a model entirely locally, with training data sourced ethically that you have permission to use for that purpose.
But you won’t, because doing so is impossible in practice, except maybe for small tech demos.
You won’t have enough training data, and most people you ask for more will give you the same response they gave e.g. Microsoft, Meta, Google: Fuck no.
That’s why those corporations stopped asking, if they ever did to begin with.
And if you do beg, borrow, and (mostly) steal enough data, you’ll still need an inordinate amount of computing power that is inherently both monetarily and ethically expensive.
This feels very emotionally charged. Saying you hate an ML researcher because some PM decided to poorly instruct infra engs on scraping is really naive.
some PM decided to poorly instruct infra engs on scraping is really naive.
the problem is systemic and not just “one PM” or even just one company. I think it’s perfectly reasonable to hold a grudge against this gold rush because this is the behavior it incentivizes. I am sorry though that the rest of this comment thread went the way it did, particularly about the “following orders” nonsense.
It appears you are the founder of an “AI” SaaS thing. How was your dataset obtained? Was it done ethically, as in, do the people whose messages are part of the dataset know they’re in it, and did they give their affirmative consent? How was the scraper set up? I vaguely seem to recall various Twitch bots idling in chats across the site for no discernible reason; hopefully that wasn’t you.
I hope you can answer these questions as the CEO, rather than one of your ML researchers who was just following orders.
We don’t do/have either, we use ✨TTS and LLM providers✨.
So as long as you buy something from a third party, it shields you from all ethical questions of how it was obtained?
it is (extremely) common courtesy to disclose one’s business positions within the field if they are related. you’ll see this all over lobsters, and it shouldn’t be a surprise to anybody. you opted not to, and it turns out you happen to have one.
it’s not harassment, and i’d argue the ‘emotionally charged’ train starts and ends with yourself. nobody else is speaking anything other than material facts, not using weird emojis to try and deflect.
quite obviously, the unethical behaviour being upstream of your business model does not absolve your business model from scrutiny.
maybe look a little harder (within) next time.
What was the harassment? (Just curious, as my understanding of “harassment” requires communication to be demeaning, humiliating or intimidating, and I’m not getting that from the question I read, as a relatively disinterested observer.)
I don’t think you or your site answered the poster’s questions. I looked at the links on your landing page. You say you use “✨TTS and LLM providers✨” but you never answer how the dataset that trained those providers was obtained, or whether the people whose messages were included in that dataset were informed and gave consent.
Can you answer that question?
Paying someone else to use their unethically produced thingy doesn’t absolve you of anything - it just means you can’t answer the provenance question.
This shit is (one of multiple reasons) why I despise LLMs and hold not insignificant disdain for the people who advocate for using them.
I’m not enthusiastic about LLMs, but this is a silly reason to disdain people. There’s nothing about LLMs that necessitates “aggressive” web scraping–every LLM could be trained from a single web crawler. The reasons for the “aggressive scraping” have nothing to do with LLMs except that LLMs have become abruptly popular. Dealing with traffic you don’t want is part and parcel of operating a public web service.
Yes, you can technically run a model entirely locally, with training data sourced ethically that you have permission to use for that purpose.
I mean, that’s true for just about any endeavor. You and I are having this exchange on computers which were almost certainly built from components that were manufactured using exploitative labor conditions. Does that mean we should log off permanently? Are we bad for using our computers or is the problem systemic–there is no avenue for computing that allows people to source their computers ethically? Does the fault lie with LLMs and their users or does it lie with the system which prevents users from being able to adjudicate whether the training data is “ethically sourced” or not?
Moreover, what does “ethically sourced” even mean? We can’t defer to copyright, because the copyright system is fundamentally broken.
And if you do beg, borrow, and (mostly) steal enough data, you’ll still need an inordinate amount of computing power that is inherently both monetarily and ethically expensive.
I imagine if the monetary costs exceeded the efficiencies LLMs afford, few would bother with them, so I don’t think “monetary costs” makes a lot of sense. Moreover, if energy use is “ethically expensive” then much of what we do every day is wildly unethical. It seems that 3Wh is a common estimate for a standard ChatGPT query (although that may be an order of magnitude too high), and yet 3Wh is like 0.1 miles of driving in an EV (and much less in an ICE car)–maybe you get twice or thrice as far riding public transit? Playing a video game on a high end console for an hour uses two orders of magnitude more energy. And all of this is presumably dwarfed by the energy costs associated with air travel.
Moreover, the energy efficiency of LLMs will presumably rise for a long time as the hardware and training algorithms improve. It seems pretty silly to object to LLM usage when so many other things we do (often with less utility) consume so much more power despite that those are mature technologies which have already picked their low hanging fruits with respect to improvements in energy efficiency.
Lastly, if you want to be more effective about getting people to make more ethical choices, you need to give them some kind of a framework that allows them to make reliable choices. Telling people that everything they do is hopelessly unethical is not going to motivate anyone to make ethical choices, it’s going to demoralize them and make them feel like there’s no point in striving to be ethical because they have no way of measuring the amount of harm associated with a given decision and everything is equally catastrophic (or in this case, LLMs are somehow even more catastrophic than our wildly inefficient fossil-fuel-powered transportation systems).
There are a number of pretty strong statements here that I’m not sure are really that justified.
The way LLMs operate is fundamentally unethical,
This is probably the one that seems to be the most significantly unjustified. You sort of vaguely say that operators do unethical things like crawling, but you add the word “aggressive”. Are all operators “aggressive” in their crawling? Can one crawl in a way that is not aggressive, and if so, would that address the ethical issue? Lots of sites crawl the internet for many different purposes, it seems like we mostly just want them to be respectful as they do so, but otherwise we don’t typically care too much - at least, that’s my impression. Is a search engine unethical? Is piracy? Are vulnerability scanners? There are a lot of tools out there that are at least contestable regarding their moral value that do things that are at least on face value similar to some of what LLMs do. I would probably say that, at minimum, if a site publishes information and says “this information is published only for X use cases” and requires some sort of acknowledgement of that, then there’s probably an ethical issue with violating that.
Stealing books seems unethical, I think most people would grant that, but stealing books does not necessarily seem to be a requirement for LLMs. If Meta had not pirated them but instead had purchased individual copies, scanned them, and then trained on them, would that have been unethical? I could buy a book myself and then train an LLM on it. Is that unethical?
I think there’s a really large, open question, to me at least, around what the ethical practices are with regards to copyright. I can read a book that is under copyright, I can then produce a new book based on it, even quoting directly from it - for example, to create a critical review of the book, satirize it, etc. I think it’s at least a reasonable question to determine where LLMs fall here.
I feel like your statements are pretty definitive. It’s not just “I feel like LLMs may be inherently unethical” but instead “they are inherently unethical”.
I suppose that’s all to say that I’m not really convinced.
edit: To the people flagging this as “trolling” I genuinely have to wonder what you believe that word means. I’m genuine, I’m being as clear as I can be, I’m open minded, I’ve done nothing to attack anyone, etc. I really feel that people who abuse flags this way should just be banned.
Are all operators “aggressive” in their crawling?
The issue though is that you can’t tell. FOSS projects are getting absolutely hammered by hundreds of IPs from entire IP ranges in Azure, GCP, and Alibaba Cloud. You can’t tell who is who, who is perpetrating this, or whether they could do better. ArchWiki has had huge issues with uptime lately and we’ve had to put the entire history pages behind login, as crawlers were aggressively going through all the links there.
Can one crawl in a way that is not aggressive, and if so, would that address the ethical issue? Lots of sites crawl the internet for many different purposes, it seems like we mostly just want them to be respectful as they do so, but otherwise we don’t typically care too much - at least, that’s my impression. Is a search engine unethical?
robots.txt has been a thing for 30 years. They are intentionally not following it, and whataboutism does not help you here.
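For reference, opting out of the crawlers that do publish their tokens takes a few stanzas in robots.txt (GPTBot, ClaudeBot, CCBot and Google-Extended are the commonly documented ones). The whole complaint here is that plenty of others ignore it entirely:

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /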
I suppose that’s all to say that I’m not really convinced.
Then go engage with the FOSS communities providing you with a free service, because we are really struggling with this bullshit.
The issue though is that you can’t tell.
I’m confused about why this is the issue. That doesn’t seem like the issue if we’re discussing the ethics of performing the act, that seems like an issue with how one would respond to the act. I feel like this is unrelated, right?
They are intentionally not following it and whataboutism does not help you here.
This is not whataboutism. Bringing up symmetric situations and asking for a symmetry breaker is perfectly reasonable. I feel like you’ve also misunderstood that I’ve already granted that intentionally bypassing a site’s desire to limit access is likely unethical.
Then go engage with the FOSS communities providing you with a free service, because we are really struggling with this bullshit.
I feel like we’re talking past each other. For one thing, none of this has anything to do with me, I am not crawling anyone. I also granted that crawling when a site says not to seems unethical. What I am asking for is clarification about the moral judgments regarding the fundamentals of LLMs. I don’t think it’s unreasonable when faced with strong assertions like “this technology is fundamentally unethical” to ask for a justification and provide helpful questions to help guide such a justification.
I think you are asking for some principled way to bucket the act of crawling into “good” and “bad”, but the truth is that crawling isn’t a problem until it is. For more than a decade crawlers like Google’s have crawled the entire internet without taking down any websites. Now that many startups are competing to suck up as much of the internet as possible, from scratch, as fast as possible, it is a problem.
It’s also not about “LLMs as a technology”, it’s “LLMs as in, the industry around them.” One may not be inherently unethical while the other one just is in practice.
For more than a decade crawlers like Google’s have crawled the entire internet without taking down any websites. Now that many startups are competing to suck up as much of the internet as possible, from scratch, as fast as possible, it is a problem.
There has been a standard, for many years, that allows site owners to say which crawlers should crawl which portions of their site. It got developed based on a need for sites to opt out of those search engine crawlers for part or all of their content.
These “many startups” are willfully ignoring that standard. Besides that, they’re not offering another way for sites to opt out. That is why it’s now a problem.
I’d bucket crawlers that respect opt-outs as “good” and crawlers that don’t respect them as “bad.”
We’ve had poorly behaved crawlers before…80legs, for example. This is not a new problem.
True novelty is rare. But I remember writing WAF rules to block 80legs. It wasn’t that difficult, even though they were (are?) indeed very badly behaved.
The new wrinkle on the problem is that these crawlers seem to be more numerous, more persistent, and more badly behaved. They find it cheaper to re-crawl everything than to implement a caching strategy so that they’re only picking up new things.
I think what we’re ultimately seeing here is a generalization of the “old” adage, spammers ruin everything. Businesses which externalize their costs ruin everything.
I think you are asking for some principled way to bucket the act of crawling into “good” and “bad”
If the assertion here relies on there being a good crawler or a bad crawler, yes, I would like a way to assess them. That feels reasonable, that’s how I would want every moral judgment to be made. Just as a reminder, it’s the person I responded to who brought up the issue of ethics. Moral judgments should have reasons, I think.
but the truth is that crawling isn’t a problem until it is
Do you mean like morally problematic? That would be fine, I’m asking for when that’s the case. I’ve outlined one case - that if you crawl a site that has stated it is not to okay to crawl it that it’s a violation. But I also said that I don’t think that sort of crawling is necessarily a requirement for LLMs, which is the issue here.
It’s also not about “LLMs as a technology”, it’s “LLMs as in, the industry around them.” One may not be inherently unethical while the other one just is in practice.
Right, a justification for either would be fine. For example, I responded to this statement:
there is no way to use LLMs that is both:
- Responsible and ethical […] […] doing so is impossible in practice
Whether you call this the “technology” or “the industry” doesn’t matter to me, personally - as I said elsewhere, I would grant that if either is true then both can be taken as true, so whoever wants to defend either position is free to choose whichever one suits them. Everything I’ve said applies regardless, my questions remain, and I see no defense so far.
I’m going to opt out of future participation in this thread since it seems fruitless (not you specifically, just in general).
I’m confused about why this is the issue. That doesn’t seem like the issue if we’re discussing the ethics of performing the act, that seems like an issue with how one would respond to the act. I feel like this is unrelated, right?
Not at all, this is extremely relevant.
Here is the documentation for the Google crawler: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
Now find me the same information from OpenAI or the other hundreds of AI companies. (Hint: you won’t find it).
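For comparison, Google’s documented check is just a reverse DNS lookup plus a confirming forward lookup, which is easy to automate. A sketch using only the Python standard library:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        # 1. Reverse DNS: the PTR record must end in googlebot.com or google.com.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # 2. Forward-confirm: the claimed hostname must resolve back to the same IP.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False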
This is not whataboutism. Bringing up symmetric situations and asking for a symmetry breaker is perfectly reasonable.
If the situation were actually symmetrical, sure. But as I have already explained, they are not. LLM and current AI hype companies are crawling public webpages at an unprecedented rate, leaving you no ability to block them or opt out of this massive data collection. Then they sell it back to you!
Web crawlers are not doing this.
What I am asking for is clarification about the moral judgments regarding the fundamentals of LLMs.
This feels like the “Motte-and-bailey fallacy” as we are not talking about “moral judgements regarding the fundamentals of LLMs”, we are talking about the fundamental incentives that leads to unethical behavior and operation. Which in this case is mass web crawling of webpages, torrenting books and white-washing copyrighted material.
You can’t have a competitive LLM without collecting massive amounts of data. How do you do this in an ethical manner? Clearly you can’t if the competition is all engaging in unethical behaviour?
Now find me the same information from OpenAI or the other hundreds of AI companies. (Hint: you won’t find it).
So maybe I wasn’t clear enough about this. I’m asking about LLMs being inherently unethical, not OpenAI. I also asked about scrapers that are not aggressive. You seem to just be telling me that OpenAI has aggressive scrapers. I don’t really think this is relevant, sorry.
But as I have already explained
I don’t think you’ve explained this at all (edit: I’ve just looked at every one of your posts in this topic and you keep saying that you have clearly explained this, yet I cannot find a single place where you have actually demonstrated that web crawlers and crawlers for LLMs are inherently distinct. You even say you have “carefully outlined” this and I’m honestly baffled, have I missed a post entirely? I believe this must be due to the confusion about this discussion, you’ve sort of kind of supported that one LLM based crawler is different from one web crawler, which is not relevant to the discussion). But again, it feels like there’s a misunderstanding where you’re talking about specific crawlers and I’m talking about possible crawlers.
Leaving you no ability to block them or opt out of this massive data collection.
And I’ve granted that this seems unethical…
This feels like the “Motte-and-bailey fallacy” as we are not talking about “moral judgements regarding the fundamentals of LLMs”, we are talking about the fundamental incentives that leads to unethical behavior and operation.
Uhhhh, okay, well it explains a lot about why we’re talking past each other since we’re obviously talking about two different things. The first poster I responded to said that LLMs as a technology are fundamentally unethical, and that is what I am talking about. If you are talking about something else, well, it’s good to know that at least.
You can’t have a competitive LLM without collecting massive amounts of data. How do you do this in an ethical manner? Clearly you can’t if the competition is all engaging in unethical behaviour?
I think these are good questions that are much more relevant. But I didn’t make any positive assertions. I’m asking for the positive assertion that you can’t do this ethically to be justified, it’s not on me to say that it can be because I’ve made no argument that it can be, I’ve said that I’m not convinced that it can’t be.
Go and find an AI company that advertises itself as using ethically sourced data. That would be your existence proof. Showing that its product is competitive would be the next logical step, if we care to be pragmatic and not just engage in philosophical speculation.
I can’t speak for anyone else, but I don’t especially care about all the possible worlds and the potential for the existence of hypothetical ethical LLMs. Extended speculation along these lines is not just boring, it’s in poor taste, given the context of the discussion. This is a crisis (or, if you prefer, an acute but persistent bad situation with identifiable bad actors and abusive patterns of behavior) affecting real projects and communities that Lobsters collectively cares about, and as such it demands pragmatic thinking and action.
So far, under “mitigations” we have
I would be interested in hearing about other practical (technical, economic, or legal) solutions.
That would be your existence proof.
I guess you’re thinking that if I find an LLM that is ethically produced then I would prove that it is possible. But I never said it was possible, I’m asking for someone else to justify it being impossible. I’m open to either option. If someone provided an ethical LLM implementation that would be interesting but I don’t think I need to go get one in order to ask for the statement made about LLMs being impossible to implement ethically to be justified.
Extended speculation along these lines is not just boring, it’s in poor taste, given the context of the discussion.
So don’t engage? I don’t really care about what you consider boring. There’s a little [-] button that lets you collapse threads; I find it very useful when people are discussing topics that are on-topic but ultimately not what I want to read about, and I didn’t notice it for a while. If you aren’t aware of it, there you go, it’s a really good solution for “this is a boring conversation”.
No one is obligated to participate. Someone made a bold statement, I responded. This is a forum for discussion. I find it interesting so I participated; that is all that I think is required here.
and as such it demands pragmatic thinking and action.
Okay, so we can’t talk about the ethics of LLMs because we require pragmatic solutions to the current implementations, that’s the issue? I’m not advocating against pragmatic solutions. I even said I think it’s unethical for anyone to scrape sites when those sites opt out of scraping, I’ve certainly never said anything like “we should not explore other options for blocking that” and actually I would advocate that if we understand the precise ethical issues better we’d have a better idea for things like, say, regulation, which I would support as one option.
So I hardly feel like I’m detracting from the conversation by questioning the poster who made the assertion that LLMs are fundamentally unethical to create/ operate. This is a discussion forum, I feel perfectly justified in this discussion, it feels totally valid and on topic. You not being interested in the discourse is just not important to me.
I reject the “poor taste” comment, I honestly think that’s just ridiculous. Again, this is a discussion forum and I’m not even the one who brought up the issue of ethical models.
I would be interested in hearing about other practical (technical, economic, or legal) solutions.
So would I but I’m responding in a thread about LLMs being impossible to produce ethically so that’s what I’m focusing on.
Your advice about the [-] button is appropriate, and that’s what I will do. Thanks for the reminder!
I haven’t flagged (and never will flag) your long and philosophical posts as “trolling” or any such thing. But if you are ever trying to deal with a crisis situation, you solicit practical help, and then somebody comes strolling by and tries to engage at length in philosophical abstractions about tangential issues prompted by some (perhaps inexact) language in your cry of distress… well, maybe you can understand the troll flags on that basis. Cheers!
Hm. I didn’t see this post as a cry for help, or anyone from SourceHut here looking for help. To me, a conversation on ethics of LLMs seems entirely relevant and appropriate for a case where we see harm from LLM companies, especially given a comment saying that LLMs are harmful. I really just don’t accept that I’ve done something wrong by engaging in good faith discussion about something that seems so central to the issue. This isn’t the sidewalk, it’s not like someone has run up to a police officer saying “I’ve been mugged” and I interrupt to say that I challenge the ethics of testimony or something ridiculous - again, this is a discussion forum, I am discussing something that someone else brought up.
I also don’t think I was doing any sort of language lawyering or whatever, nor do I see these issues as tangential. But that’s fine if others disagree; the [-] button is there for this reason, I assume.
I don’t think anyone should have been flagged in these threads, but I can easily see how this thread could look like sealioning.
I really don’t see how much more clearly I could be acting in good faith. I’ve outlined my concerns with the statements, I’ve stated my own personal position (agnostic), I’ve granted multiple points (it is unethical to access a website where that site has a policy against your use), I’ve framed the argument such that I would accept the broad statement “it is impossible to build an LLM ethically” if any portion of the LLM process (collecting data, building the model, operating it, scaling it, whatever) is justified, etc.
I mean, I don’t know what “sealioning” ends up counting as but I feel like I’ve been pretty gracious and asking for bold assertions to have any justification is a far cry from that.
I really don’t see how much more clearly I could be acting in good faith.
Another option is not acting?
On a more constructive note, you seem to have developed some ideas and have some questions about a topic. Why not write a blog post summarizing some points and counterpoints, ask for clarification on the missing bits, and post it as a new top level submission here and elsewhere?
Another option is not acting?
This is where I’m landing, yeah. I thought the goal for lobsters was in part to have a higher bar than HN but I’m realizing more and more how little I get out of posting here. Low content rhetoric is consistently the highest voted posts here, I just don’t think it’s for me after all.
I can’t speak for anyone else, but I don’t especially care about all the possible worlds and the potential for the existence of hypothetical ethical LLMs.
We haven’t even established a clear consensus for the ethics of LLM data sourcing. That’s the entire point of this thread.
Extended speculation along these lines is not just boring, it’s in poor taste, given the context of the discussion. This is a crisis (or, if you prefer, an acute but persistent bad situation with identifiable bad actors and abusive patterns of behavior) affecting real projects and communities that Lobsters collectively cares about, and as such it demands pragmatic thinking and action.
I hope this doesn’t contribute to the elevated rhetorical temperature–that’s definitely not my intent–but I don’t know how we can satisfy the demand for “pragmatic thinking and action” if any attempt to seriously analyze the problem is ruled out as “boring, poor taste, extended speculation”. If you don’t properly analyze the problem, you’re going to miss out on a lot of the practical solutions:
I would be interested in hearing about other practical (technical, economic, or legal) solutions.
CDN, authentication, firewall rules where possible. It’s also probably pretty easy to automatically identify and IP-ban the overwhelming majority of crawlers by detecting patterns in the requests they make–I imagine someone has already built this. A sufficiently dedicated crawler could convincingly spoof this sort of automation, but I’m not sure how much they’re going to bother if they can just get the same information from CommonCrawl or similar.
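A crude version of that detection is nothing more than a per-IP counter over the access log. A sketch; the threshold and the assumption that the client IP is the first field of a combined-format log line are mine, tune both to your own traffic:

    import re
    from collections import Counter

    LOG_LINE = re.compile(r'^(\S+) ')   # client IP is the first field in combined logs
    THRESHOLD = 500                     # requests per log window before an IP looks like a crawler

    def suspicious_ips(log_path: str) -> list[str]:
        hits = Counter()
        with open(log_path) as f:
            for line in f:
                m = LOG_LINE.match(line)
                if m:
                    hits[m.group(1)] += 1
        # Feed these into an ipset, a fail2ban jail, or your CDN's block list.
        return [ip for ip, n in hits.items() if n >= THRESHOLD]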
CDN, authentication, firewall rules where possible.
In other words, treat this traffic as spam.
Sort of. CDN isn’t just for spam, but yeah, when you’re operating a public Internet service you can either accept the realities of dealing with traffic from the public Internet or you can shout into the void about how you ought not have to deal with said realities, but only one of those two is going to fix the problem. It would be great if everyone on the Internet was well-behaved, but that’s not the case and badly behaved, LLM-feeding crawlers are only one category of unwelcome traffic you’re going to have to deal with anyway.
Thanks for extending my short list! Those are all good things to keep in mind. As for the “probably pretty easy” bit, could you do us a favor and submit a new story link to something like that if you find or build one?
I think the relevant analysis of the problem is pretty simple, and has to do with the economics of hype bubbles and the breakdown of unenforced civil norms like robots.txt. Further than that, you are welcome to go: I hope you and whoever goes along with you comes up with something great. I’m done with this thread, myself.
LLM and current AI hype companies are crawling public webpages at an unprecedented rate.
So to be quite clear, LLMs are unethical because they are popular, and that popularity is driving an increase in the rate at which public webpages are crawled. Is archiving public websites unethical? Does it become ethical if disk prices fall and suddenly everyone decides to archive the public Internet? Similarly, if everyone decides they want to self-host a search engine, does Internet search become unethical?
It really seems like the issues are less with LLMs and more with (1) their popularity and (2) the choices made by large scale LLM producers with respect to crawling (e.g., ignoring robots.txt and the like). Few if any of the criticisms mentioned have anything to do with LLMs in particular.
Now find me the same information from OpenAI or the other hundreds of AI companies. (Hint: you won’t find it).
This in particular is a pretty egregious moving of the goalposts from LLMs to many of the companies that produce them.
So to be quite clear, LLMs are unethical because they are popular, and that popularity is driving an increase in the rate at which public webpages are crawled. Is archiving public websites unethical? Does it become ethical if disk prices fall and suddenly everyone decides to archive the public Internet? Similarly, if everyone decides they want to self-host a search engine, does Internet search become unethical?
I’ve carefully outlined already how LLM crawlers and the usual webcrawlers are two extremely separate things. Yet people still try to whatabout regarding search engines?
Jesus, people. Please stop. They are not the same thing. The argument is a surface level relevant argument. Please.
This in particular is a pretty egregious moving of the goalposts from LLMs to many of the companies that produce them.
You are in a thread about FOSS projects getting DDoSed because of LLM crawlers, and you think holding them accountable is moving the goal post?
I’ve carefully outlined already how LLM crawlers and the usual webcrawlers are two extremely separate things. Yet people still try to whatabout regarding search engines?
I don’t see where you outlined it. I see you distinguishing between web crawlers and LLM crawlers in that LLM crawlers are crawling at an “unprecedented rate”, but that doesn’t seem like a distinction between LLMs and search engines. I’ve also seen you mention that LLM crawlers are coming from a variety of IP addresses, which also does not seem like a meaningful distinction between LLMs and search engines. Feel free to point me to your careful outline of the differences between LLMs and search engines because I seem to have missed it.
I also think we may be talking past each other, because I don’t think I’m making a “whatabout search engines” argument–I just don’t see how crawling the web to feed LLMs is categorically different than crawling the web for other purposes.
You are in a thread of FOSS projects are getting DDoSed because of LLM crawlers, and you think holding them accountable is moving the goal post?
No, “moving the goal post” was about starting from “LLMs are fundamentally bad because of crawling” to “these LLM companies are badly behaved”. I’m not advocating for AI companies, I’m advocating for understanding this as a specific case of unwelcome Internet traffic–this analysis points FOSS projects toward the tools they need to use to mitigate: auth, CDNs, rate limiting, firewall rules, etc as appropriate. Ideally everyone on the Internet would be a well-behaved citizen and we wouldn’t have to worry about unwelcome traffic, but that’s not how the Internet works.
You’re talking to someone expressing great frustration at the firehose of worthless LLM traffic that’s making it difficult for them to host their community’s services and are tying them up in endless hairsplitting and inane hypotheticals. Like, this, for example. Foxboron writes:
LLM and current AI hype companies are crawling public webpages at an unprecedented rate. Leaving you no ability to block them or opt out of this massive data collection. Then they sell it back to you!
You reduce this to
So to be quite clear, LLMs are unethical because they are popular, and that popularity is driving an increase in the rate at which public webpages are crawled.
which is just insulting—don’t write “so to be quite clear” and then follow it up with something the other person absolutely did not say. Of course LLMs aren’t unethical because they’re popular, they’re unethical because of the immense pressure they put on people providing free resources for their fellow human beings who don’t consent to this selfish exploitation.
I don’t believe for a second that you think this a reasonable way to behave. Please stop it.
Of course LLMs aren’t unethical because they’re popular, they’re unethical because of the immense pressure they put on people providing free resources for their fellow human beings who don’t consent to this selfish exploitation.
You contradict yourself. The “immense pressure” is a function of popularity (LLMs are popular and thus lots of people all over the world are building them at a frenetic pace resulting in a ridiculous amount of crawling), which is precisely what I was remarking about in the comment you found to be reductive.
Anyway, this thread is getting far too heated, I’ll duck out until the temperature dips.
in Azure, GCP and Alibaba Cloud
Is there any reason to allow them on this service? Tor and public clouds are the first things I block for my services, and it magically solves almost all spam traffic issues for me. That one person running a custom VPN through an AWS instance gets affected, but that’s about it.
They still get access to webhooks/api/whatever, but that’s not possible to browse, so crawlers just get a few errors and go away. Dropping a request which doesn’t include an authentication header is really cheap.
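In nginx terms that is a couple of lines. A sketch: 444 is nginx’s “close the connection without responding” code, and the upstream name is a placeholder:

    location /api/ {
        # Close the connection immediately if no Authorization header is present;
        # anonymous crawlers never get past this.
        if ($http_authorization = "") {
            return 444;
        }
        proxy_pass http://backend;   # placeholder upstream
    }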
Blocking Arch Linux services from the public clouds would make the distro unusable in the public clouds. This is doable if you are running a hobby service.
The service mentioned here is archwiki specifically. Servers don’t need to access that one.
(packages are already distributed through mirrors - I really hope that doesn’t get spam crawled if the autogenerated file index is not created on access)
Archwiki is just an example because that is what people know from our project. However other services have very much also been affected in waves. This includes archweb and aurweb.
How difficult would it be to coordinate with the cloud providers and say “We’re blocking this, host a mirror, maybe do silly network routing stuff”?
I can read a book that is under copyright, I can then produce a new book based on it, even quoting directly from it - for example, to create a critical review of the book, satirize it, etc. I think it’s at least a reasonable question to determine where LLMs fall here.
I assume that you aren’t software.
This feels like the beginning of the “LLMs learn, humans learn, what’s the difference?” argument. If so, here are some:
So I think that those are interesting ways to break symmetry between a human and an LLM, but I don’t think they’re necessarily relevant to the ethical issue at hand. A major problem here is that the ethical issue hasn’t actually been spelled out - my questions are not intended as defeators to an argument (no argument has really meaningfully been posed) but instead were intended as helpful guides; a good argument would probably want to address things about LLMs that are distinct with regards to, say, copyright or data collection.
So yes, I can’t be copied. Does that address the ethical difference we’re concerned with? What is it about the property “can be copied” that should impact my moral judgment here?
Perhaps it’s just unclear what I’m after here. Strong assertions were made, I want to see those assertions supported. I brought up potential issues I could see that I would probably want to see addressed by any argument that is presented in the future where the assertions had an argument attached. Ad-hoc answers to my questions isn’t really the point, I can obviously think of many ways in which a person and an LLM are different, but I am asking about the moral differences with regards to the inherent technologies of LLMs.
I think these are interesting problems. The ethics around AI and copyright are really interesting. Someone saying “LLMs are inherently unethical” is interesting to me and I’d be really curious to see such a position defended as I am personally agnostic on the issue.
There are quite a few spelled out elsewhere upthread and downthread. But, for this thread, I’ll explain my “can’t be copied” response to the “LLM learn like humans, what’s the issue?”
Copyright is a balance between different people’s rights. LLMs don’t get the human benefits of copyright by virtue of “learning.” Humans can’t be copied and so the outcome of our learning process and our use of knowledge is shaped very differently. LLMs take a very different shape, one that is closer to a derivative work than an educated human. The copyright compromise for derived works is different. Claiming it for an LLM is an attempt at regulatory arbitrage ala Airbnb or Uber in their early days.
“LLM learn like humans, what’s the issue?”
I just want to point out that I never asked this, this was a question you asked (or assumed I was asking, I guess). My questions were just to frame the sorts of challenges I’d expect an ethical evaluation of LLMs to be able to address.
That said, what you’re getting at with regards to rights being ascribed to persons is interesting but I’m not sure I really follow it all the way. Ultimately, it doesn’t really matter, we can find quite a large number of distinctions but I don’t know which ones would be relevant at this point.
The root comment claimed that “LLMs are unethical”, which is charitably interpreted to mean “the practices that produced all of the existing frontier models are unethical, and it is unethical to support them.” You’re objecting to the claim “the LLM/transformer architecture is intrinsically unethical”, but my honest opinion is that that’s a fairly pedantic interpretation. It’s obvious what was meant.
(I didn’t and don’t endorse downvoting/flagging you)
I think even if you say that the operations of LLMs, or the creation of LLMs, etc, is unethical, it requires justification and faces the same challenges I presented. I’m entirely willing to say that if the production of LLMs is inherently unethical then LLMs are unethical, that’s fine by me and I should have been clearer about that (I’m not asking about the fundamental algorithm, I’m asking about the fundamental technology, which includes the building/operation of LLMs).
I’m granting the absolute strongest position here - that if the building, operating, or underlying algorithm premise of LLMs is unethical then all of it is unethical. I just want to see arguments about any of that. It may seem pedantic if I’ve made it sound like I’m constraining here but I think my posts make it clear that I’m actually taking the entire argument (ie: operations, building, etc) into account, not just the transformer model or something - I’ve asked specifically about issues like copyright and scraping, which have nothing to do with the transformer algorithms themselves.
Previously you said
I’m asking about LLMs being inherently unethical, not OpenAI.
But OpenAI’s practices are the topic of discussion. Even if the only unethical part is indiscriminate scraping / ignoring robots.txt, that’s what produced (all of) the current frontier models.
edit: again, the simplest way to interpret the root comment is that “LLMs” refers to the current frontier models, made by OpenAI, Google, and Anthropic.
edit: again, the simplest way to interpret the root comment is that “LLMs” refers to the current frontier models, made by OpenAI, Google, and Anthropic.
Ah, I don’t take it this way at all. I took it as it being inherently impossible to produce LLMs ethically. For example
But you won’t, because doing so is impossible in practice, except maybe for small tech demos.
This to me reads as “it isn’t possible” not “those companies aren’t doing it”.
OpenAI are literally making the case to governments that it isn’t possible to create working LLM products without free rein to ignore copyright law and the wishes of content creators.
Is a search engine unethical? Is piracy? Are vulnerability scanners? There are a lot of tools out there that are at least contestable regarding their moral value that do things that are at least on face value similar to some of what LLMs do.
I think the scale is precisely what makes people say that LLMs are inherently unethical. As-is, the kind of LLM you see in the wild requires such a broad corpus of training data that fine-grained consent gathering strikes me as extremely difficult to pull off (not to mention expensive!)
Is there some version of an LLM that could be trained ethically? Maybe, but I suspect it would require the companies doing the scraping to operate in fundamentally different ways — modes of operation which they have no incentive to pursue.
I would say people probably think you’re trolling because you’re entertaining extremely abstract scenarios which presuppose a situation in which the largest companies in the world act in ways diametrically opposed to the way they are currently acting and have historically acted. It’s pretty clear to me that OP is saying something in the back half of their comment like “given the constraints of the world we live in, this thing is true.” It comes across as trolling if you start pursuing forms of inquiry that ignore those constraints. It’s like someone said “Pigs cannot fly” and you responded “Wait a minute. Many animals have wings. What if a pig was born with some? Do we think it would be capable of flight?”
Back in July 2019 I was investigating some bad bots on my website when I came across one that identified itself simply as “The Knowledge AI”; it was the number one robot hitting my site. Most bots that identify themselves will give a URL to a page that describes their usage, like Barkrowler (to pick one that recently crawled my site). But not so “The Knowledge AI”. That was all it said, “The Knowledge AI”. It was very hard to Google, but I wouldn’t be surprised if it was OpenAI.
The earliest I can find “The Knowledge AI” crawling my site was April of 2018, and despite starting on April 16th, it was the second most active robot that month. In May it was the number one bot, and it stayed there through October of 2022, after which it pretty much dropped—from 32,000+ in October of 2022 to 85 in November of 2022 (about 4 1/2 years). It was sporadic, showing up in single digit hits until January of 2024. It may be still crawling my site, but if it is, it is no longer identifying itself.
I don’t know if “The Knowledge AI” was an LLM company crawling, but if it was, not giving a link to explain the bot is suspicious. It’s the rare crawler that doesn’t identify itself with at least a URL to describe it. The fact that it took the number one crawling spot on my site for 4 1/2 years is suspicious. As robots go, it didn’t affect the web server all that much (I’ve come across worse ones), and well over 90% of its requests were valid (unlike MJ12, which had a 75% failure rate). And my /robots.txt file doesn’t exclude any robot from scanning, so I can’t really complain about it.
As to your question: “I could buy a book myself and then train an LLM on it. Is that unethical?” Maybe? I know that Hunter S. Thompson literally copied (on a typewriter) books by F. Scott Fitzgerald and Ernest Hemingway, but he never sold them as his own work. Dave Sim liberally copied the art style of Barry Windsor-Smith for his comic book Cerebus but did go on to later develop his own style. But during his “Barry Windsor-Smith” era, it was clear he wasn’t Barry Windsor-Smith. But with an LLM, it can ingest way more material than a single human can. It would be one thing to ask an LLM to write a book about an old man fishing in the sea. It might be another to ask an LLM to write a book about an old man fishing in the sea in the style of Hemingway. And I think it’s that last bit that has most people who hate LLMs up in arms—if you can ask for text (or an image) in the style of an artist, why would you need the original artist? Would it be ethical to ask an LLM for “a book about Albus Severus Potter’s first year at Hogwarts, in the style of J. K. Rowling?” Would it be ethical to release it? Selling such a work I think would be unethical—first, you aren’t J. K. Rowling. Second, the property is still under copyright. But that brings up another angle—asking an LLM for “a new Sherlock Holmes mystery, in the style of Arthur Conan Doyle.” Would it be ethical to release it? As a story written by Arthur Conan Doyle? As your own work? Disclosing/not disclosing the use of AI? There’s a line that’s easy to cross if all you are after is money.
I’m not sure if I would label LLMs as “inherently unethical” myself. Can they be used unethically? Sure. Can they be used ethically? Sure? I don’t think we as a society have found the line between the two, and it seems that the companies pushing the AI models aren’t thinking critically about it either—the money is too tempting in my opinion.
In addition to the whole “potable water”, “copyright”, etc. problems, LLMs are inherently unethical because they have completely undercut the normal, convenient processes for learning how to do things where information is handed (socially distributed!) from experts to beginners, with machines that have absolutely no concept of the domain, no concept of correctness, and no concept of the risks if shit hits the fan.
The reason why programming has successfully dodged any kind of industry standard certification process is because while the barrier to entry is low, there are still enough hurdles to even simple tasks like, e.g., “correctly writing a web scraper”, that a random person would be placed in contact with learning materials that would ensure that, by the end of it, they have some kind of understanding of What Not To Do — or, failing that, they would be otherwise inducted by their peers in a junior role and taught it. The hurdles themselves served as a mechanism of protection, barring anyone who hadn’t been socially inducted into industry best practice from being able to successfully write, test, compile, and deploy the next THERAC-25 (or, less dramatically, web scraper) — while also culling the herd of people who only care about the ends, and not the means — people who not only do not give a shit about unknown unknowns, but who also do not give a shit about known unknowns, and just want to finish something.
The group of people who are drawn to LLMs are people who don’t want to learn how to do the thing properly (or otherwise ask for help and communicate with other people(!) to achieve their goal), are fundamentally looking for shortcuts to understanding and knowing, and aren’t willing to put in the legwork to understand enough about the programming task to break it down sufficiently for the LLM to manage it without glaring flaws that will blow up in their (or in this case — other people’s) face. It reminds me of seeing all these home DIY projects where they’ve cut through all the joists, or where a human-supporting structure is suspended on these absolutely horrifyingly thin plywood stilts. Ok, it “works” right this second… but for how long, and who is going to be hurt when it fails?
To be clear: I feel that the same thing that drives someone to completely undercut the stability of their house is also the same motivator that drives them to use an LLM for knowledge or programming work. It’s not 100% their fault, because in both cases there’s a whole industry telling them “you can do this”, encouraging them to push aside the fact that there are things in life that are actually difficult to do correctly, dangerous if you don’t do them correctly, and that will have harmful consequences for people — costing money and resources if you’re lucky, and becoming illegal and life-threatening if you’re not.
On top of this, DIY is actually in a better situation — most people with a basic understanding of physics can look at bad DIY examples and go “wow, that’s dangerous” (of course, the homeowners creating the bad DIY structures often do not understand why it’s bad, or don’t care), whereas programming is actually much harder to ascertain straight off the bat and it can take experts months to realise that there’s something suspicious or wrong that might materially hurt people. There is an entire industry dedicated to avoiding obvious security holes, OWASP and other efforts exist to educate people on “what not to do” and how to solve these problems, and yet e.g. Sony still leaked millions of people’s credit info in plaintext, a bug in ChatGPT caused thousands of people’s credit card data to be leaked, etc.
And fundamentally, I think the underlying rot here is just a systemic, cultural disrespect for processes and things that are difficult, or hard, or that take time. LLMs are inherently unethical because they encourage that disrespect and disregard. They give someone a car to drive without any guidance, without them ever having seen a road sign or having been told that “braking before you hit a corner is a really good idea, actually”, and then the people handed these cars expect to be able to run the grand prix.
I do, for what it’s worth, believe in making computer automation more accessible to everyone, but even ignoring that accessibility is contextual (a corollary of disability being contextual), there is a fundamental limit to how accessible you can make a task before you are creating Danger that the person you’re making it accessible for might not even be able to comprehend. Whole classes and categories and taxonomies of danger they have no reference point for understanding why it is bad, or how! Again, just look at how salient and easy to comprehend and explain a lot of bad DIY is, and now think about how many people still don’t understand it, and put their family and friends in literal, mortal danger! Now think about trying to explain, like, a really simple postcode address leak and its solution? Or GDPR mishandling that puts them at risk of liability?
And sure, ok, fine, the current danger is that the LLM is causing serious financial damage to independent developers, or that it hallucinates an extra rm -rf --no-preserve-root or rm /path/to/the/customer/database (an issue that has already cropped up in production software written and vetted by humans[4][5]), or that it miscalculates something, or accidentally creates an SQL injection vulnerability in a package that thousands of people rely on.
The future danger is scammers poisoning the inputs to make it call out to a C&C server, or something much worse that we can’t even see at the moment. I mean, we’ve already seen Google Gemini almost kill people twice. Once by giving them instructions that would have resulted in botulism (which is yet another example of a ridiculously easy task where every easily available guide on the internet mentions “do it wrong and you will kill yourself”).
So yes, it’s unethical. What are the failure states with code — are we really sure they can’t hurt people?
- Responsible and ethical
- Worth the time, money, and energy invested
These are extremely hard criteria to evaluate, let alone meet, for huge swaths of human endeavor. I find myself leaning in a consequentialist direction as it comes to “AI” these days.
I think Robin Sloan’s piece, “Is it Okay?”, presents a fairly level headed interpretation of the current state of things. (https://www.robinsloan.com/lab/is-it-okay/)
If an AI application delivers some profound public good, or even if it might, it’s probably okay that its value is rooted in this unprecedented operationalization of the commons.
If an AI application simply replicates Everything, it’s probably not okay.
I think it’s fair to disagree whether the ends justify the means here. I personally do think significant advances in medicine may be achieved on our current trajectory. Is that worth it?
What basis do you have for thinking that? I’ve only seen hype and pie-in-the-sky, myself.
But the harms are real, they are concrete, and they are acute. Trading speculative potential future good against immediate ongoing harm isn’t the kind of slippery slope I’d like to build any such argument on. Sloan’s “probably okay” really doesn’t cut it for me.
So, what is the actual motivation for all of this crawling here? Building and training a SOTA model still involves millions of dollars, so the number of people collecting data for that aim seems pretty minimal. Are we assuming that there are high numbers of bots due to people building independent RAG databases?
One challenge here is that if you do want to narrow down who can access your data, then you’re sort of stuck in the position of defending it, because you also have to maintain control of access and egress. The alternative is that if you want your data to be public, then it can be contributed to something like Common Crawl, or Common Corpus. In that situation, what we have is a coordination problem: how do we explain to these crawlers that they are doing this in an inefficient way, and that they can just collect this information en masse in another, publicly available repository.
Let me be clear, there is no way to use LLMs that is both:
- Responsible and ethical
- Worth the time, money, and energy invested
Sure there is.
Every language community should collect all of the material it can find in that language, train an LLM on it, and then release that LLM as open source.
We recently dealt with what I can only assume are LLM scrapers, though not identified as such. We’d receive bursts of thousands of requests over a short period almost nightly. We hit this with Claude in the past, but this situation was a bit different.
Out of several tens of thousands of requests (during the period I was investigating this):
This meant we couldn’t just block an IP, and couldn’t easily honeypot anything. Despite best efforts, we couldn’t find any useful IP address ranges that weren’t just “all of China.”
We made some progress by blocking requests from unlikely User-Agents (e.g., MSIE4, macOS 7, Windows 98 for PPC), but requests under those agents soon dried up: the traffic shifted to subsets that were still able to get through, using more realistic (if still fake) User-Agents. Basically, they were adapting to our blocking methods, and it seemed pretty automated. (A sketch of that kind of first-pass filter is included after this comment.)
Same with blocking specific ranges. We were seeing the traffic pick up from ranges that weren’t being blocked.
None of this acted like an intentional DDoS attack. It really seemed like scraping, but done in a way that was hard to block without blocking all of China.
So we blocked all of China. Well, not really. We had to throw AWS Shield in front of it, and initially we set it so after a certain number of Chinese IP requests in a certain period of time, rate limiting would kick in. We just ended up with smaller bursts more often. Same rough amount of requests in a day, but split up more.
We had to put all of China behind one of those checkbox-based captchas. That pretty much took care of the problem for us. We haven’t had any other type of traffic coming in, and after doing this these requests pretty much went away. Every so often they come back, but seem to run into the captcha and deprioritize.
Been curious how many people have hit this pattern. I’ve seen some discussion online, but not much.
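For anyone wanting to replicate the first step that commenter describes, a first-pass User-Agent filter is only a few lines. This is just a minimal sketch: the fragment list and the WSGI wiring are illustrative assumptions, not what they actually ran.

# Minimal sketch of a first-pass User-Agent filter (WSGI). The fragments
# below are illustrative; a real deployment would maintain its own list.
SUSPICIOUS_UA_FRAGMENTS = (
    "MSIE 4",        # Internet Explorer 4, released in 1997
    "Mac_PowerPC",   # classic Mac OS on PowerPC
    "Windows 98",    # Windows 98-era browsers
)

def looks_like_fake_browser(user_agent: str) -> bool:
    # True if the User-Agent claims a browser nobody realistically runs today.
    return any(fragment in user_agent for fragment in SUSPICIOUS_UA_FRAGMENTS)

def ua_filter(app):
    # Wrap any WSGI app; reject requests with implausible User-Agents.
    def middleware(environ, start_response):
        if looks_like_fake_browser(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

As the comment above notes, this kind of filter mostly just pushes the bots toward more plausible User-Agents, so it buys time rather than solving anything.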
I wonder how much of this is Chinese companies trying to train their LLMs and how much is the CCP trying to get the rest of the world (or west in general) to implement the Great Firewall for them.
trying to get the rest of the world (or west in general) to implement the Great Firewall for them.
That would certainly be an interesting tactic, anyway. But I don’t have the general sense that the CCP wants things like git forges to block them. And I don’t have the sense that sourcehut hosts much that’d be problematic for any government.
Also worth mentioning, the IPs I ran through IP blacklists (random sampling) came back clean.
Operators of nitter instances can tell you a very long tale about this kind of home-routed-VPN bots coming from various random IPs and shifting their traffic every time you intervene.
Update from the owner of Sourcehut: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
This article states that the volume of traffic from these kinds of bots is several orders of magnitude larger than that from traditional search engines. This puts another spin on the comments here that glibly state “we’ve seen this before, GG scrub webmaster”.
I would imagine if the git web viewer were purely static, this wouldn’t be such a big issue for srht. This is a great reason why I’ve been trying to decouple the git web viewer from direct access to the underlying git repo.
Neither tool has direct access to the underlying git repo, but both still contain most of the important bits.
When you build a crawler for a large swath of the web, is it so inconvenient to avoid making your traffic to any given site a single intense burst? A shared work queue and distributed BFS seem like the least they could do.
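For comparison, per-host politeness is genuinely not much code. Here is a minimal single-process sketch of the idea; the fetch callback, the 10-second delay, and the BFS framing are assumptions for illustration, not any real crawler’s internals.

import heapq
import time
from urllib.parse import urlsplit

PER_HOST_DELAY = 10.0  # illustrative: at most one request to a given host every 10 seconds

def polite_crawl(seed_urls, fetch):
    # fetch(url) is assumed to return an iterable of newly discovered URLs.
    next_allowed = {}                             # host -> earliest time we may hit it again
    queue = [(0.0, url) for url in seed_urls]     # (not-before timestamp, url)
    heapq.heapify(queue)
    seen = set(seed_urls)

    while queue:
        not_before, url = heapq.heappop(queue)
        host = urlsplit(url).netloc
        ready_at = max(not_before, next_allowed.get(host, 0.0))
        time.sleep(max(0.0, ready_at - time.monotonic()))
        next_allowed[host] = time.monotonic() + PER_HOST_DELAY

        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                host_hint = next_allowed.get(urlsplit(link).netloc, 0.0)
                heapq.heappush(queue, (host_hint, link))

A distributed version just moves the queue and the per-host timestamps into shared storage; the politeness logic itself stays this simple.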
I do occasionally wonder whether the bots themselves are built at least partially with LLM-generated code, or by engineers who lean heavily on such tools. It would explain the incredible lack of, uh, common decency on the part of the bots and their owners. I don’t have any hard evidence though, so it’s all speculative ¯\_(ツ)_/¯
If the people who are trying their best to hoover up the entire internet in their quest for a better ELIZA gave a single damn about whether they inconvenience others they’d find a different line of work.
I vaguely recall hearing that there is an open crawling initiative that aims to basically crawl the Internet and make the results available to anyone who wants them. I’m not sure if they charge for the results or not, but one can imagine an initiative that makes the results available via torrent or similar so that the service itself is cheaper than crawling the Internet on one’s own.
I wonder if the operators of these crawlers could devise some sort of technology that identifies what the text on a page means and whether following its links is likely to yield useful or unique information. It’s probably too hard a problem, unfortunately.
This situation is just plain stupid. Does anybody have experience with tools like Cloudflare’s bot-fighting offerings? I know that a CDN wouldn’t help with dynamic content, but I’m curious what the experiences with those tools are.
Cloudflare’s bot fighting tools mostly work, but I had to make my own thing to really get relief after I cut Cloudflare out of my stack.
A friend has a small MediaWiki instance that is getting pummelled. I’m looking into adding a response-time delay to the first access of a URI within some given time window. The only pattern I can discern at the moment is that they’re hitting URIs almost nobody else visits. In our case, they’re faking UAs and using botnets.
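That delay idea can be prototyped in a handful of lines. The window and penalty below are placeholders, and the in-memory dict would need eviction in practice; this is just a sketch of the approach the comment above describes.

import time

WINDOW_SECONDS = 3600   # illustrative: "recently requested" means within the last hour
DELAY_SECONDS = 5.0     # illustrative penalty for cold URIs

_last_seen = {}          # URI -> monotonic timestamp of its most recent request

def maybe_delay(uri: str) -> None:
    # Sleep before serving a URI nobody has requested recently. Humans browsing
    # popular pages rarely hit the delay; bots sweeping every obscure URI pay it
    # on almost every request.
    now = time.monotonic()
    last = _last_seen.get(uri)
    if last is None or now - last > WINDOW_SECONDS:
        time.sleep(DELAY_SECONDS)
    _last_seen[uri] = now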
One thing I don’t understand is what it is about AI crawlers that makes them this pesky. I’d imagine crawling was already being done by companies like Google well before the unfortunate descent into LLM mania.
One difference is the incentive structure. A person who hosts a website getting crawled by search engine indexers wants to get their website into search engine indexes, so they can get visits from search engine users. They therefore have a good incentive to only disallow stuff in robots.txt that is of no value to search engine indexing anyway, and the crawler writer also has a good incentive to honour robots.txt, precisely because it will usually contain things of no value to them.

This is different for AI scraping: the AI crawler writer simply wants to take everything, giving nothing back to the host (in fact, as per OP, the host loses operational stability and, depending on their hosting solution, might even have to pay money for the privilege of being abused by AI companies). So the relationship is necessarily adversarial: the website owner wants to prevent AI crawlers from 1. stealing everything, and/or 2. effectively DDoSing them, and the AI crawler writer has no incentive to even try to play nice, especially considering that their entire economic value proposition is putting said website owner out of work.
It’s also possible that the crawlers are simply incompetently made. But part of it, it seems to me, is that there is no reason for them to try to not be abusive, unlike earlier web crawlers.
It’s also possible that the crawlers are simply incompetently made. But part of it, it seems to me, is that there is no reason for them to try to not be abusive, unlike earlier web crawlers.
I think it’s both. They’re incompetently made (no appropriate caching scheme, etc.) because there’s no incentive for them to be competently made. In that specific case, bandwidth is so cheap for them that there’s no reason to spend the effort caching. But I agree with you about the other misaligned incentives that encourage abusive behavior.
I think you’ve hit the nail on the head.
Previous well-behaved crawlers (like for Google’s search) were in symbiosis with websites / content providers.
Recent scrapers for LLM mills are parasitical.
As a site full of structured data and programmers, Lobsters maybe sees more than its fair proportion of bots that are the first bot someone has written. But my hunch has been that a lot of the AI crawlers are just immature. Search engine crawlers in the 90s were not particularly smart about how often they recrawled sites, with similar effects. It took each of them 2-3 years to learn that their crawl could be much more efficient with some heuristics about how likely page content is to have changed, and better internal feedback for throttling their crawl rate when server responses slow down. It could be that the current crop of AI crawlers is too new and too flush with cash to have that level of maturity, and it’ll take a similar period for these to mature. But this is just a hunch; I don’t really have any way to confirm or disprove this hypothesis.
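To make that concrete, the kind of heuristic mature crawlers eventually adopted can be sketched as a multiplicative back-off on the recrawl interval. The multipliers and bounds below are invented for illustration; real crawlers use far more signals than a content hash and a response time.

import hashlib

def next_recrawl_interval(prev_interval_s: float, old_body: bytes, new_body: bytes,
                          response_time_s: float) -> float:
    # Recrawl sooner when the page changed, back off when it is stable,
    # and back off further when the server looks loaded.
    changed = hashlib.sha256(old_body).digest() != hashlib.sha256(new_body).digest()
    interval = prev_interval_s * (0.5 if changed else 2.0)
    if response_time_s > 2.0:
        interval *= 2.0
    return min(max(interval, 3600.0), 30 * 86400.0)   # clamp between 1 hour and 30 days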
makes me want a distributed system for hosting git repos
sucks that LLMs make the tubes spammier. but it’s also a bit weird that we have a distributed vcs, and everyone uses it via 3 centralized websites
(myself included. i know iroh and pkarr exist and would help solve this, yet i haven’t invented distributed github yet)
We get this a lot at $DAY_JOB, too … and our content is creative commons licensed! I guess they’re so used to scraping copyrighted content that it’s not worth looking for nicer ways of handling the legal stuff.
A plague on all the scrapers’ houses. I’m still hoping that a court will judge LLM models to be derivative works of their training data, but that hope is waning. And even then, such a judgement would have no impact on scrapers operating from rogue states like China.
This isn’t loading for me.
Is it that sourcehut does not want LLM crawlers at all, or is it just that the LLM crawlers are overly aggressive?
No LLM crawlers at all. Here’s the start of the sourcehut robots.txt where it’s explicitly stated.
# Our policy
#
# Allowed:
# - Search engine indexers
# - Archival services (e.g. IA)
#
# Disallowed:
# - Marketing or SEO crawlers
# - Anything used to feed a machine learning model
# - Bots which are too agressive by default. This is subjective, if you annoy
# our sysadmins you'll be blocked.
Yes. Both.
Nepenthes is a system designed to trap and distract/disable crawlers by feeding them garbage. If it’s cheaper than them crawling your actual site then it can help keep them from realizing they’re being blocked.
That’s the trick, though, isn’t it? Until relatively recently I did such a thing, and found they’d simply keep crawling the garbage anyway. Now I just drop their packets on the ground. It cost me more to fool them than to filter them, and I suspect that will be true for many sites.
I suppose you’d have to do testing.
If you serve solely static content, dropping is the only effective strategy.
If you serve dynamic content that’s poorly cached or hard to cache and your problem is CPU time, then it depends on what happens when you drop their packets. If they immediately come back on a new IP as I know many do, then it probably makes sense to serve them cheaper garbage since effectively blocking them entirely is impractical.
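Whether garbage beats dropping really does come down to what a garbage response costs you. Nepenthes does much more than this, but the cheapest possible version of "serve them garbage" is to pre-render one static blob at startup and replay it to suspected bots. Everything below (the word list, the bot check being passed in) is a made-up sketch, not Nepenthes itself.

import random

# Pre-generate one pile of filler text once, at startup, so serving it to a
# suspected bot costs no database queries and no template rendering.
_WORDS = ["commit", "merge", "refactor", "patch", "branch", "release", "fix"]
GARBAGE_PAGE = " ".join(random.choice(_WORDS) for _ in range(2000)).encode()

def respond(is_suspected_bot: bool, render_real_page):
    # Suspected bots get the pre-rendered filler; everyone else gets the real page.
    if is_suspected_bot:
        return GARBAGE_PAGE
    return render_real_page()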
Same for me. There are archived copies available. https://web.archive.org/web/20250316000305/https://zadzmo.org/code/nepenthes/
We had spammers and crackers DDoSing our network connections. We had cryptominers burning CPU cycles at an insane rate.
Now, you can have it all in one package thanks to bullshit generators: welcome to the future!
I simply locked up my cgit instance behind http auth, and moved the URLs. Anything hitting the old URLs has its IP banned for at least a year.
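For anyone wanting to copy that trick, the moved-URL tripwire boils down to something like the following. The prefix, the ban duration, and the in-memory ban store are all illustrative assumptions rather than that commenter’s actual setup.

import time

OLD_URL_PREFIXES = ("/cgit/",)     # example: the paths the instance was moved away from
BAN_SECONDS = 365 * 86400          # "at least a year"

_banned_until = {}                 # IP -> unix time the ban expires

def allow_request(ip: str, path: str) -> bool:
    # Returns False if the request should be rejected (banned IP or tripwire hit).
    now = time.time()
    if _banned_until.get(ip, 0.0) > now:
        return False
    if any(path.startswith(prefix) for prefix in OLD_URL_PREFIXES):
        _banned_until[ip] = now + BAN_SECONDS
        return False
    return True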
I run a comment aggregation service which was getting hammered pretty hard by these crawlers. In the end I decided to serve all public links from a cache and paywall new aggregation requests. It’s a shame, because the way the service worked before this, with each new person being able to update the cache for the next person, was really nice.
I don’t see a more practical solution, especially for people building on a smaller/independent scale, than this right now. Hopefully someone smarter than me will come up with something in the future.
I can’t help but see similarities to https://www.schneier.com/blog/archives/2023/12/ai-and-mass-spying.html
Sadly, I think it’s time for me to move on from Sourcehut. What are others who’d rather not use Big Git doing?