A thought on JavaScript "proof of work" anti-scraper systems
20 points by runxiyu
Repeat after me everyone: the problem with these scrapers is not that they scrape for LLMs, it’s that they are ill-mannered to the point of being abusive. LLMs have nothing to do with it.
The purpose of PoW anti-scraper systems is not to get rid of the LLMs, it is to slow the scrapers down to a rate that is less abusive. I have no idea why someone would write an abusive scraper in the first place, but I expect that it comes down to economics. Most search engine spiders try pretty hard to be well-mannered, because the website is their product; they need the website owners to want their site to be spidered. Criminal scrapers try to be difficult to notice so they don’t get blocked as much, and/or have fewer resources available at any one time to spread over a wider area. Meanwhile, I expect somewhere in most LLM companies is a sub-department whose productivity is measured by “new training data acquired in TB”, and nobody making decisions for that department cares about long-term sustainability as much as they care about a quick buck.
Repeat after me everyone: the problem with these scrapers is not that they scrape for LLMs, it’s that they are ill-mannered to the point of being abusive. LLMs have nothing to do with it. The purpose of PoW anti-scraper systems is not to get rid of the LLMs, it is to slow the scrapers down to a rate that is less abusive.
Hans Moleman saying “I was saying Boo-urns!” I mean, speak for yourself. The fact that they are ill-mannered is the thing that makes blocking them urgent. But it’s perfectly fine to not want LLM scrapers hoovering up your site.
ofc! We already have ways to tell LLM scrapers not to hoover up your site: robots.txt and other ways to opt in and opt out of scraping. The problem that the PoW systems are trying to solve is that the scrapers don’t respect robots.txt. Why the scrapers don’t respect it is a significant but orthogonal issue.
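For reference, a minimal robots.txt along those lines might look like the block below (GPTBot and CCBot are the publicly documented user-agent tokens for OpenAI’s and Common Crawl’s bots, used here only as examples; Crawl-delay is a non-standard directive that only well-mannered crawlers honour):

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: *
    Crawl-delay: 10

Which is exactly the point: it only does anything against crawlers that choose to respect it.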
If you don’t want a scraper to hoover up your site and do some particular thing with it, like feed it into an LLM, then a PoW system won’t stop it. But it can slow it down so it doesn’t bog down your web server.
The point is that it’s not a captcha, so people need to stop treating it like one. If you want a captcha, make a captcha, not a PoW system. How do you make a captcha that can defeat an LLM? I dunno. But if captchas are “problems that are hard for computers but easy for humans”, then a proof-of-work is like, the exact opposite of that.
Oh, I agree that PoW systems are the wrong solution inasmuch as they exacerbate the problem (too much computing being done). They’re a wrong answer that kinda works right now, but won’t work for much longer. robots.txt is a right answer that doesn’t work. So now we’re flailing, looking for a real solution. Probably some kind of behavioral fingerprinting, maybe something as relatively simple as fail2ban.
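A rough sketch of the fail2ban approach, assuming an nginx access log in the default format and made-up jail/filter names (the thresholds are arbitrary): any IP that makes more than 300 requests in 60 seconds gets banned for an hour.

    # /etc/fail2ban/filter.d/http-flood.conf  (hypothetical filter)
    [Definition]
    failregex = ^<HOST> .* "(GET|POST|HEAD)
    ignoreregex =

    # /etc/fail2ban/jail.d/http-flood.local  (hypothetical jail)
    [http-flood]
    enabled  = true
    port     = http,https
    filter   = http-flood
    logpath  = /var/log/nginx/access.log
    findtime = 60
    maxretry = 300
    bantime  = 3600

That only catches scrapers hammering from a handful of IPs; the ones spread across huge proxy pools need something smarter, which is where the behavioral fingerprinting would come in.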
the problem with these scrapers is not that they scrape for LLMs, it’s that they are ill-mannered to the point of being abusive. LLMs have nothing to do with it.
The arrow is in the other direction: most LLMs are created by evil people for evil purposes, and the behaviour of their scrapers is yet another example of their turpitude.
The risk of a scraper replacing the JS-based implementation with a more optimized one leads to obfuscation and randomization of the JS challenge code. The next logical step from that is a JS challenge that uses really complex browser APIs to make it infeasible to use anything other than an actual full-fat browser engine.
However, I don’t see any of this surviving long term.
LLMs are already a computationally heavy and wasteful business, and the only barrier here is the difference in computation spend between individual users and scrapers.
It’s already feasible to do scraping with a headless fully-featured browser. We already have LLMs capable of driving browsers based only on graphical rendering of the page and mouse clicks.
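To illustrate how low that barrier already is, this is roughly all a scraper needs in order to run a full browser engine per page (a sketch using Playwright’s Python API; the URL is a placeholder):

    # pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # a real, full-fat browser engine
        page = browser.new_page()
        page.goto("https://example.org/some-page")  # placeholder URL
        page.wait_for_load_state("networkidle")     # lets any JS challenge run
        html = page.content()                       # rendered HTML, after scripts have run
        browser.close()

Per page this is far heavier than a plain HTTP fetch, which is exactly the cost difference PoW leans on, but it is hardly out of reach for operators already paying for GPU clusters.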
This is still bad IMO just because it locks out a lot of legitimate UAs, e.g.: lynx, w3m. I personally found out that most of these scrapers are originating from a few hosting providers (at least in my case) and a few specific ASNs, hence the logical step of banning entire ranges thereof.
inb4 “too harsh”: I am losing no legitimate users by banning OVH, Hetzner, M$ or Facebook, especially when talking about small community-led projects and not-for-profit websites.
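For the record, the range-ban approach can be as blunt as a few nginx deny rules (the CIDRs below are documentation placeholders; you would substitute the prefixes actually announced by the ASNs you want to drop):

    # included at http level, e.g. /etc/nginx/conf.d/block-ranges.conf (hypothetical file)
    deny 203.0.113.0/24;   # placeholder range
    deny 198.51.100.0/24;  # placeholder range
    allow all;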
This is still bad IMO just because it locks out a lot of legitimate UAs, e.g.: lynx, w3m.
Anubis does not. The default configuration only triggers on UAs that pretend to be Mozilla; lynx and w3m do not, unless you configure them to.
So then the scrapers can just set their user-agent to lynx and continue reading?
They could, but they don’t because they want to be indistinguishable from regular clients so they’re not blocked.
Your reply doesn’t make sense. I think einacio is correct: Anubis currently uses the UA, but also notes that this might not work in the future if scrapers adjust. I’m afraid that we will eventually have to face this problem.
I feel like you’ve fundamentally misunderstood the pressures involved here.
They’re not using a Mozilla UA by chance or anything, they’re doing it because the way they DDoS websites would be immediately blocked by large swathes of the Internet if their requests were in any way distinguishable from normal web traffic.
Could they change the UA to something else? Of course. But they have literally no pressure to do so. If they set their UA to e.g. lynx’s, you know what would happen? The “important” parts of the internet that they’re DDoSing would immediately block all requests from that UA, because out in the real world no one uses lynx. There would be absolutely zero negative outcome for those sites in blocking it.
The only reason their DDoS approach to scraping works at all, which is the whole reason these defensive approaches became necessary for so many websites, is because there is no reliable way of detecting specifically their traffic. As soon as they change that and it’s possible to reliably detect the traffic, then it’s also going to become significantly easier to block it.
I think kyrias is arguing that if the scrapers set their UA to (say) Lynx, they’d get blocked by a different set of non-Anubis-using sites which refuse anything that doesn’t look like a Mozilla UA (e.g. I suspect Cloudflare does this in some configurations). But the scrapers could do something like detect Anubis, and switch to lynx only in those cases.
Ah I see.
But the scrapers could do something like detect Anubis, and switch to lynx only in those cases.
Yeah, just retry with different UAs (from different IPs) until one works. Still more work, but much less work than the challenge/response of Anubis.
I’m probably not your legitimate user, but I did this myself, and I know others do as well: set up a VPN node on a Hetzner VPS to act as an exit node when I’m not in Germany. I don’t do that any more, but I used to. Still, it’s an edge case and you probably wouldn’t lose more than one or two users, if even that.
I just find it so eternally sad sad sad that the counter to these world-burning compute hungry stupid machines is a defense mechanism that is asking them to compute even more (even if the ask from proof-of-work is significantly dwarfed by the actual costs of those who are building LLMs).
I agree with the author’s point that this is an arms race and scrapers will just give in and spend more compute on these proof-of-work systems, but only up to some sort of upper bound. I disagree that scrapers will build recognizers for the proof-of-work system; I think they will reach a relatively stable equilibrium, because Anubis and its deployments (as an example) can’t move as quickly as the scrapers can.
This is one of the (many) reasons I am not a fan of proof-of-work “solutions”. For any crawler that doesn’t run JavaScript, it is just garbage. For any crawler that does, they have more resources to throw at it than the average visitor, so they’ll get to the content anyway. Big Tech-backed abusive tech will always have more resources than the average visitor, so this is a race towards a dystopia where PoW “protected” websites are only served to the most abusive crawlers, and no human ever sees them, because they’re not gonna wait minutes for a page to finally load.
PoW makes very little difference for scrapers, but it places a lot of burden on legit visitors.
On the other hand, I am quite optimistic, because there are ways to fight the crawlers, ways that we can win. Passive identification can already get rid of the vast majority of them, in my experience (I’m serving ~50 million requests weekly, of which only ~20k are legit; about 49 million are caught passively, and only about a million get a challenge, and even that challenge is as simple as requiring a static cookie; see the sketch below). A diverse set of tools, highly varied ways of combating them, is how we can make it expensive for them to adapt, not with PoW.
PoW is a simple thing to deploy in the short term, but it annoys the heck out of legit users, and can be trivially circumvented by throwing more money at the problem. Big Tech can do that.
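For the curious, the static-cookie style of challenge mentioned above can be this small (an nginx sketch; the cookie name, paths and upstream are made up, and a real config would sanitize the redirect target): anything that doesn’t carry a cookie across one redirect never reaches the backend, while a normal browser never notices.

    # at http level: a missing "pass" cookie marks the request as needing the challenge
    map $cookie_pass $needs_challenge {
        default 1;
        "1"     0;
    }

    server {
        listen 80;

        location = /challenge {
            # hand out the static cookie and bounce the client back where it came from
            add_header Set-Cookie "pass=1; Path=/; Max-Age=86400";
            return 302 $arg_to;
        }

        location / {
            if ($needs_challenge) {
                return 302 /challenge?to=$request_uri;
            }
            proxy_pass http://127.0.0.1:8080;  # placeholder upstream
        }
    }

The cost to a human is one invisible redirect, not seconds of hashing.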
Could the “work” being done be something that benefits humanity, such as protein folding? That way the “work” isn’t “wasted.”
IMO the forward-looking approach would be a much stronger separation between requests that are for “simple static pages” (where the whole cost of serving could be optimized down to one memcpy from an in-memory cache to a NIC (with TLS offload) on the CDN/“edge” server) and requests that invoke git blame and such. The latter category should, from the beginning, not be a free-for-all. Various measures could be used to protect those, more than just PoW and CAPTCHAs. How about authentication, even for guests: a one-time emailed token without permanent registration is definitely an underrated method.
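A sketch of the one-time emailed token idea (Flask chosen arbitrarily; send_email() is a stub, the in-memory dict stands in for a real store, and all names and URLs are made up):

    import secrets, time
    from flask import Flask, abort, make_response, redirect, request

    app = Flask(__name__)
    TOKENS = {}        # token -> expiry timestamp
    TTL = 15 * 60      # tokens are valid for 15 minutes

    def send_email(addr: str, link: str) -> None:
        print(f"would email {addr}: {link}")  # stub: plug a real mailer in here

    @app.post("/request-access")
    def request_access():
        token = secrets.token_urlsafe(32)
        TOKENS[token] = time.time() + TTL
        send_email(request.form["email"], f"https://example.org/verify?token={token}")
        return "Check your inbox."

    @app.get("/verify")
    def verify():
        if TOKENS.pop(request.args.get("token", ""), 0) < time.time():
            abort(403)  # unknown or expired token
        resp = make_response(redirect("/"))
        resp.set_cookie("guest", "1", max_age=24 * 3600)  # unlocks the expensive endpoints
        return resp

A crawler then has to burn a working mailbox per session, which is a very different economic proposition from grinding through a hash puzzle.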
I think comparing the costs borne by individual users vs. LLM companies is complicated in a way that leaves me without a very strong intuition.
In principle, it seems simple: to make the LLM scraper use 1 vCPU-second of computation, you also have to make each legitimate user pay the same cost[0].
But this ignores the different access patterns of legitimate users vs. scrapers. If legitimate users are accessing more pages than scrapers then you’ll impose higher costs on legitimate users, and lower costs on scrapers. On the other hand, some sites seem to be reporting more traffic from bots than from legitimate users. I don’t know how this breaks down on average.
[0] There’s also the aspect that users typically pay in time, LLM companies pay in literal dollars.
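To make that asymmetry concrete, a back-of-the-envelope calculation with made-up but plausible numbers (the per-page PoW cost, the traffic mix and the cloud price are all assumptions):

    # All numbers are assumptions, chosen only to illustrate the asymmetry.
    pow_cost_s = 2.0              # vCPU-seconds burned per page by the challenge
    price_per_vcpu_hour = 0.04    # rough cloud price, in dollars

    human_pages_per_visit = 10    # a reader browsing a handful of pages
    scraper_pages = 1_000_000     # one full crawl of a modest site

    human_cost_s = human_pages_per_visit * pow_cost_s
    scraper_cost_usd = scraper_pages * pow_cost_s / 3600 * price_per_vcpu_hour

    print(f"human:   {human_cost_s:.0f} seconds of waiting per visit")
    print(f"scraper: ${scraper_cost_usd:.2f} for the whole crawl")
    # human:   20 seconds of waiting per visit
    # scraper: $22.22 for the whole crawl

Under those assumptions the human pays in twenty seconds of staring at a spinner, while a million-page crawl costs the scraper roughly twenty dollars of compute.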
I can’t access the site because of the user agent block :(
I think using SHA-256 PoW makes no sense, but memory-hard might work better (see the sketch below), especially if the algorithm is designed to prevent it from working on GPUs.
I also built a fallback for non-JS browsers, where they can simply send an email to a given address and the HTTP response will be resumed the instant the email is received. Yes, this is 100% security through obscurity, but at least I can be confident that legitimate users aren’t being blocked.
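A sketch of what “memory-hard” could mean in practice, using scrypt from Python’s standard library purely as an illustration (the parameters, difficulty and protocol are assumptions, not any deployed scheme): the client must find a nonce whose scrypt hash of the server’s challenge has enough leading zero bits, and every hash attempt forces roughly 128·n·r bytes (~16 MiB here) of memory traffic, which is what makes cheap GPU farming less attractive.

    import hashlib, os, secrets

    # Illustrative parameters only; ~128 * N * R bytes = ~16 MiB of memory per hash.
    N, R, P = 2**14, 8, 1
    DIFFICULTY_BITS = 8  # expected ~2**8 = 256 hashes to find a valid nonce

    def new_challenge() -> bytes:
        return secrets.token_bytes(16)

    def meets_difficulty(digest: bytes) -> bool:
        # True if the digest starts with DIFFICULTY_BITS zero bits.
        return int.from_bytes(digest, "big") >> (len(digest) * 8 - DIFFICULTY_BITS) == 0

    def check(challenge: bytes, nonce: bytes) -> bool:
        # Server-side verification: a single memory-hard hash.
        digest = hashlib.scrypt(nonce, salt=challenge, n=N, r=R, p=P, dklen=32)
        return meets_difficulty(digest)

    def solve(challenge: bytes) -> bytes:
        # What the client-side JS/WASM would be doing: brute-force a nonce.
        while True:
            nonce = os.urandom(8)
            if check(challenge, nonce):
                return nonce

In a real deployment the solving loop would live in client-side JS or WebAssembly, with the server only ever running check().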
I think PoW is popular because it’s a “broad stroke” solution that discourages scrapers from trying to evade; you don’t have to debug with Wireshark or httpflow to figure out how to block. It’s sort of one-size-fits-all.
In general I think everyone has the PoW requirement set way too high; if scrapers aren’t solving it now, they won’t solve it if it’s 10x easier either.