Bro, ban me at the IP level if you don't like me
105 points by classichasclass
Yes, I’ve seen this one in our logs. Quite obnoxious, but at least it identifies itself as a bot and, at least in our case (cgit host), does not generate much traffic. The bulk of our traffic comes from bots that pretend to be real browsers and that use a large number of IP addresses (mostly from Brazil and Asia in our case).
I’ve been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:
As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.
Most of them don’t bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don’t generate much traffic).
The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1Kb/s and thought surely most of the bots would not be willing to wait 10-20s to download the page. Surprise: they didn’t care. They just waited and kept coming back.
BTW, they also used keep-alive connections if offered. So another thing I did was disable keep-alive for the /cgit/ locations; without that, enough bots would routinely hog all the available connections.
My current solution is to deny requests for all URLs containing id= unless they also contain the notbot parameter in the query string (which I suggest legitimate users add, via the custom error message for 403). I also currently only do this if the referrer is not present, but I may have to change that if the bots adapt. Overall, this helped with the load and freed up connections for legitimate users, but the bots didn’t go away. They still request, get 403, but keep coming back.
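For illustration, the notbot gate could be sketched as a tiny request filter in Python (a hypothetical helper; the commenter’s actual setup is presumably a web-server config rule, and only the id=/notbot parameter names come from the comment above):

```python
from urllib.parse import parse_qs, urlparse

def should_block(url, referrer=None):
    """Sketch of the rule above: deny URLs carrying an id= parameter
    unless the query string also has 'notbot', and only apply the
    rule when no referrer was sent."""
    qs = parse_qs(urlparse(url).query)
    if "id" not in qs:
        return False           # the rule only targets id= URLs
    if referrer:
        return False           # currently only applied when referrer is absent
    return "notbot" not in qs  # legitimate users add notbot per the 403 page
```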
My conclusion from this experience is that you really only have two options: either do something ad hoc and very specific to your site (like the notbot query-string parameter) that whoever runs the bots won’t bother adapting to, or employ someone with enough resources (like Cloudflare) to fight them for you. Using some “standard” solution (rate limiting, Anubis, etc.) is not going to work – they have enough resources to eat the cost and/or adapt.
FWIW, most bots that pretend to be browsers utterly fail at doing so properly: most of them don’t send Sec-CH-UA while pretending to be Chrome or a derivative. Hardly any of them send sec-fetch-mode while pretending to be Safari or Firefox. A lot of them screw up the Gecko/ component while pretending to be Firefox, and so on and so forth (the relevant parts of how I’m doing this are documented here). I’ve been gently guiding them into an infinite maze of garbage for a couple of months now – they have not adapted yet.
There is a lot of identifying information in the headers, not all of it in the user agent.
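The cross-checks described here can be sketched as a first-pass heuristic (the header names are real; the function and its logic are made up for illustration and far cruder than the commenter’s actual filter):

```python
def looks_like_fake_browser(headers):
    """First-pass header consistency check: agents claiming Chrome (or a
    derivative) normally send Sec-CH-UA client hints; modern Firefox and
    Safari normally send Sec-Fetch-Mode. Bots that only fake the
    User-Agent string routinely omit these companion headers."""
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "")
    if "Chrome/" in ua and "sec-ch-ua" not in h:
        return True   # claims a Chromium engine, sends no client hints
    if ("Firefox/" in ua or "Safari/" in ua) and "sec-fetch-mode" not in h:
        return True   # claims Firefox/Safari, sends no fetch metadata
    return False
```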
FWIW, most bots that pretend to be browsers utterly fail at doing so properly
How do you know this is not the toupee fallacy at work?
How do you know this is not the toupee fallacy at work?
Because I researched how the major browsers construct their user agents, and what other headers they send, and how they normally behave. And the bots are very different.
When the same IP requests only HTML resources, 50 in a row, with exactly 2 second delays, that’s unlikely to be a real browser, when the HTML page has images, CSS and JavaScript too. When the same IP requests two resources from two hosts, within the same second, with wildly different user agents (one pretending to be MSIE 3.0 on PPC, the other pretending to be Firefox 3.3.6 on i686 Linux), that’s unlikely to be a human. When these patterns correlate with the badly constructed user agent and other headers, then yeah, I’m pretty darn sure these aren’t false positives.
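The “exactly 2 second delays” pattern is mechanical to detect; a toy version (hypothetical function, tolerance picked arbitrarily):

```python
def robotic_timing(timestamps, tolerance=0.5):
    """Flag request timestamps (seconds) whose inter-request gaps are
    suspiciously uniform. Humans clicking around produce far more
    jitter than a scripted crawler on a fixed delay."""
    if len(timestamps) < 5:
        return False  # too few samples to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return max(gaps) - min(gaps) <= tolerance
```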
Also, some are very obvious. Have a look at this little browser quiz I made a week or so ago! It highlights some of the crazy user agents, and some of the ways the bots fail at their pretense. Like, look at this one:
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_5_9; rv:1.9.2.20) Gecko/4804-07-19 08:40:42.679886 Firefox/3.6.7
That’s still the toupee fallacy though, isn’t it? You’ve seen bots be bad at pretending to be browsers and concluded “most bots that pretend to be browsers utterly fail at doing so properly”. But you can’t conclude it’s most of them based only on examples where they’re not doing a good job. It’s not an issue of false positives.
Toupee fallacy, per wiktionary
A form of selection bias in which a thing whose quality is measured in terms of being difficult to detect is wrongly judged to be of poor quality in general, caused by the fact that most people only notice poor quality instances of it.
Let me rephrase it for you:
Out of 10 million requests, we have about 3.5 million where the agent might be pretending. About 200k of those were various fedi software & expected tools that don’t try to pretend to be browsers, so we’re left with ~3.3 million. Out of those 3.3 million, about 2.3 million identified as Chrome, yet only about 22k sent a sec-ch-ua header that included Chromium (and before anyone tries to play the “But Safari has a Chrome/ component too!” card – I know, Safari is not included in the above numbers). So the vast majority of agents that told me they’re Chrome really weren’t. Similar patterns apply to the other major browsers.
When there are 10 million requests and ~97% of them are provably garbage, there really isn’t much to conclude other than that the bots utterly suck at pretending to be real browsers. Not because I only looked at the bad requests, but because I looked at them all.
My assertion that they fail at pretense is not based on only bad examples. It is based on all of my logs, all 10 million a day of them, over the span of 6+ months. The detection isn’t even difficult: for Chrome, it boils down to checking whether sec-ch-ua contains Chromium. 99% of the requests I receive that say they’re Chrome don’t have that header, and thus fail at pretense. For Firefox, checking whether Gecko/ matches the version of Firefox/, or whether it is a static, hard-coded date, similarly identifies many of the pretenders. It really isn’t hard to spot them.
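The Gecko/ check in particular is nearly a one-liner: in genuine current Firefox user agents the desktop Gecko token is the frozen date 20100101, while Android builds repeat the Firefox version. A sketch (hypothetical function; very old Firefox builds used real build dates, so this targets modern UAs):

```python
import re

def gecko_token_consistent(ua):
    """True when the Gecko/ token is plausible for the claimed Firefox:
    either the frozen desktop date 20100101 or the same number as the
    Firefox/ version (Android builds). Bots stamping timestamps into
    Gecko/ fail this immediately."""
    gecko = re.search(r"Gecko/(\S+)", ua)
    firefox = re.search(r"Firefox/([\d.]+)", ua)
    if not gecko or not firefox:
        return False
    return gecko.group(1) in ("20100101", firefox.group(1))
```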
I do notice good pretenders too, mind you. I know of a handful who got through my defenses, but that’s like 0.01% of my daily visitors. I know how to catch those too, but I’m not going to bother over 0.01%. I will, if/when they reach 1%.
For the purpose of filtering out the bulk of the bot traffic and saving resources, this doesn’t matter.
But it does if falling for the toupee fallacy means you’re not filtering out the bulk of the bot traffic.
The toupee fallacy isn’t about false positives, it’s about false negatives. How do you know there aren’t bots that you’re not able to distinguish from browsers operated by meat agents?
I looked at the logs.
Meat agents will not enter the maze at URLs that only ever existed within the maze 2 months ago.
Meat agents will not spend 50 requests in the maze, requesting at a very consistent 1 request every 5 seconds, only requesting HTML, while there’s CSS and JS and images involved too, and then never come back.
Meat agents will not go to great lengths to disable sending headers their browser usually does.
Meat agents will not change their user agent for every request (well, some do, but that usually involves changing the user agent only, and will not change sec-ch-ua and similar headers, so these primitive UA randomizers are easy to spot too – I’ve seen maybe a handful of them in the past few months).
There might be cases where I falsely identify a meat agent as a bot. It happened before. But that doesn’t diminish the fact that 90+% of them are still easily identifiable bots that aren’t good at pretending to be browsers.
that’s unlikely to be a real browser, when the HTML page has images, CSS and JavaScript too.
SPAs win again ;)
From my logs: some bots hard-code the Sec-Fetch headers in a way that makes all requests use their override. It’s highly unlikely someone decided to load the CSS of a page by navigating to it, from Google :) I could probably confidently block most of them, if I set up something to actually read the logs from Caddy and block based on it :P
Quite obnoxious, but at least it identifies itself as a bot and, at least in our case (cgit host), does not generate much traffic.
True, but it ignores robots.txt so I banned it.
I’m by no means a networking expert: what is the cost of these kinds of filters on the CPU when handling tons of requests, vs. letting them in? If I filter by user agent, filter the referrer URL, throttle bandwidth, etc., is there a point where these heuristics have negative returns?
Depends on how cleverly you do it. I have a 3k LoC filter I use to fend off the crawlers, and it happily chugs along at ~10-15% CPU usage on my tiny 2-core VPS (for comparison, my reverse proxy is at ~20-30% usually). In short: you can happily do filtering on a Raspberry Pi, and still have CPU left for other tasks. Compared to what it would cost to let them through… well, it takes less CPU time on my VPS to decide what to do about a request and serve it garbage than it would take to read a static file from SSD and serve it. Serving from cache would win, of course, but my VPS doesn’t have enough RAM to hold all files in cache, so…
The remaining question is bandwidth: I serve randomly generated garbage to the bots (like this). The average page size there is ~2.5-3KiB. While most of the sites I host are static sites, their average page size is usually 2-3 times larger than that, and dynamic sites like my forge, or fedi instance are at least an order of magnitude larger. So serving a few KiB of garbage to unwanted visitors is saving me a ton of bandwidth.
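A generator along those lines fits in a few lines; this is an illustrative sketch (not the linked generator) that seeds the RNG from the request path so each URL always returns the same small page:

```python
import random
import string

def garbage_page(seed, words=400):
    """Deterministic garbage: the same seed (e.g. the request path)
    always yields the same ~2-3 KiB page of nonsense HTML."""
    rng = random.Random(seed)
    def word():
        return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 10)))
    body = " ".join(word() for _ in range(words))
    return f"<html><body><p>{body}</p></body></html>"
```

Keeping the output deterministic per URL means repeat crawls cost nothing extra and look like a stable site.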
I receive around 10 million hits / day - see my daily stats bot on fedi if you want some numbers (and Grafana dashboard snapshots). If I were to let the crawlers through, I’d exhaust the 20TiB monthly bandwidth I have on the €5 VPS that fronts all this. With the filters, I usually end the month at 0.5TiB. I’d like to think that is significant.
Very insightful, thanks. The AI bots aren’t sophisticated enough to detect random garbage? (or they maybe don’t care)
I don’t know. It doesn’t matter much, either, because they’re not getting anything else. The reason I serve them randomized garbage is because I found that it keeps them slightly better behaved. When I tried serving them 403, they came back in disguise, and increased their crawling rate. When I served them static content - same thing. So I started serving them random garbage, and they’ve been less of a problem since.
In many applications the limiting factor in “expensive” endpoints is storage access, not CPU. All of the stuff you mentioned can be done with CPU and in-memory resources.
Almost all of these filters are doable (in Linux) at the netfilter level, which is well-optimised due to the sheer volume of traffic it sees.
I went looking for an RBL that covered this sort of thing – just hosted ranges, not residential – and came up empty.
The closest thing I know of is Spamhaus DROP but that’s only the worst of the worst so it’s probably not going to include scrapers.
RBLs are mostly used for email. You could probably just register as a webmaster on AbuseIPDB and look up IP addresses, you get 3000 lookups per day.
From my logs:
    ;> _time:24h {service="caddy"} | uniq by(request.remote_ip) | count();
    {
      "count(*)": "2215797"
    }
That’s over 2 million unique IP addresses seen in the past 24 hours. 3k lookups / day is not gonna be nearly enough.
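For comparison, the same count can be pulled straight out of Caddy’s JSON access log (assuming the request.remote_ip field that recent Caddy versions emit; older versions logged remote_addr instead):

```python
import json

def unique_ips(log_lines):
    """Count distinct client IPs in a Caddy JSON access log."""
    seen = set()
    for line in log_lines:
        entry = json.loads(line)
        ip = entry.get("request", {}).get("remote_ip")
        if ip:
            seen.add(ip)
    return len(seen)
```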
There is a check-block endpoint that lets you look up the reputation for all addresses under a /24 for one lookup. Don’t know how many blocks that would be, but most likely a lot less.
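Estimating how many check-block lookups that would take is a matter of collapsing the observed addresses into their covering /24s, e.g.:

```python
import ipaddress

def distinct_slash24s(ips):
    """Collapse IPv4 addresses into their covering /24 networks; the
    result's size approximates the number of per-block lookups needed."""
    return {ipaddress.ip_network(f"{ip}/24", strict=False) for ip in ips}
```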
These are also cachable for at least two weeks, assuming you get a result.
Also, you are on a way different scale than most, I don’t even see 10000 requests per day, much less 10000 unique IPs. There is an enterprise option available.
There is a check-block endpoint that lets you look up the reputation for all addresses under a /24 for one lookup. Don’t know how many blocks that would be, but most likely a lot less.
Not really, no. Most of these are residential IPs from around the world. Last I counted, there were a couple of thousand ASNs involved (and that’s ASNs, which can span many /24s).
These are also cachable for at least two weeks, assuming you get a result.
That’s great and all, but most of the IPs here visit once for 50 requests and then disappear for months, so caching them for 2 weeks is useless.
Also, you are on a way different scale than most,
I wouldn’t be so sure. I’m not hosting anything big or important. Most of the requests I receive target my forge, where I host my personal projects mostly. All the things I front for are small, niche sites.
I’m not a business. I’m not a big enterprise. I self-host a dozen or so sites, mostly static pages of little interest, mostly for myself & the family.
There is an enterprise option available.
…or I can filter by easily identifiable properties that don’t rely on IPs, for free.
…or I can filter by easily identifiable properties that don’t rely on IPs, for free.
Yes, and that’s a better option, but this thread specifically started because of someone asking for an RBL :P
Also, you are on a way different scale than most, I don’t even see 10000 requests per day, much less 10000 unique IPs. There is an enterprise option available.
I run a small website that has been on the internet for several years and has some highly-branching links in its dynamic content that bots love to crawl. Yesterday I got 1.2 million requests from 700 thousand unique IPs. I suspect the total number of legitimate users is under 10,000.
I’ve been fighting with these exact IP blocks owned by Tencent, and most of them do not have a user agent that identifies itself as a bot. They were using random, and I mean really random, user agents: claiming to be anything from a Wii browser, to Safari on a Mac, to Lynx on Linux, all from the same IP in the span of minutes.
AWS has “bot control” paid rules for their WAF service which helps for some, but does not catch these at all.
The web’s “worse is better” nature here is working against it: without structural support for immutability and strong naming, we can’t effectively apply Content-Centric or Named-Data Networking techniques that let whole-Internet automatic caching take the load off upstream data sources.
Maybe not in the case of the original article, but many of the really awful hellscrapers are hitting git forges, which would be trivial to fetch and cache using git instead of a scraper. But the people running “AI” companies are either too stupid or too indifferent to do it the easy/cheap way with git, and they waste tons of resources (theirs and others) on scraping instead.
What’d be nice would be:
- a tool which can look up all networks belonging to some of these mega-corporations such as Amazon, Google, Tencent, worldhost.group and so on, and give us a nice, long list in CIDR notation,
- some way to rate-limit all connections from a group of those networks.
This way, I could say I want my web server to only serve 10 requests a second to all of Tencent, or I want my mail server to accept only 1 request a second from all of worldhost.group.
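As a sketch of the second half of that wish: a shared token bucket keyed on a group of CIDR blocks (the CIDR in the usage below is an arbitrary example, not a verified Tencent range; a production version would use nftables sets or a prefix trie rather than a linear scan):

```python
import ipaddress
import time

class GroupRateLimiter:
    """Limit an entire group of networks to `rate` requests/second
    collectively, via one shared token bucket."""
    def __init__(self, cidrs, rate):
        self.nets = [ipaddress.ip_network(c) for c in cidrs]
        self.rate = rate
        self.tokens = float(rate)
        self.last = time.monotonic()

    def allow(self, ip):
        if not any(ipaddress.ip_address(ip) in n for n in self.nets):
            return True  # not in the watched group: unrestricted
        now = time.monotonic()
        # refill the shared bucket, capped at `rate` tokens
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```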
That would work in the case of Tencent described in the article, but most “AI” companies use scams like Bright Data to work around IP level blocks; this allows their requests to appear to come from residential IPs because they’re routed thru malware embedded in random mobile apps.
So it wouldn’t really help much in the big picture.
Holy moly, you’re not kidding about Bright Data being completely unethical!
Thanks to the growing popularity of web scraping, CAPTCHA-solving solutions now use machine learning and artificial intelligence to identify and effectively bypass CAPTCHA challenges. One such solution is the Bright Data Web Unlocker … [which automatically] solves CAPTCHA challenges of any type.
https://brightdata.com/blog/brightdata-in-practice/how-to-bypass-captcha-using-web-unlocker
this tutorial will teach you how to web scrape without getting blocked by your target website by fully avoiding detection so that you can easily find your treasure on the internet … … A useful tool in your web scraping toolkit is … rotating proxies [which] mask your IP address, making your requests appear to be coming from different locations. … When web scraping, it’s crucial to … [mimic] a real user [to] effectively sidestep detection mechanisms and reduce the likelihood of getting blocked.
https://brightdata.com/blog/web-data/web-scraping-without-getting-blocked
What the actual fuck!
edit: I got nerd sniped a bit here, specifically related to Bright Data’s residential proxy service, which provides a way for paying customers to do web scraping thru a network of residential IP users (essentially a botnet) who, they claim, have ethically and explicitly opted-in to this service. It turns out residential proxies are done thru something called the BrightSDK which is essentially something that app developers can build in to their apps such that the users who install those apps will provide their bandwidth to Bright Data’s paying customers under various conditions. Bright Data makes a big show about how everyone who opts-in to this SDK does so knowingly and explicitly and after reading and accepting a very clear TOS that says their bandwidth will be used for other purposes and they can say “no” and it’s all very much above board. One of the successful case studies they highlighted on their website is this completely throw-away junk app called “Spot the Odd Emoji” on iOS (warning: probably do not click that link). I searched for and found and installed this app, and, surprise, there is absolutely no mention of, opt-in, opt-out, setting, or anything whatsoever, anywhere in the app, for anything even remotely related to this kind of stuff. Absolutely criminal behavior.
edit 2: At bright-sdk.com they highlight many apps including “Burger Please”, which I (against all good judgment) also installed and which, surprise surprise, definitely does not disclose, much less obtain consent for, the fact that the user’s IP address and/or bandwidth becomes forfeit to Bright Data as part of any of their proxy programs. Admittedly I might have missed BrightData in this deeply-entrenched list of their ~350+ advertising partners, but even if so, that wouldn’t matter, since nothing indicated any of those partners would literally co-opt my IP address and/or network traffic. Again, just absolutely, totally criminal.
edit 3: for posterity, the leadership team
Bright Data makes a big show about how everyone who opts-in to this SDK does so knowingly and explicitly and after reading and accepting a very clear TOS that says their bandwidth will be used for other purposes
My guess is that the Bright Data TOS requires the app developers who embed their malware to include opt-in as part of their TOS, but no one actually does it, and Bright Data is strongly disincentivized to enforce that part of their TOS.
But yeah, anyway everyone in leadership at this company belongs in jail.