We need to start doing web blocking for non-technical reasons
39 points by carlana
How do you say "we should block for non-technical reasons" without giving a single example of a non-technical reason and how to block it?
And on the flip side, I think most people here probably block ads and spam. So we already do this...
Examples are in the earlier linked post: Crawler blocking is editorial
This is talking about servers blocking clients, not the other way around.
You're right, I didn't give any examples of what I was thinking of because the context was fresh in my mind. Part of what I'm thinking of here is things like high-volume or misbehaving feed fetchers (cf), high request volume in general even if your website can handle it without particularly caring, or software with 'anti-social' behavior such as frequent HEAD+GET requests.
(I'm the author of the linked-to article. In another blog entry I had a more expansive version of what I'd personally block, but I'm not advocating for that in this article, just for thinking about blocking things that are busy violating the implicit social consensus for how web agents should behave that keeps the web viable. LLM crawlers are the big violators today but they're far from the only badly behaving software out there.)
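For anyone curious what "frequent HEAD+GET requests" actually looks like from the server side, here's a rough sketch that counts clients pairing a HEAD with an immediate GET for the same URL. The log path and the combined log format are my assumptions, not anything from the article:

```python
# Rough sketch: count clients that issue a HEAD immediately followed by a GET
# for the same URL, a pattern some feed fetchers and crawlers show on every poll.
# Assumes an nginx/Apache "combined" format access log at ./access.log.
import re
from collections import Counter

LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*"')

last_head = {}       # ip -> path of that client's most recent HEAD request
paired = Counter()   # ip -> number of HEAD-then-GET pairs observed

with open("access.log") as f:
    for line in f:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, method, path = m.group("ip"), m.group("method"), m.group("path")
        if method == "HEAD":
            last_head[ip] = path
        elif method == "GET" and last_head.get(ip) == path:
            paired[ip] += 1
            last_head.pop(ip, None)

# Anything racking up hundreds of pairs a day is probably worth a closer look.
for ip, count in paired.most_common(20):
    print(f"{ip}\t{count} HEAD+GET pairs")
```

None of this says the behavior is malicious on its own; it just makes the pattern visible so you can decide whether it crosses your line.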
One simple example I saw recently is a “robots.txt canary”: put a disallow path in your robots.txt for the specific purpose of perma-blocking anything that tries to retrieve that path. You also have to link to that path somewhere that won’t be seen by a human, of course. And the definitions of “perma-” and “thing” are an interesting exercise.
You also have to link to that path somewhere that won’t be seen by a human, of course.
Surely I'm not the only person that looks at robots.txt sometimes to check out if there's anything interesting there.
You also have to link to that path somewhere that won’t be seen by a human, of course.
I would put good money on this not being required. Really evil bots probably deliberately scan things marked as disallowed in the robots.txt as a matter of course, even without links, on the off-chance that something tasty is there.
Well, it depends on what degree of bad behavior you're trying to block. If you're just looking for very bad behavior, then that would be enough, but I think the goal here is more a policy of "no littering" rather than "no breaking and entering".
The reason for linking to that path is to catch bots that don't even look at /robots.txt (which is most of them). "I told you this is off limits, you didn't even check the house rules, now gtfo".
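To make the canary concrete, here's a minimal sketch as WSGI middleware. The /do-not-fetch-this path, the blocklist file, and the 403 responses are placeholders I made up for illustration, not anything from the article:

```python
# Minimal sketch of a robots.txt canary as WSGI middleware.
# The path name, blocklist file, and responses are illustrative only.
BLOCKLIST_FILE = "blocked-ips.txt"
CANARY_PATH = "/do-not-fetch-this"   # also listed as Disallow: in robots.txt
ROBOTS_TXT = f"User-agent: *\nDisallow: {CANARY_PATH}\n".encode()

def load_blocklist():
    try:
        with open(BLOCKLIST_FILE) as f:
            return set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        return set()

def canary_middleware(app):
    blocked = load_blocklist()

    def wrapper(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        path = environ.get("PATH_INFO", "")
        if ip in blocked:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"blocked\n"]
        if path == "/robots.txt":
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [ROBOTS_TXT]
        if path == CANARY_PATH:
            # Whoever fetched this either ignored robots.txt entirely or
            # deliberately probed a disallowed path; remember them.
            blocked.add(ip)
            with open(BLOCKLIST_FILE, "a") as f:
                f.write(ip + "\n")
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"you were warned\n"]
        return app(environ, start_response)

    return wrapper

# Usage (hypothetical): application = canary_middleware(my_real_wsgi_app)
```

The hidden link from the earlier comment is then just an anchor to /do-not-fetch-this tucked somewhere in your templates where no human will find it, and "perma-" is however long you let blocked-ips.txt grow.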
To me these all still read like technical reasons, although they do fall short of "my server literally can't handle the requests". Mostly they read like that to me because there's a range of increasingly less technical reasons. For example, a number of U.S.-based news publishers geoblock all non-U.S. IPs in an attempt to reduce exposure to foreign laws.
Agree with the core idea of needing to socially enforce moral standards, just not sure we can do that without also blocking a significant number of legitimate users that are just privacy-aware (which manifests as bot-like behavior for a sufficiently naive server).
I tend to browse in private mode and block 3rd party scripts and cookies by default, and as a result, a non-trivial portion of the Cloudflare-fronted internet thinks I'm a bot. I'd very much like to avoid a future where you effectively need to authenticate (explicitly or through fingerprinting) just to access public content on the internet.
Without getting too far into social media territory, would this be blocking because someone disagrees with me politically? Would we start to include hate speech in this? With all these "I believe this, you believe that" brouhahas, I think this kind of proposal could actually be quite disastrous.
Plenty of US sites (mostly local news) block all traffic from the EU for vague GDPR reasons. It’s been like that for years.
That doesn't strike me as grotesquely unreasonable, honestly. I'm working on what's probably, if I'm honest, about 70% thought experiment and 30% maybe gonna be a product. It's going to be something adjacent to a social bookmarking site, if I like the shape it starts to take. It's going to start out with a focus on letting users who are local to each other share/discover each other's bookmarks. It may never grow into something where I can have customers; I'm scratching an itch that I feel for myself and the people I like to talk to, and I'm not sure it can be or needs to be turned into a product. But I think it could go that direction if I see the right reactions from people I know.
At least for my first trial balloons, I'm not planning to accept signups from the UK or the EU. I don't plan to block them from public site content, but both places have rules that are different and hard for me to learn to a degree where I can be confident I'm not breaking them. If it turns into something that people local to me like, and it starts to look like something that people in Europe would like, I'll certainly revisit that.
I absolutely expect that my handling of people's data will respect the intent of the GDPR; I won't collect more data than the site needs to fulfill the purpose for which people choose to use it. I'm not interested in sharing it with business partners. People will be able to download a dump of their data when they want to. And when they delete their accounts, they'll be really gone, delta some buffer timeframe against accidental deletions.
But not accepting data from people until I'm confident I can do so in a way that respects their local laws doesn't seem outlandish to me. It seems like a sensible way to build.
I was asking myself the same thing. The author beats around the bush. They need to give a few examples of what they're on about. They link to another of their posts here. Initially this seemed to confirm my suspicion that they're more worried about technical malfeasance, than they are about "bad thoughts". But then I read...
Are you building a free search site for a cause (and with a point of view) that I don't particularly like? Almost certainly blocked.
So yes, I think the author wants to use technical infrastructure to block bad thoughts.
wants to use technical infrastructure to block bad thoughts.
What thoughts? He's blocking robots from reading his thoughts on his site. It's not possible to block people with any particular views from visiting your site (I'll also note that you can access the site just fine over Tor). He's instead controlling which companies can access his site.
The post you've linked seems mostly focused on LLM scrapers. They have no regard for the rights of creators at all - genAI is known to output training data verbatim (random example), but usually stripped of any authorship information. In theory we have a mechanism to defend against that - copyright - but evidently it's not doing anything. Blocking LLM scrapers for being LLM scrapers is IMO perfectly reasonable.
I don't see how blocking LLM scrapers is policing anyone's thoughts. Nor how blocking a "brand intelligence" company is policing thoughts, etc, etc.
What thoughts?
I dunno. At first I was curious. But now I'm less worried about his specific thoughts that he'd block traffic over. I think the real question is: should you police the thoughts of visitors to your content? My gut reaction is no. The topic of policing traffic over technical issues such as "AI scrapers are killing my server" is technical, not philosophical. I think that is less controversial.
I really don't understand this idea. I thought I misunderstood it and was trying to argue with a strawman argument (that's why I deleted my other comment, it felt like it didn't add much to the discussion) but... that doesn't seem to be the case.
If I choose not to share text that I've written with people I don't like, how am I "policing" their thoughts? This isn't a commercial exchange, where anti-discrimination laws, for good reason, require me to provide services to anyone who's legally entitled to them. I'm talking about my personal website, with material that I wrote, and that I share with other people just because I like to. I can choose not to give them the things I've written, just like IRL I can choose not to speak to them (and, in fact, IRL I get to call the police if they don't leave me alone, too). Why would it be compulsory to give everyone free access to what I do in my free time, with no rights to recourse?
More caustically: why would anyone else get a say in my pf rule set!?

I'm the last person who's going to tell you what you should do with your server or content, in any sort of legal or compulsory sense. I think I was more interested in figuring out exactly what the author was suggesting than attacking or defending it. I do think it's better for people to talk and share, especially across lines they disagree about. And I think "policing" is a reasonable word for something that prevents that communication from happening. I'm sorry if I made it sound like I was attacking him or you.
As a server operator, I reserve the right to block traffic from any source for any reason. I don’t see why this should be controversial.
This opinion feels like the paradox of tolerance. Right now, we tolerate everything and that's bad, but if we tolerate nothing, that's also bad. The right answer seems to be a middle ground where we can somehow let through only good-faith actors.
Blocklists don’t need to be universal. Embrace the bubble and discard nonsensical moral imperatives that say you have to read three things that make you mad for every two that don’t.
I have no idea what you're trying to say.
I am saying people can, individually or in groups, create & apply their own blocklists. There does not need to be one universal blocklist applied to the single web service everybody is on that we must continually debate the merits of. Critics of this many-blocklists approach accuse the people applying it of choosing to live in a bubble.
Okay, but it still feels like a non sequitur to my comment, where I'm opining on what the article feels like and what it's describing. This technical problem of blocking bad-faith actors (bots, scrapers, DoSers, whatever) is a paradox of tolerance: the past advice was, as the author said, a technical decision to otherwise allow everything, but maybe the future needs to be something different for the sustainability of the web.
Something can be free in two ways: worthless or invaluable. There will be misery and suffering when the invaluable is treated as worthless.
Is it a good time to start charging money per web page visit? Fees get waived for personally known human visitors, starting by issuing certificates to friends and acquaintances and allowing them to "invite" their friends, and so on; those abusing their certificates get theirs revoked and must pay. There goes anonymity, if there was any. As for collecting money and issuing certificates, form a small collective, cooperative society, or equivalent.
Alternatively, there is much to learn from Wikipedia (and its sister projects) about how they have been dealing with abuse for so long.
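On the certificate half of that idea, here's a toy sketch of what an invite "certificate" and revocation check might look like. The HMAC scheme, the key, and the field layout are all assumptions for illustration, not a worked-out design:

```python
# Toy sketch of invite "certificates": an HMAC-signed token naming the holder
# and their inviter, plus a revocation set. Not a real design, just the shape.
import hmac, hashlib, secrets

SERVER_KEY = b"replace-with-a-real-secret"   # held by the cooperative
REVOKED = set()                              # token ids whose holders abused them

def issue_token(holder: str, inviter: str) -> str:
    token_id = secrets.token_hex(8)
    payload = f"{token_id}:{holder}:{inviter}"
    sig = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def fee_waived(token: str) -> bool:
    payload, _, sig = token.rpartition(":")
    expected = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False                         # forged or mangled token: pay up
    token_id = payload.split(":", 1)[0]
    return token_id not in REVOKED           # revoked certificates also pay

# Usage: the operator issues tokens to friends, who pass invitations on.
alice = issue_token("alice", "site-operator")
bob = issue_token("bob", "alice")
print(fee_waived(alice), fee_waived(bob))    # True True
REVOKED.add(bob.split(":", 1)[0])
print(fee_waived(bob))                       # False
```

Recording the inviter in the token is what would let abuse be traced back up the chain, and it's also exactly where anonymity goes away, as noted above.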