You cannot have our user's data
136 points by bezdomni
I have my own reasons to be critical of Drew and talking w/ him about issues we had while trying to self host sr.ht services only reinforced those, however:
It’s weird and unnerving to me that mine is the first top-level comment which is not explicitly condemning this post.
This post and sourcehut’s actions seem quite sensible to me. It may be the first time I agree with what Drew said.
I thought the real gem in this post was hidden in the footnote. (I do use the service, so of course I’m interested in the upcoming ToS changes, but those would not be on-topic by themselves.)
The footnote mentioned and linked to go-away. The variable and configurable gates there seem more appealing to me than Anubis. Though I hope no incident bumps it up my priority list to try either one out.
The README also neatly summarized my thinking on the specific issue of AI bots hammering archives and blame pages on software forges:
If AI is so smart, why not just git clone the repositories?
That’s something I’ve been wondering about as well. Put another way: if these crawlers are really operated and written by LLM companies, their poor performance is massive anti-marketing for Coding AIs.
Very strong beliefs very tightly held here by DeVault, shooting from the hip without any filter, and everyone catching strays: Cloudflare is a racketeer, rationalists are in a cult, etc. They might be that, I have no idea, but I’d focus on hating one thing at a time.
I fully support Sourcehut’s fight for a web without assholes, but man, he’s making it so easy for people to dislike him as a person.
If your service is constantly being hammered by “AI” bots, and that causes your service to act up and be unreliable for its actual users, then I expect there to be some emotion there.
DeVault is correct, people just don’t like hearing it. Especially from a human with human tone, and not some corpo double-speak.
Meh. I didn’t even notice it was him. As far as I’m concerned, every stray caught was deserved (except maybe Cloudflare, but even then, they’re too big and monopolistic for me to care about their feelings).
Why? Those you’ve mentioned seem spot on. Maybe we should actually talk some politics here.
No thanks. This post is pretty low quality to be honest. I think Anubis is great software and the issues associated with LLM scraping are very real; however, the rest of his “opinions” are certainly not on topic and/or even correct.
The orange site exists if you’d like politics along with your tech news. Please let’s keep politics as off topic as feasible here.
Let me preface this comment by saying that I don’t want to start a flame war, or unhinged reply fest: but what should count as off topic politics in this space?
A discussion on the relative merits of open source vs free software would most likely be welcome. As would, I think, a criticism of a company taking a previously open service private. Or a blog post discussing the privacy practices of a tech company that presents itself as a privacy-focused business but is selling its users’ data to advertisers, so long as there is some technical discussion on that point.
This post, yes, is a rant. But it does discuss some technical practices—namely the use of Anubis, and their robots.txt policy; and frankly the issue of LLM scrapers affects a lot of us at the moment. It’s interesting to see what others are doing to combat it both technically and politically.
I won’t argue that this is a very high quality post, but I will argue that it is on topic in this space. I’d like to see it generate some discussion here that doesn’t start with the typical criticism of the author.
@pushcx, if I’m going off topic please let me know and I’ll delete this comment.
but what should count as off topic politics in this space?
This is an impossible question to answer as anyone would like. Answering “What counts as political” is itself political. Further, blocking politics full stop is dangerous for a forum. In many places, the complaint / rule is used as a convenient thought-killer to suppress any discussion or person they just don’t like. A common mind trap we should avoid is “if I agree, it’s just opinion. If I disagree, it’s politics.”
I feel like we have other metrics that are useful here and in most other cases. Mostly keeping things in the spirit of curiosity, civility, etc. I could see arguments for or against the topicality of this article in that vein without attacking the author.
Not wanting to talk about politics is a political stance. Considering that you can discuss tech without discussing its political implications is a pretty strong political bias (or crude naïvety).
You cannot not do politics, especially when talking about technologies which have a worldwide impact and are changing people’s lives.
Note that I said “where feasible”. Those words were to imply that we cannot avoid politics, whatever the definition of that you hold. Most of us want a better world, but differ in our views of how we get there. My strong beliefs (yep I have them), may or may not align with yours. My point was though that encouraging such introspection and discussion will likely make this place less focused and a less pleasant place for many people.
Luckily we have guidance for what Lobsters is about, which I think helps keep this focus away from the specifics of political opinions.
Topicality: Lobsters is focused pretty narrowly on computing; tags like art don’t imply every piece of art is on-topic. Some rules of thumb for great stories to submit: Will this improve the reader’s next program? Will it deepen their understanding of their last program? Will it be more interesting in five or ten years?
I like to think this guidance is enough to keep us relatively close to the technology itself. Yes there are licensing and ethical issues which can fit within these vague guidelines. But I suspect if we stray too far into any extreme it will be a net negative.
I think then, that we may have a major disagreement on what constitutes an extreme as opposed to reasonable discourse.
My next question is very off topic, and extremely political. @pushcx again (just want you to have a heads up, apologies. Let me know if I overstep)
It’s out of curiosity, and not meant to antagonise, but just question where that point of view comes from; feel free to reply over a PM rather than in the comment instead.
Are you from a country where there’s a two party system? I’m not, and I don’t think that politics is something you can hope to keep out of a discussion. It’s almost inherent to every opinion one has that has anything to do with other people. I work with people who vote very differently to me and each other. We can still talk about it without it getting heated.
EDIT: I said I wasn’t going to instigate a flame war, but I think this comment is antithetical to that. I’ll keep it here for posterity, and do feel free to post a response to it (either here or PM to me), but I won’t continue beyond this, as it will get out of hand.
I’d say that the content and message could be the same without what mariusor mentioned. The post would actually be better. Not that I disagree (I even get why cloudflare is called a racketeer), but every such thing dilutes the message and distracts the reader, while offering easy ways to criticize his post without talking about its content at all.
This seemed pretty straightforward. They’re planning an update to their policy, and this explains what and why in clear language.
I feel like I frequently read blog articles linked here in which morally gray entities catch strays. It’s just part of the scene. I suspect what people don’t like this time is the personal blog tone used for justifying a terms of service change. To me it feels down-to-earth, but it can also be taken as attention-seeking and unprofessional. I think the rant tag is about right.
I think that in an otherwise factual article, name calling without providing any context or citations, and seemingly for no apparent benefit other than venting, is nothing more than a show of pettiness and, like I said previously, detracts (and distracts) from the actual subject of the article.
Flagged; company TOS updates and reverse proxy configurations are not really good submissions to Lobsters.
Drew DeVault’s political rants are off-topic.
Eh? Lots of similar articles get posted here without anyone complaining. Is it off topic just because the author is Drew DeVault?
Some things that are off-topic here but popular on larger, similar sites: entrepreneurship, management, news about companies that employ a lot of programmers, investing, world events, anthropology, self-help, personal productivity systems, last-resort customer service requests via public shaming, “I wanted to see what this site’s amazing users think about this off-topic thing”, and defining the single morally correct economic and political system for the entire world when we can’t even settle tabs vs. spaces.
https://lobste.rs/about (emphasis mine)
Drew DeVault-related submissions just happen to be off-topic more often than submissions about e.g. APL or TypeScript.
I think each of these examples refers to a different kind of post than this one. Not that this couldn’t be off topic, but I don’t think that’s supported by what you bolded.
Given how bad the front page has gotten overall, with spam rules being largely ignored, just like the placeholder of the submission text area (and don’t even get me started on the comments section[1]), I think calling a rant posted under the rant tag off-topic, when there are so many postings about Anubis etc., feels like quite the stretch.
In other words: Just put the rant tag on ignore if you don’t like to read rants. :)
[1] Just to be clear this wasn’t about bashing lobste.rs, even though I’d find it nice if rule violations would not be largely ignored. What I wanted to say is that given all of that this seems to be a reasonable entry for the rant tag.
It didn’t have the rant tag when it was posted (I suggested that). Generally agree - lots of self-posting on the front page, lots of summaries in submission text, but it gets upvotes, so it’s what the people want.
You may … [have user] data for one or more of the following purposes … Search engine indexing … You may [not have user] data for … training machine learning models …
Assuming reasonable client access patterns (RPS, rate limits, etc.) I’m not really sure how these two things are different. Maybe said another way: it seems to me that a sufficiently advanced search engine is indistinguishable from an LLM, no?
Ahhh, I suppose one marked difference between them is whether the output is links to the original content or not. I haven’t spent a ton of time with LLMs but my understanding is that they don’t generally annotate their output with links to the original source data and if you ask they may actually hallucinate citations that don’t exist.
While at a technology level I can definitely see a grey area between the two, in practice there is generally a pretty clear distinction.
It does get a little blurry though when the same page is doing both. Google, for instance, has started putting ML-generated summaries at the top of some search results. Whether that would fall under “training”, or whether they’re just taking the content of the pages and asking an existing model for a summary, isn’t clear… hard to say.
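For anyone wondering how the indexing-vs-training split from the quoted terms could look at the robots.txt level, here is a minimal sketch using Python’s standard `urllib.robotparser`. The policy text, the crawler names `ExampleSearchBot` and `ExampleTrainingBot`, and the `forge.example` URL are all made up for illustration; this is not sourcehut’s actual robots.txt. Note that it only keys on the declared user agent, so it cannot tell what a crawler later does with the pages, and robots.txt is advisory in any case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy (an assumption for illustration only):
# one named search crawler is allowed, one named training crawler
# is refused, and every other user agent is refused by default.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A made-up blame page URL on a made-up forge.
url = "https://forge.example/~user/repo/blame/main/README.md"

search_allowed = parser.can_fetch("ExampleSearchBot", url)
training_allowed = parser.can_fetch("ExampleTrainingBot", url)
unknown_allowed = parser.can_fetch("SomeOtherBot", url)

print(search_allowed, training_allowed, unknown_allowed)
```

A well-behaved crawler would run a check like `can_fetch` before each request. The grey area discussed above stays unresolved here: a single crawler that both indexes and feeds a model would need separate user agents for this kind of rule to mean anything.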