Histomat of F/OSS: We should reclaim LLMs, not reject them
41 points by hongminhee
The article claims there's a gap in the license. Not so: no LLM complies with any open source license requiring attribution (which is every license except public-domain-equivalent ones like 0BSD or the CC0 fallback license). If this mattered, they'd be sunk. Rather, they claim a "fair use" exemption from copyright protection. Whether this is a valid claim is not yet settled; there are some court cases in progress. There is also one alternative for them: to obtain a different license. Forges like GitHub have terms of service that require you to grant them a different license. Sometimes such license demands are outrageous. It's good to read them and be aware of them.
You cannot fix this with licensing. You will need to fix it elsewhere in the legal system.
Analogy: someone got into your house, so you think about how to upgrade the padlock on the door... but they never went in through the door, they went in the window.
In the USA, it is settled; see Google Books and Kadrey v. Meta. Note that fair use is an affirmative defense; the fair user admits infringement but gives a justification for why their infringement is permissible.
You cannot fix this with copyright. You will need to look elsewhere.
The analogy fails: Breaking and entering is not copyright infringement.
I agree with you that US courts have ruled on this, but I'd hesitate to declare the matter settled, given how different LLMs are from Google Books and what looks to me like increasing pressure for legislative changes.
Kadrey is about training LLMs. The only open question in court right now, Bartz v. Anthropic, concerns whether shadow libraries are an acceptable source of books; Google Books says that building a library from purchased second-hand books is acceptable.
I wouldn't read Kadrey so broadly; it was decided only at the district court level, which means it has minimal, if any, precedential value. Not enough, IMO, to call it "settled".
I think you're agreeing with the person you're replying to; in their analogy, a restrictive license is the padlock and fair use is the window. It doesn't matter what license you use if fair use applies.
Have you found anything bad in the GitHub terms?
Not explicitly bad, but the "License Grant to Us" section is, in my reading, rather vague; this part in particular:
You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time.
As that Service includes Copilot, "improving the Service over time" means training, so in this reading, using GitHub grants them a license to do so.
This is further touched upon in the next sentence, which contains:
This license includes the right to do things like [...] or otherwise analyze it on our servers
First of all, "things like" suggests the list is not exhaustive. Training may be included! And "otherwise analyze" can be interpreted in this vein.
Granted, they also say that the license does not grant GitHub the right to sell this content. But they're not selling it as-is, they're training on it, and that's Different™.
What withdrawal prevents is projects like Llama, Mistral, and the broader ecosystem of open source LLMs from having access to quality training data.
I see this as a hugely important point. By blocking new training you’re likely not impacting the big players who already gathered the data. Instead you’re cementing their power and preventing competition.
The problem is the big players, who continue to try and scrape data. Anthropic sends 3+ million requests per day my way, OpenAI does its 1.5M+, and Facebook and Amazon add their share too. Receipts here. Considering that the natural volume of requests I'd expect is at best in the thousands per day, having to fend off millions of scraping attempts a day is all kinds of silly.
If they have gathered the data they need, why are they usually among the top five of my visitors? Ever since I started collecting these metrics about a year ago, the top spots have been virtually unchanged. The order sometimes shifts, but the same big players are all there.
I don't give two cents about any "open source" LLMs either, if they engage in the same scraping that ignores /robots.txt and DDoSes small sites off the 'net. (And before anyone comes at me to serve static content: serving the requests isn't the problem. At ~1000 requests/sec, the TLS handshake eats up the two tiny shared vCPUs my server has; whether I serve static content, generated garbage, or abort the connection makes no difference. TLS at that point is the bottleneck.)
Some LLM scrapers do respect robots.txt, don't they? I've heard the claim that ClaudeBot does, though I don't know whether that's still true.
The comment you are responding to says Anthropic sends them 3+ million requests a day. That would imply ClaudeBot does not respect robots.txt, no?
Only if their robots.txt excludes ClaudeBot, which wasn't explicitly said, so I thought it worth confirming.
Some do (Google seems to). Most don't. Claude for example, does not. My /robots.txt reads:
User-Agent: *
Disallow: /
Any agent you see in my list with more than a handful of requests does not respect robots.txt. And before anyone points out that Perplexity on my list is way at the bottom, below a handful of requests: that's because Perplexity's crawler does not identify itself. They're probably a significant contributor to "Disguised bots".
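The check a well-behaved crawler is supposed to perform against that blanket robots.txt can be sketched with Python's standard `urllib.robotparser`; the crawler names and the example URL here are illustrative assumptions, not taken from any vendor's documentation:

```python
# Sketch: how a compliant crawler would consult the robots.txt quoted above
# before fetching anything. A blanket "Disallow: /" under "User-Agent: *"
# denies every path to every agent, named LLM crawlers included.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical agent names for illustration.
for agent in ("ClaudeBot", "GPTBot", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "https://example.org/any/page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")  # all blocked
```

Any crawler that fetches pages from a site publishing this robots.txt is, by definition, ignoring it; no per-agent `Disallow` lines are needed.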
If we look at the history of free and open source software through a materialist lens, we see a clear pattern: technological change creates new forms of exploitation, which in turn necessitate new forms of licensing to protect the commons.
Widen the lens. The goal of the GPL is not only to protect the commons from exploitation; rather, the goal is to destroy copyright, so that no software is held privately. Copylefted software is licensed in ways that erode the ability of corporations to use the commons, but the licenses are only effective because corporations respect copyright.
a training copyleft would require model weights for trained systems
Model weights are COMPILED OUTPUT BLOBS!! This wouldn't be copyleft. No, real copyleft should argue that the entire training dataset becomes a derivative work, and therefore the actual source of the model must be released under the same license.
and the broader ecosystem of open source LLMs
Please just stop calling them open source.
LLMs aren't inherently exploitative any more than compilers or web servers are
Compilers and web servers did not require sucking in massive amounts of data to be built, and they were not imposed on everyone in a rapid, authoritarian, top-down push by the big bosses, against all resistance, in order to rent-seek on basic things everyone has been doing with their brains.
Model weights are COMPILED OUTPUT BLOBS!!
fond memories of Meta marketing Llama as "open source" while, to download the weights, you had to agree to terms that included restrictions on commercial usage.
That’s because the term “open source” was developed to appease capitalism. “Open Source” can mean virtually anything, including “you can look at the source code, but only when squirrels are eating peanuts at your feet.”
(Yes, there is a definition developed by a body. But it can also be applied as just two words used together that can mean anything.)
"Histomat" seem to be a contraction of the term "historical materialism". Googling suggests it is not widespread in English.
I personally think that a license as proposed in this piece is an excellent idea, not because it would work as suggested, but in that it would instantly exclude any material so licensed from ingestion into LLM models.
but in that it would instantly exclude any material so licensed from ingestion into LLM models
The main claim of those building LLMs is that copyright isn't involved, and therefore any kind of license doesn't apply. This would be the case for this license as well.
You might have better luck sprinkling your material with anti-capitalist, anti-billionaire, anti-corporation content and hope that they filter it out.
Yes, "hope", because they might just not care, just like they don't care about anything else except their egos. See https://archive.is/NylzP to learn whose likeliness Grok is not allowed to mess with, nevermind free speech... Sadly I'm not sure if the approach to just insult the Elons of the world in your code would scale. There are just too many clowns you'd have to pick on to get your works excluded that way.
It's not even common in Marxist discourse. You hear diamat sometimes but this is the first time I am seeing histomat.
Edit: I should have mentioned that I meant English Marxist discourse. Seems like the author is from RoK and has some China-related interests so maybe the term is more popular there.
It was fairly common ~20 years ago when I was reading English (mostly American) secondary criticism of literature, philosophy, and history, or at least it seemed so to me. I have not followed the discourse closely in the last decade or so, though.
Related discussions continue here: https://come-from.mad-scientist.club/@algernon/statuses/01KF2TRVFG69ECNQE15Y8FCHYT.
I've been thinking about removing all open source code I have ever written from the Internet, and re-uploading it exclusively under licenses that prohibit mixing it with code generated by a statistical model
-- a thread
LLM advocates and their enablers will hide the fact that they are posting stolen code (they brag about this; their most widely effective argument for allowing LLM code in OSS is "if you tell us we can't, we'll do it and not tell you"), and attempts to track which open source projects use stolen code will be harassed off the Internet.
Good thread.
This is a response to the submission at https://lobste.rs/s/z1re5b/on_floss_training_llms. There's not much discussion there but maybe this submission should be merged anyway? @puschcx
Author of the other article here: I agree, the two are essentially the same discussion. Wouldn't mind them merged.