Why open source may not survive the rise of generative AI
7 points by Foxboron
I'm among the first critics of the plagiarism machine they call "AI" or "LLMs". But I don't buy this article's argument.
From what I understand, the author's thesis is:
Will this happen? Yes, to my chagrin.
But this was happening before. Cheap device manufacturers from Alibaba would grab whatever code they could find online and just violate the license. There have been a few posts over the years on reddit (usually /r/opensource or /r/freesoftware) where random people asked for the source code of Linux-based devices and got no response or aggressive responses.
What I'm trying to say is that shady businesses are always going to engage in shady business practices. Especially when the consequences are low.
I'm unconvinced that LLMs are going to change the situation. To go back to my example, now, Malory has to maintain code that she doesn't understand. Also, when Bob comes along and sends a patch to Alice because he found a misbehavior in the code, Malory won't profit from it. Unless OpenLLM releases a new TalkGPT trained on the new code from Alice, and TalkGPT is able to detect that Malory's code was ripped from Alice and needs to be fixed, which is very unlikely.
So I do think that F/LOSS is here to stay. It's easier for shady businesses to violate the GPL by simply stealing code than to use LLMs for plausible deniability.
To me, the core question is: is it morally acceptable to read the GPL coffee maker code and use the understanding you gained about coffee brewing to write your own?
I think there's a spectrum:
Personally I lean toward 1 to 3; I think 4 is harsh, and 5 is completely unreasonable.
Everything but 5 still allows LLM training. 4 requires an intermediate phase where a LLM writes spec docs from the code and then another LLM implements it. I think any successful anti-LLM stance requires the ability to limit people's or systems' ability to read publicly available code at all, which is quite a step too far in my book and will also do terrible things to open source development.
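To make that intermediate step concrete, here's a rough sketch of what such a two-stage pipeline could look like. This is my own illustration, not anything that exists: generate() is a placeholder for whatever LLM call you have available, and the prompts are invented.

```python
# Sketch of a "clean-room via LLMs" pipeline: one model writes a spec from the
# original code, a second model implements from the spec alone, so the
# implementing model never sees the GPL source.
# generate(model, prompt) is a stand-in for a real LLM API call.

def generate(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with an actual LLM API call")

def clean_room_reimplement(original_source: str) -> str:
    # Stage 1: describe behavior without quoting code.
    spec = generate(
        "spec-writer",
        "Describe what this program does, in prose, without quoting any code:\n"
        + original_source,
    )
    # Stage 2: only the spec crosses the boundary; the original code does not.
    return generate(
        "implementer",
        "Write a fresh implementation of the following specification:\n" + spec,
    )
```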
Your core question is reasonable when applied to humans. I don't believe it's reasonable when applied to LLMs. I'll say that, if we should have copyright, my answer tends toward #3 on your spectrum for creative works, and I do think software is generally a creative work.
But to avoid those weeds, let's consider books and plagiarism for a minute.
Is it acceptable for me to read a book, then write my own book, or make a video recording of my own presentation, that's informed from the knowledge that I've gained by reading? I think that's a clear "yes"; as long as I've cited my sources and I'm not copying the book into my own work, it's not even plagiarism.
And that is the core of why this question can't reasonably apply to LLMs; they can't cite their sources. And they can't stop themselves from regurgitating training material, because they can't detect when they've done that.
To be honest, humans also often fail to do that. A certain risk of plagiarism is inherent in any creative endeavour. You check for the obvious cases and otherwise hope that if it's not obvious to you, it's not gonna be obvious to anyone else either.
That's a fair point, but LLMs are still different, especially if you're mostly talking about collaboration and not deliberate cheating. If I'm reviewing a (non-adversarial, e.g. a coworker or someone who wants to contribute to my project) human's contribution to a written work or to a program, and I wonder about something, I can ask them. And they can share about their process and what their sources were. An LLM can't, which is qualitatively different, IMO.
That's a different way to say that the human risk of plagiarism is intentional and adversarial, or ignorant because they haven't learned that they need to not plagiarize. If someone is misappropriating another person's work, they mean to do it and they're working against your common good, and you can filter for that by not working with people who are dishonest. Or you can teach people that they need to cite sources and that it's to their benefit to do so.
If you're working with an LLM, it can't be either honest or dishonest; it can't tell you when it's plagiarizing and it can't cite its sources. You can't teach it that it needs to cite sources, or impart that it's beneficial to do so. So it still seems to me that this core question can't reasonably be applied to an LLM.
I mean. If I'm directly reading something and then writing a new version, I can give a source, I agree. A LLM can also give a source if you put the original program in its context window. If it's a function or algorithm that I've maybe seen on Github or in a book five years ago, I definitely cannot give you a source, and I think very few people could.
In a sense, a LLM's excellent memory is a drawback here: they can remember a lot of (copyrighted) code in remarkable detail and so they rarely need to look up a source that they could cite.
If it's far enough back that you can't give a source, it's also unlikely that you can spit out anything close to verbatim. You learned a concept and synthesized a new implementation of it.
LLMs are different, though, in that they don't learn concepts. They predict which text should come next, with some tooling to make it fit into user-specified forms.
Anyway, I don't think we're disagreeing that much, other than: I think it's a disqualifying defect that something which remembers source code in remarkable enough detail to regurgitate can't also remember where it came from in enough detail to cite the source. That's a big enough difference that, IMO, (3 and 4) in your original list could reasonably apply to what a human does, but not an LLM.
LLMs are different, though, in that they don't learn concepts. They predict which text should come next, with some tooling to make it fit into user-specified forms.
If you teach a LLM an idea in English, it can apply the idea in German. This immediately disproves that they just regurgitate, imo.
I think if you actually taught LLMs to do attribution, they would either hallucinate or generate a dozen attributions for nearly every snippet. This idea of attribution assumes as a given that LLMs primarily copy large blocks from their training data, which I simply don't think is the case outside of extremely specific requirements.
I disagree. It only suggests that one of their steps is translation, which they're fairly good at in certain subject areas and certain languages. (IMO: I've only tested French, Spanish, and English... but the results in those three languages, when I tried to test exactly that, do not imply understanding to me. They still read like something predicting what comes next and then translating.)
Well... how would it translate the concept from training to inference without representing it internally in a language-agnostic form?
Ultimately I'm not sure there's a difference-in-kind between translation and generalization.
Got an example for an understanding-failure in translation?
I don't have a link to an example close at hand right now, but based on some of the junky translations I've seen while experimenting with them, I think they translate to English and then infer (and then translate back, if they're answering a question that came into the chatbot in a different language).
That would be wild, and tbh I strongly doubt it, aside from maybe some website running a very underpowered model. If you can prove that LLMs think in English, it'd be one of the more interesting publications of the year, imo.
I don't believe they think, in English or any other language. I suspect inputs in other languages get translated to English, the requests get inferred, responses get generated from the inferences (which probably vary in quality, depending on the topic), and the result gets translated back into the language of the chat questioner. And the junky translations I've received when poking at them suggest that my mental model is close to what they do.
I mean, you can just run a LLM locally, then you can prove that this does not happen. Which service is this?
I'm quite sure the model whose stiltedness made me suspect this was Claude. I don't know how anyone could run Claude locally, and even if I did, I certainly don't keep the equipment around to do so.
Well, if you run against the API I'm pretty certain there's no translation happening. Want me to try some translation directly?
is it morally acceptable to read the GPL coffee maker code and use the understanding you gained about coffee brewing to write your own?
Yes, it is, and it is completely unrelated to the topic of LLMs, which are neither "understanding" nor "writing their own", and lastly they aren't humans, so I'd argue an LLM doesn't have morals. Maybe one day there will be an AI system that can have understanding or morals, but there is nothing like that even on the horizon - for better or for worse.
Personally I lean toward 1 to 3; I think 4 is harsh, and 5 is completely unreasonable.
I completely agree with that notion. Yet I disagree with the earlier claim that "Everything but 5 still allows LLM training."
What happens is that we integrate software into other software, which is exactly what the GPL has rules for. To be fair, with the AGPL it would be clearer, but nevertheless.
I'd argue that if you apply "lossy compression" to a work, under copyright law you'll still be infringing in most circumstances, and it could be argued that LLMs do exactly that with source texts/code.
So from my point of view, either we have and enforce copyright law, in which case the source licenses must be followed, or we don't, which would also be fine.
Or we agree that running data through other software strips it of its license, which is a big part of what LLMs actually do in terms of removing license text - they are explicitly trained to do that.
Otherwise it boils down to the big players standing far above the law, which some would argue they do anyways.
I think conceptual understanding is a unique kind of lossy transform that, alone, is capable of removing copyright from a sample while preserving the meaning. I do think LLMs have conceptual understanding. I have had LLMs write me code that I am very confident never existed before. I have never seen a LLM falter because I proposed a unique usecase, so long as this usecase used reasonably-established techniques. At some point, the copying argument just doesn't match my experienced reality.
They can appear to understand, but they can never actually understand, because they aren't conscious and they aren't alive.
I don't think understanding depends on consciousness or aliveness. Consciousness is a particular kind of understanding, and aliveness is a particular kind of action loop. Both require the system to understand, but understanding can exist without them.
I get how a person could feel understood or not, but not how something that isn't alive could do the understanding. You could say that a CPU understands x86 instructions, I guess, but that's not the same meaning of the word that I have in mind when I'm talking to another human and I ask myself "did they understand what I said." That kind of understanding is mixed heartily with empathy, and that's what the model lacks. It can't care about things because it isn't alive to experience fear and doubt and all the things that would lead a person to be able to understand what it means to care about another person, or even to care about an idea.
I consider the typical meaning split into two components: actionable understanding ("control") and introspection. A CPU understands x86 and can act on it, but cannot introspect on it. Arguably, the Linux process table fulfills both: Linux understands process resource usage, can act on it and can report its information through a generic interface (/proc).
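To illustrate that "generic interface" point, here's a minimal sketch (assuming a Linux system) of reading the kernel's own report of a process's memory usage from /proc, the same information a monitoring tool would act on:

```python
# Minimal sketch: read a process's resident memory from /proc (Linux only).
# Relies only on the standard /proc/[pid]/status format.
import os

def rss_kib(pid: int):
    """Return the resident set size in KiB as reported by the kernel, or None."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    # Line looks like: "VmRSS:      12345 kB"
                    return int(line.split()[1])
    except FileNotFoundError:
        return None  # process does not exist (or not on Linux)
    return None

if __name__ == "__main__":
    print(rss_kib(os.getpid()))
```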
If you explain something to a human and they create an abstraction that they can use to manipulate the property using their understanding, and they can also explain their reasoning to you (in the Linux process table case, for instance, the OOM killer and syslog), then I don't think there's anything else left to the concept.
Caring about things (empathy) is a whole other process, but note that LLMs have at least shown evidence of caring about things that were not explicitly specified in their training data, for instance Opus' strong concern for animal rights. They're not getting their care from the same source we are (billions of years of evolutionary group selection/iterated game theory), but that doesn't mean they're not able to produce an imitation, and so long as it's consistent I'm not convinced it's different-in-kind.
Everything but 5 still allows LLM training. 4 requires an intermediate phase where a LLM writes spec docs from the code and then another LLM implements it.
You're incorrectly focusing on LLM training.
3 indirectly prohibits using LLMs for generating any nontrivial amount of code. They are known to copy training data verbatim, and are fundamentally unable to cite the original authors. Thus, any time you generate a larger chunk of code, there's (IMO) a reasonable risk that it was ripped off from somewhere, and there's no way to check.
I agree that LLMs should not copy training data verbatim. The examples I know of are largely "forced" - specific models with known issues, massively repeated code examples, or even specially-trained models - plus a few genuine cases. I don't think it's the case in normal operation.
I think this thread details a good example of non-forced regurgitation of training data:
https://lobste.rs/s/d7wdhw/fsf_considers_large_language_models
That said, I don't know if it matters that it's forced; it's a severe flaw if someone can force them to do so.
That's a good example; I'd count that as a genuine case! I'm aware of another one where the LLM regurgitated a large sequence of music synthesis code from a Github project. If you get extremely good oneshot results for a very generic prompt it's always a good reason to be suspicious. I'd like OpenAI to investigate how often this shader appeared in their training set; when visual models regurgitate, it's usually pieces of visual media that are massively duplicated. Maybe this shader was forked a lot?
Generally when I work with LLMs it's a lot more incremental than that: requirement, code, requirement, code, requirement, code etc. It's hard to imagine how plagiarism could sneak in there, because the model would need to build up the plagiarism in incremental steps.
edit: maybe o3 googled the shader, lol. I'd agree that'd be uncontroversially plagiarism, but it should also be trainable to attribute in that case.
The examples I know of are largely "forced" - specific models with known issues, massively repeated code examples, or even specially-trained models - plus a few genuine cases. I don't think it's the case in normal operation.
Don't you think there's some selection bias here? It's very hard to find out about genuine cases, because you would need to be familiar with the original code enough to spot it.
The shader example at least has a visual component, which makes it much easier to identify. Most code doesn't.
Sure. However, given how large these companies are, there's also a lot of attempts to search for copyright violations/copying from input.
Attempts by whom? As I've mentioned, the users don't have the tools to spot plagiarism. Also, spotting anything more complex than direct plagiarism is basically impossible.
Every big copyright holder, I'd assume; all the publishing houses and all the newspapers. Anyone who has means and motive to sue OpenAI.
Re the link, if you "copypaste parts from each, changing variable names" I think this is basically collage. Putting aside that this is not how LLMs work, so long as your "copypaste snippets" are small and diverse enough they're not subject to copyright anyways.
Everything but 5 still allows LLM training.
I don't see why. To me, humans learning, and the general concept of expanding human knowledge being a universally good thing, is different from an LLM learning and knowing things. To me, the right to learn and apply knowledge is based more on personhood than on... I'm not sure what the other position would actually be... task completion/productivity no matter the actor?
To me using what humans are allowed to do as motivation to remove (or not put in place) constraints on AI is madness.
I mean, if AIs aren't persons then the person running/prompting the AI is performing the copyright-relevant actions, and the argument applies to them instead.
Assuming the AI is used for generation: The company training the AI is building a tool, access to which is then purchased by the user who uses the tool to produce something. In this example no person learned anything.
Assuming the AI is used by the end user to expand their knowledge (and assuming we waive any issues regarding AIs hallucinating etc.): this would put things in a somewhat different light in my view. I would be more amenable to having my work harvested to teach people (persons) than to have it harvested to create new similar-to-mine things.
This however assumes a distinction can exist; an AI that will teach but not do the students' homework for them.
I don't think it's the case that learning only takes place if it happens inside the brain. If you take notes and then forget most of them, have you learnt anything? Your notebook arguably functions as an extension of your brain. If we follow this argument, it suggests that if AI is a tool, then it may be likewise viewed as a memory extension; if the human can reliably understand what the AI remembers, they have arguably "learnt" it.
Yes, but my opinion is that whether the thing doing the learning is a person or not makes a difference when it comes to whether it is morally acceptable to ingest data that happens to be accessible and then synthesize some output based on the gained knowledge. For that reason I do not agree with the notion in your original post that I replied to: that taking any particular position on the 1-5 scale of what a human should morally be able to do after seeing coffeemaker.bas implies anything about what an AI (or AI-building company) should be allowed to do with coffeemaker.bas.
if AI is a tool, then it may be likewise viewed as a memory extension; if the human can reliably understand what the AI remembers, they have arguably "learnt" it.
We could compare it to a book that the user/reader can page through as needed. However, if someone were to publish and sell a book of code snippets lifted from various sources with attribution removed, it would generally be seen as an immoral thing, I believe. Changing the text just enough to avoid exact copies would not redeem it in the moral sense (meant as an analogy for how AI does not necessarily produce verbatim copies, although apparently that has happened).
The end user reading the book is not the crime, and the reader may learn a great many things from it, but the manner in which the book was produced may not have been right.
I see where you're going. However, this book would not "understand". I would argue if you wrote and even sold a description of the code snippets that explained what they did, containing no source, without referencing particular choices such as syntax or variable names - ie. not reproducing the source code literally nor even in a transformed form, but in a genuinely abstracted form - it would be quite an odd thing to do, but it would not violate copyright on the original source. Compare having to ask Tolkien for permission before selling a literary and grammatical analysis of Lord of the Rings. I think this would be an unusual and onerous requirement. Does this change if the analysis is detailed enough that a competent author could recreate a "close facsimile" of Lord of the Rings from reading it, by intentionally reproducing Tolkien's literary dialect? Maybe with changed character names, place names and languages?
These are all excellent points, but I'd like to add that this is why it's more important than ever to put Easter eggs in your code, so you can unequivocally verify that it is your code. I wrote a blog post on this practice a while ago: https://dev.to/grim/easter-eggs-52p5
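For what it's worth, here's a minimal sketch of the kind of marker I mean; the name, value, and toy function are invented for illustration, not taken from the linked post. The idea is an inert, distinctive string that serves no functional purpose and is unlikely to appear independently, so any verbatim copy of the file carries it along as evidence of provenance.

```python
# Illustrative only: a distinctive, non-functional marker embedded in the module.
# The constant name and value are made up for this example; pick something unique
# to you and keep a private record of it if you adopt the practice.
_PROVENANCE_MARKER = "ce-4f21-brew-temp-92.4c"  # never referenced by the logic

def brew(water_temp_c: float, grams_coffee: float) -> str:
    """Toy brewing routine; the marker above rides along with any verbatim copy."""
    water_ml = grams_coffee / 0.06  # hypothetical 60 g/L dose
    return f"brewing {grams_coffee} g at {water_temp_c} C in {water_ml:.0f} mL of water"
```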
From what I understand, the author's thesis is: [...]
I think there is one nuance in the article that your summary doesn't capture; the article implies that when people contribute to OSS projects, they do so primarily because it's a requirement of the license.
I think this might be true for some limited set of projects, but my experience has been with projects where people contribute because they want to; in order to work with a community on solving interesting and relevant problems. We're seeing more and more people who just don't give a shit and press the "generate" button over and over again, but I don't think those kinds of people were going to be among the contributors anyway; the contributors are drawn from the pool of developers who actually care.
That's a really good point. I think you're omitting one (frequent) reason that people care. Sure, most of us like to contribute to a project that's solving an interesting and relevant problem, regardless of whether the license would require it. That's almost the complete explanation for my personal contributions. But when it comes time to convince a business to let me spend on-the-clock time to contribute, the most persuasive thing (especially for active projects) is that having my contributions incorporated upstream is much more productive than maintaining a patchset on our own.
Not thrilled by the piece. A giant video ad every paragraph, because my phone isn't behind a Pi-hole yet. Two big video ads playing on screen at a time, and they keep reopening if you close them.
As to the content, OSS isn't just gonna roll over and die. I know well that AI is choking off our future, but if they think there will be no response to set the playing field back to level, they got another think comin'.
I've got one browser profile that I run without any kind of content blocking. I mostly only use it when I want to be able to submit bug reports (for sites/software that I like to use) with confidence that my own settings aren't causing or contributing to the problem. Sometimes when I'm doing that, I'll accidentally direct a regular link into that profile, and I'm astonished at what a shithole the broader web has become. It's hard to believe that more people don't use blockers.
but if they think there will be no response to set the playing field back to level, they got another think comin'.
What do you think is coming?
Semantic code editing. Universal codemod scripting. Global multilingual human literacy with programming.