The FSF considers large language models
22 points by runxiyu
[snip] asked whether the FSF is working on a new version of the GNU General Public License — a GPLv4 — that takes LLM-generated code into account.
Like, allows it? IANAL but I think LLMs trained on GPL code are in pretty clear violation of that license already - just as LLMs trained on MIT code, MirBSD code, etc.
All these licenses require attribution, which LLMs notably don't do, even if they output some training data verbatim (as the article itself acknowledges).
If the point isn't to allow LLM training, I don't really see what there is to gain by explicitly disallowing it. That just seems like a stance that the attribution clause isn't enough - whereas, if the FSF does care about this stuff, maybe they should just double down on this?
There is also, of course, the question of copyright infringements in code produced by LLMs, usually in the form of training data leaking into the model's output. Prompting an LLM for output "in the style of" some producer may be more likely to cause that to happen. [snip] suggested that LLM-generated code should be submitted with the prompt used to create it so that the potential for copyright infringement can be evaluated by others.
Key words: "more likely".
The example I've been using recently is this guy vibe coding an exact copy of someone's existing shader. I don't think the prompt had anything about it being "in the style of" anyone. Any time you generate and use a nontrivial amount of code with an LLM, there's a risk of plagiarizing an existing work. I suppose the question is how high of a risk do you accept - but for me the very obvious answer is "I don't accept any of this risk".
I think it's a bit weird for the FSF in particular to be taking a different stance here?
A member of the audience pointed out that the line between LLMs and assistive (accessibility) technology can be blurry, and that any outright ban of the former can end up blocking developers needing assistive technology, which nobody wants to do.
Oh hey, this shit again - no, it isn't, and the line is very clear. I have yet to see any assistive technology just spit out a copyrighted work without explicitly being asked for it. This is not what the discussion is about.
I've had a look at that shader example a couple of times now and I can't quite get to "an exact copy of someone's shader" from it.
The Reddit post doesn't actually share the code for both examples so how can we be certain one is a copy of the other - is it based on the fact that the rendered outputs look the same?
The person who said it was a copy even says "looks like o3-mini has modified the code a bit" - did it, or did it generate fresh code that has a similar visual effect?
I recognize those clouds! This is a GLSL shader by Jeff Symons. The original code is here: https://www.shadertoy.com/view/4tdSWr It looks like o3-mini has modified the code a bit, but it is basically the same.
If the code itself is a carbon copy match then yeah, that shows that the model regurgitated training data directly. Has that been demonstrated somewhere and I missed it?
I remember the John Carmack Quake 3 square root thing being shown as a prominent example of regurgitation a few years ago, but model vendors have put a bunch of effort into avoiding regurgitation since then. I'd be interested to see counter-examples showing that they failed at that.
The NYT lawsuit documents from nearly two years ago did include significant demonstrations of regurgitation of NYT content, but they had to prompt very deliberately to get that: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf page 30 onwards - it's also likely that OpenAI have taken additional steps since then to prevent that from happening.
The Reddit post doesn't actually share the code for both examples so how can we be certain one is a copy of the other - is it based on the fact that the rendered outputs look the same?
The original source is on shadertoy, and the source is visible in the video (albeit the formatting is very crappy - it uses literal \n instead of newlines for some reason?)
But, hey, you've made me actually curious about this - I was already sure the LLM had used that code (because come on, it looks identical), but I didn't know how similar the source was. I thus manually retyped it and made a crappy side-by-side comparison (a rough sketch of how to reproduce that kind of check is below). It's the same code, except some variables were changed, and a comment was removed[1].
[1] Incidentally, if I have some code and want to find some copies online, I tend to search for exact matches for comments on Github. Obviously the comment being removed is just a coincidence, but honestly I find it a bit funny in this context.
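For anyone who wants to repeat that kind of check, something like the following is enough. This is only a rough sketch: the filenames are made up, and you'd point it at the Shadertoy original and the retyped LLM output yourself.

```python
# Rough sketch of a side-by-side-ish comparison of two shader sources.
# The filenames are placeholders: save the Shadertoy original and the
# retyped LLM output locally under whatever names you like.
import difflib
import re


def normalize(src: str) -> list[str]:
    """Drop comments and collapse whitespace so formatting noise
    doesn't drown out the real differences (renamed variables, etc.)."""
    src = re.sub(r"//.*", "", src)                   # strip line comments
    src = re.sub(r"/\*.*?\*/", "", src, flags=re.S)  # strip block comments
    lines = [re.sub(r"\s+", " ", line).strip() for line in src.splitlines()]
    return [line for line in lines if line]


with open("original_shadertoy.glsl") as f, open("llm_output.glsl") as g:
    original, generated = normalize(f.read()), normalize(g.read())

# One number for "how similar are these overall" (1.0 = identical)...
ratio = difflib.SequenceMatcher(
    None, "\n".join(original), "\n".join(generated)
).ratio()
print(f"similarity after normalization: {ratio:.0%}")

# ...and a diff of whatever actually differs.
for line in difflib.unified_diff(original, generated, "original", "llm_output", lineterm=""):
    print(line)
```

It's not a real plagiarism detector, of course - you still need to already know which original to compare against - but it's enough to show that two files are the same program modulo renames and dropped comments.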
I remember the John Carmack Quake 3 square root thing being shown as a prominent example of regurgitation a few years ago, but model vendors have put a bunch of effort into avoiding regurgitation since then. I'd be interested to see counter-examples showing that they failed at that.
See, I think any mitigations are just fundamentally bullshit?
How, exactly, do you detect plagiarism? Sure, you can detect very blatant cases such as this one, where the code matches exactly. What if it's less clear?
If I want to create a shader, and go about it by finding a bunch of shaders from other people and copy-pasting parts from each, not making any substantial changes except matching up variable names etc - I think that's pretty clearly plagiarism[1]. However, such a shader would also probably be pretty unlikely to trigger any automated plagiarism checks - the matches with any particular prior work wouldn't be strong enough (there's a rough sketch of what I mean at the end of this comment).
I don't think it's too far-fetched to view (some) LLM-generated code this way. If they can output a single work from their training data verbatim - why can't they mix works? They're already smart enough to fix up variable names etc.
[1] unless I credit the authors, of course. Thankfully for my argument, I don't think LLMs can do that.
Also, to the best of my knowledge, this sort of thing is just what LLMs do. If you're training an LLM on a bunch of existing works, why is it at all surprising that their output is similar to what they were trained on?
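To make the "matches wouldn't be strong enough" bit concrete, here's a toy sketch of the kind of per-source check I imagine such a safeguard doing. Everything in it is made up for illustration: the "works" are one-liner stand-ins, and the 0.8 threshold is arbitrary.

```python
# Toy illustration: a candidate stitched together from several prior works
# scores low against *each* of them under a naive per-source similarity
# check, even though essentially none of it is original.
import re


def ngrams(code: str, n: int = 5) -> set[tuple[str, ...]]:
    """Crude tokenization, then the set of token n-grams."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two n-gram sets (0.0 disjoint, 1.0 identical)."""
    x, y = ngrams(a), ngrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0


# Stand-ins for three prior works (think: three different people's shaders).
sources = {
    "work_a": "float noise(vec2 p) { return fract(sin(dot(p, vec2(12.9898, 78.233))) * 43758.5453); }",
    "work_b": "vec3 sky(vec2 uv) { return mix(vec3(0.4, 0.7, 1.0), vec3(1.0), uv.y); }",
    "work_c": "float fbm(vec2 p) { float v = 0.0; for (int i = 0; i < 5; i++) { v += noise(p); p *= 2.0; } return v; }",
}

# The "new" work: nothing but the three sources pasted together.
candidate = "\n".join(sources.values())

THRESHOLD = 0.8  # arbitrary "this is plagiarism" cut-off
for name, src in sources.items():
    score = jaccard(candidate, src)
    print(f"{name}: {score:.2f} {'FLAGGED' if score >= THRESHOLD else 'passes'}")
```

A real checker would more likely measure how much of each source shows up in the candidate (containment rather than symmetric overlap), which would catch this crude paste-up - but once identifiers get renamed and statements reordered, exact n-gram matching against any single work degrades in the same way. That's a big part of why I don't buy that this is solvable by bolting a filter onto the output.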
The NYT lawsuit documents from nearly two years ago did include significant demonstrations of regurgitation of NYT content, but they had to prompt very deliberately to get that: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf page 30 onwards
Well, the (user) prompt here was "animated procedural clouds using simplex noise in a gradient blue sky". That isn't really specific.
Jumping to another sort of genAI, Midjourney generated a lot of obviously copyrighted content off generic prompts. That article also cites how someone accidentally generated a close duplicate of a Getty Images photo of Paul McCartney (the cited tweet has since been deleted :/).
The blatant violations such as it generating Batman or Elsa out of the blue can probably be fixed[2], but... We can recognize Batman - but can we recognize when a smaller artist is ripped off? Can a safeguard recognize that?
[2] ...not that I care about these film companies. I don't really care about copyright in general either, but I do think crediting artists is important. These cases of infringement are the most obvious, though, and show that genAI can rip off existing works - so it's a wider problem we should worry about.
OK yeah, your side-by-side comparison there has me convinced: this was absolutely regurgitation!
I know from conversations with AI lab employees that this is something they take seriously, to the point that they sometimes cancel potential model improvements if they are found to make the problem worse. Clearly whatever they are doing isn't as effective as they hope, though.
The image generators produce output that appears to me to be a clear infringement of copyright, as do the audio generators. There's a new lawsuit against Suno about that and the music industry lawyers have teeth.
to the point that they sometimes cancel potential model improvements if they are found to make the problem worse
Meh - I think they're just covering their asses. I don't think LLM plagiarism is a solvable problem. All they can do is paper over the most obvious/damning cases of plagiarism.
Even without safeguards the odds of finding out about plagiarism are negligibly low - ignoring "famous" code such as Q_rsqrt, you can only really spot it if you just happen to know the original work. With shaders there's at least a visual component, but I really don't know how you'd spot it for normal code.
If they somehow add safeguards that hide the most obvious matches against the training data... the problem still remains, but at this point it becomes almost impossible to spot and prove.
See also: how the Common Pile includes code even under copyleft licenses such as the GPL (I'd argue they shouldn't even be able to use code under more permissive licenses such as MIT). It's really hard for me to see this as anything but an attempt to evade lawsuits.
If you compare it to existing datasets, the only thing it really excludes is proprietary code? I'd hazard a guess that a substantial portion of the proprietary code online is leaked[1]? I don't think using that code is any less unethical than using GPL code, but it sure is much more likely to get you sued.
The random maintainer in Nebraska won't have the means to sue you, but Microsoft sure will (well, if they weren't all in on AI themselves). It's a bit insulting, to be honest.
[1] Most of the source code leaks I have heard about have ended up on Github, usually in many different repos. They don't seem to be moderating it all that much.
but Microsoft sure will (well, if they weren't all in on AI themselves).
If I had the money to hire the required compute, and more importantly an army of lawyers, I'd train an LLM on the leaked Microsoft Windows source code, and release the resultant model for free. Would be fun to watch how MS responded.
As someone that's been writing shaders - and in particular, cloud/noise shaders - for a long time, I find it incredibly difficult to believe that the LLM's output wasn't primarily influenced by the original. There are so many possible permutations of even a simple example like this: it's vanishingly unlikely that any one implementation conjured up by a human would be visually identical to another. This isn't bubble sort: by definition, it's extremely high entropy code.
just as LLMs trained on MIT code, MirBSD code, etc.
I read in the linked MirBSD license:
Output from an “AI” whose “training data” contains This work is to be considered a derivative thereof;
I don't think it's up to MirBSD to decide what is and is not a derivative work and to me simply being included as a tiny fraction of the huge training corpus of a model reads like license overreach. I think (at least in the EU) this is up to the courts to decide.
For the avoidance of doubt: Licensor explicitly reserves the use of a work under this Licence in text and data mining, under Article 4, p.3 of directive 2019/790/EU, §44b (3) UrhG and similar. These terms only permit such use if the entities involved both in text and data mining and reproduction of material obtained alongside such agree that their “model”, i.e. [...], is indeed a derivative of its inputs
They cite the relevant laws. I'm not a lawyer, but it does sound like they talked to a lawyer about this (I hope).
Also, come on, models can and do output training data verbatim. Why wouldn't they (both the models, and the output copies) be derivative works?