On the use of LLM assistants for kernel development
12 points by gnyeki
I’ve clicked through to The Linux Foundation’s generative-AI guidance.
If any pre-existing copyrighted materials (including pre-existing open source code) authored or owned by third parties are included in the AI tool’s output, prior to contributing such output to the project, the Contributor should confirm […]
I would ask how the contributor is supposed to know that – do they just need to be aware of all open source code in existence? However, they do actually have an answer:
[…] some tools provide a feature that suppresses responses that are similar to third party materials in the AI tool’s output, or a feature that flags similarity between […] materials owned by third parties and the AI tool’s output and provides information about the licensing terms that apply to such third party materials.
This seems very naive. This will only detect the most blatant similarities. For the most part, AI doesn’t output training data verbatim. It’s similar, but not an exact replica (see: that one IEEE article). Will this detect variable names being changed to fit the surrounding code? Coding conventions translated to whatever the context is? The output being in a different language than the original?
What if the AI plagiarizes multiple codebases at once, taking pieces from each? I think this is still very problematic. It pretends the problem is solved, while it very much is not (and might never be?)
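For what it’s worth, the renaming concern is easy to demonstrate. Here’s a minimal sketch (everything in it is made up for illustration, and the stdlib `difflib` matcher is just a stand-in for whatever similarity metric a vendor might actually use) showing that a purely textual score drops sharply once identifiers are renamed, even though the logic is character-for-character identical otherwise:

```python
import difflib

# Hypothetical original snippet (invented for this example).
original = (
    "def read_block(dev, lba, buf):\n"
    "    if lba >= dev.max_lba:\n"
    "        return -1\n"
    "    dev.seek(lba * BLOCK_SIZE)\n"
    "    return dev.read_into(buf, BLOCK_SIZE)\n"
)

# Same logic, identifiers renamed to match another codebase's conventions.
renamed = (
    "def fetch_sector(disk, sector_no, out):\n"
    "    if sector_no >= disk.max_lba:\n"
    "        return -1\n"
    "    disk.seek(sector_no * SECTOR_BYTES)\n"
    "    return disk.read_into(out, SECTOR_BYTES)\n"
)

exact = difflib.SequenceMatcher(None, original, original).ratio()
after_rename = difflib.SequenceMatcher(None, original, renamed).ratio()
print(exact, round(after_rename, 2))  # exact copy scores 1.0; renamed scores well below
```

A threshold tuned to catch the 1.0 case can easily wave the renamed version through, which is the whole objection: textual similarity is a proxy for provenance, not a measure of it.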
On a completely unrelated note, have you seen the list of corporate members of The Linux Foundation? Meta and Microsoft are platinum members, Google is a gold member, and there are probably more AI companies that I’m missing.
This is a very obvious conflict of interest. Their guidance won’t ever paint AI tools in a negative light, such as by acknowledging the risk of accidental plagiarism. They won’t ever say that The Linux Foundation doesn’t condone the use of AI tools.
I hope the kernel maintainers won’t fall for this. This feels like another attempt at manufacturing consent for LLMs.
On a completely unrelated note, have you seen the list of corporate members of The Linux Foundation? Meta and Microsoft are platinum members, Google is a gold member, and there are probably more AI companies that I’m missing.
I’d like to have been at the meeting at Google where they decided that the platinum tier of membership was too expensive for them and that they should settle for gold.
This seems very naive. This will only detect the most blatant similarities. For the most part, AI doesn’t output training data verbatim. It’s similar, but not an exact replica (see: that one IEEE article). Will this detect variable names being changed to fit the surrounding code? Coding conventions translated to whatever the context is? The output being in a different language than the original?
If a human would rewrite code to a different language or adapt an existing algorithm to fit the surrounding code would you raise the same objections?
If I see an implementation of open source code, read it, and implement the structure in my own code, at which point do I plagiarize, in your opinion? There are just so many ways to implement an algorithm. It could very well happen that given one algorithm two independent people implement it in nearly the same way.
Having any kind of guidance and guard rails is better than having none, and disallowing verbatim and similar code seems to me to be a good starting point.
What if the AI plagiarizes multiple codebases at once, taking pieces from each? I think this is still very problematic. It pretends the problem is solved, while it very much is not (and might never be?)
I think by your definition, code that doesn’t resemble any single codebase, but could be constructed as a combination of many different codebases with variable renames, counts as plagiarism to you.
If a human would rewrite code to a different language or adapt an existing algorithm to fit the surrounding code would you raise the same objections?
You’re missing the keywords: code written by others, rewritten without giving credit to the original authors. If it’s a nontrivial piece of code – then yes, obviously I would.
If a human would translate a book to a different (human) language and then published it as their own, that would very obviously be plagiarism.
It could very well happen that given one algorithm two independent people implement it in nearly the same way.
Indeed.
LLMs aren’t people, though. It’s not hard to get them to reproduce training data verbatim.
An LLM can’t independently approach a problem in the same way a human does. If it has some code in its training data and it generates some similar-looking output, that’s not just because it’s solving the same problem. It’s literally in the training data.
I think by your definition, code that doesn’t resemble any single codebase, but could be constructed as a combination of many different codebases with variable renames, counts as plagiarism to you.
No, it’s about the process. Let’s say I write (e.g.) a FAT32 driver by opening up the source of a bunch of existing FAT32 implementations, copying and pasting pieces, and then just adjusting them slightly so it works. Then, I don’t credit the original sources.
I think that’s pretty clearly plagiarism. I’ve used the work of several other developers in a very substantial way.
It’s also possible to write code yourself that would end up looking like a mashup of existing sources. You’re solving the same problem, so it’s probably pretty likely. That’s not an issue, though, as you didn’t take the work of others and pass it off as your own.
code written by others
I would say all code is written by others. There is generated code (I mean algorithmically, not through “AI”), but even that has its base structure defined by a programmer. But yeah, code written by others.
If it’s a nontrivial piece of code
That’s the keyword for me: nontrivial.
If a human would translate a book…
Indeed it would be. I’m not sure how I would rate a book compared to code, because depending on the (programming) language, the way a program works can change drastically for the same functionality (for instance, from OO to actor-based languages). I have to think about my opinion on this.
I think of programs more like images or music or recipes, where you can capture the tone and everybody knows where your inspiration came from, because it does the same for you or looks similar, but isn’t so clear cut.
Especially recipes resonate, because you can have a base recipe that you modified over time to the point where the meal is of course the same meal, but entirely your own. (Sidenote: if I modify a recipe, I always link to the original. Not only so that others can use the original, see, and compare, but because it’s the civil/right thing to do.)
For drawings, I know many people will raise an objection here: I’ve seen outcries on Twitter over “tracing” where the image is of a completely different subject but the pose was copied, which seems to outrage some people. I don’t understand that, but it’s another matter entirely.
Like I said I have to think about the problem for myself more, because I don’t yet have an opinion on it.
LLMs aren’t people, though. It’s not hard to get them to reproduce training data verbatim.
And that’s what the setting (as far as I understand) tries to avoid.
No, it’s about the process. Let’s say I write (e.g.) a FAT32 driver by opening up the source of a bunch of existing FAT32 implementations, copying and pasting pieces, and then just adjusting them slightly so it works.
Yup that’s bad. (Note to the reader. I had 5 paragraphs here about rewriting an existing implementation. I was rambling and in the end I didn’t know what the point I was trying to make actually was. So hopefully the new approach will work better)
What would you say if you took a well-commented codebase, deleted all the code, and only left the function definitions and the comments? The structure will stay the same since you implement it as the other person implemented it, but the code is entirely your own. The inner workings stay the same, the order stays the same, but you didn’t copy a single line of code. Would this still be plagiarism in your opinion?
In my mind we are not talking about copy pasting and renaming when necessary, but copying the mechanism if that makes sense.
And that’s what the setting (as far as I understand) tries to avoid.
My point is that this will only prevent the most blatant plagiarism. There are so many ways in which the output of an LLM could differ from the original training data, just like how a person plagiarizing something will change a bunch of stuff here and there to make it less obvious – except in the case of LLMs that’s not intentional, that’s just how they work.
The Linux Foundation instead just pretends there’s no risk there, downplaying the issues LLMs have with plagiarism.
The inner workings stay the same, the order stays the same, but you didn’t copy a single line of code. Would this still be plagiarism in your opinion?
…yes. You copied all the comments, which (since it’s a well commented codebase) presumably a lot of work went into.
If your point is to try to ship-of-theseus yourself into using someone’s work in a way that won’t technically count as plagiarism… who cares? That’s not what LLMs do.
If your point is to try to ship-of-theseus yourself into using someone’s work in a way that won’t technically count as plagiarism… who cares?
My point is that especially in hardware-land some code works the way it does because it’s the only way it can, and that reproduction of algorithms can take many forms. A working Samba implementation? Plagiarism of Microsoft. Copy how Windows manages sleep so that it actually works under Linux? Plagiarism of the Windows driver.
I have done too much reverse engineering in the past, and for that reason alone feel that I have done too much copy-pasting (even if I only looked at the stream of data I captured). Yes, I copied the principles of most things, but I think (or hope) that I didn’t plagiarize, because I just wanted an open alternative.
That’s not what LLMs do.
I’m not an expert, and yes, most do right now since it’s just stochastics at work, but I would hazard the guess that it really depends on two things. (Note to reader: assume I don’t know what I’m talking about. Please take the following with a grain of salt, and correct me if I’m wrong.)
Token length: Sure, if a “word” is a token and you write something unique, it will reproduce your text verbatim. Flibberflop qwer sovervault.
Training data: Yup, qwer has to follow Flibberflop. But if the token length were, let’s say, 1 byte, reproducing this should become impossible if it appears just once in the training data and the training data is a few PB.
If your code appears only once in the training data, reproducing it verbatim will be harder. That’s why it’s relatively easy to reproduce the first page of Alice in Wonderland or Harry Potter: since it’s the first page, it will be quoted multiple times. But picking a random page and trying to reproduce it is harder, or not possible.
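That repetition argument can be sketched with a toy experiment (everything here is invented: a tiny fake “corpus” stands in for training data, and raw n-gram counts stand in for how often a span is seen during training):

```python
from collections import Counter

# Toy stand-in for a training corpus: a famous opening line quoted many
# times across the corpus, and a later passage that appears only once.
corpus = ("Alice was beginning to get very tired . " * 50
          + "the Queen turned crimson with fury . ")

tokens = corpus.split()
N = 5  # count 5-grams as a crude proxy for memorizable spans

counts = Counter(tuple(tokens[i:i + N]) for i in range(len(tokens) - N + 1))

famous = tuple("Alice was beginning to get".split())
rare = tuple("the Queen turned crimson with".split())
print(counts[famous], counts[rare])  # the oft-quoted span dominates: 50 vs 1
```

A span the training process sees 50 times is far more likely to be emitted verbatim than one it saw once, which matches the intuition above about first pages versus random pages.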
Oh, and my point is to not have double standards, and to promote transparency. I always fear that one side gets their will instead of both sides working together and finding a middle ground. I would really like informed and honest rules. I want to know what is OK to reproduce and what tools I can use. Right now, coding with LLMs seems kinda pointless; its main purpose is to write boilerplate code to get you started, or to convert some code you wrote to fit your new schema. But I don’t know how it will work in the future.
Anyway, I try not to take a side and am still in the process of forming my opinion, so I really appreciated talking about this with you. I think I have a few things to reconsider.
A working Samba implementation? Plagiarism of Microsoft.
Wikipedia (sorry, they don’t link to a primary source) says that Samba was initially developed by analyzing packet traces. The first recommendation in the “hack Samba” section is to get a copy of Wireshark. They also ban contributions from people who have read Microsoft’s source code, to avoid licensing issues.
Presumably that very code is in LLM training data, as people have mirrored it on Github.
Furthermore, the first sentence on samba.org is
Samba is the most feature-rich Open Source implementation of the SMB and Active Directory protocols for Linux and UNIX-like systems.
They’re very upfront about what exactly they’re implementing, so even if you think this approach still falls under plagiarism – they’re giving credit to the original protocol.
FWIW, I don’t think projects should need to follow Samba’s standards. For example, re3 was a project that tried to reverse GTA 3, directly translating copyrighted binaries to C++. They got sued, the project got shut down – but I’d argue they weren’t doing anything unethical. It was very obvious who the original authors of the code were, you still required a copy of the original data files so it didn’t enable piracy, etc. If you took some functions from there and reused them in another game, you’d step into plagiarism territory.
My point is that especially in hardware-land some code works the way it does because it’s the only way it can
In some cases it’s possible to write an implementation from scratch, based just on the specification/datasheet, and reverse engineering the device as a black box. If you’re using an existing implementation, it deserves to be credited. This goes for both open source implementations, and reverse engineering of closed source ones.
Yet, with an LLM, you don’t really know how much of the source is original. Maybe it’s mostly novel, and it just translated the spec into code. Maybe it’s a mashup of some drivers from your kernel, and some other kernel that already has support for this device. Maybe it will be similar enough to be noticed by an automated tool, but maybe it won’t. You don’t have a way to make sure that the output is original.
I just wanted to make the point that reproduction of other people’s software can exist and be based on reverse engineering. The Windows sleep example is probably a better one.
If you’re using an existing implementation, it deserves to be credited.
Yup, it does.
Yet, with an LLM, you don’t really know how much of the source is original.
And that was one of my points. If you can’t trace it to a specific implementation, isn’t it effectively a reimplementation? Maybe the code is really unique and based on assumptions. Most code I got out of LLMs was so unusable that it would be embarrassing if it were a copy of something.
Edit: redacted a statement about Samba over a missing source (citation). I had conflated the French-café analogy and the US export regulations with something else in my head over the years.
Interestingly enough, “is code like a book or like a recipe” was a key question when formulating copyright law, because the former can be and the latter cannot. It was decided that it’s closer to a book. But we almost had a world where copyright didn’t apply to code.
If I see an implementation of open source code, read it, and implement the structure in my own code, at which point do I plagiarize, in your opinion
Open-source projects that replicate some closed-source counterpart have contributing clauses such as “if you’ve ever merely laid eyes on the closed-source competitor’s codebase, you are not allowed to contribute” – Wine, ReactOS, etc. In the past, people used clean-room reverse engineering to stay on the right side of that: https://en.m.wikipedia.org/wiki/Clean-room_design
As far as I understand, that is mainly to avoid lawsuits, and there are even people claiming to have found verbatim recreations of Microsoft code (from a leak) in the original ReactOS codebase. But there are plenty of projects that don’t have those clauses (like ReactOS and Wine do).
Ladybird is written by Andreas Kling, who actively worked on browsers before and often refers to how he did the same thing at Nokia/Apple. He is not using the code he wrote, but he is still applying the knowledge he gained. There are probably some lawyers out there who would sue him into the ground for that (with or without a clause in his original contract), but I hope they don’t (not only because Safari is based on WebKit, which is open source).
My problem is that copyright can take too many forms, and copying basic principles can be seen as bad as well. That’s why Apple integrated a string into the hardware that is checked (which I won’t reproduce here) and that is needed when emulating the hardware. I don’t know (and don’t want to know) what is protectable in which country. It’s arbitrarily chosen how much of a work I can quote and how much I must add to be allowed to do so. I can’t (easily) build (and sell) my own hardware if someone patented the working principles behind it. AV1 needed to pool patented algorithms; if I by accident invent something that works the same way, I’m still liable. Even if I do a clean-room implementation, if the working principle is the same, I can still get into trouble (in some countries).
NOTE: I’m not a lawyer; I could be wrong about the statements made above. I just wanted to make the point that there are fine lines that can be crossed by having something (code), knowing something (having seen the code), and reproducing something (disassembling, or guessing [black box] the function).