Linux kernel community discussion on ML/LLM tools in kernel development

5 points by jaguar

This is a long comment thread responding to “Toward a policy for machine-learning tools in kernel development,” debating whether ML/LLM tools should be allowed in patch creation and review, and what policy constraints would be needed. LWN.net

Blintk

I don't find these discussions useful or interesting. What matters is the outcome.

Do LLM find bugs?
Do LLMs write useful code?
Do LLMs provide useful insight into codebases I don't have the time or willingness to explore every branch of?
Do LLMs possibly/probably/almost definitely infringe on copyrights of millions and millions of people?
Do LLMs use a disproportionate amount of energy to train foundational models?
Do I care?
Will I continue using them?

The answer to all of those questions for me and millions of other people is yes. Every place code can be written an LLM is going to touch at some point whether we know it or not and I don't consider that a negative. My point of view is that we should strive to achieve outputting good code that's net helpful or fun to the people we care about. Any tool we can leverage in furtherance of that goal is good in my book.

alexandria
The answer to all of those questions for me and millions of other people is yes

Is it? The answer to all of those is "It can and will hallucinate and have random behaviour, leading to you possibly wasting time or, worse, messing up things that other people have to then fix, and those could be things that essentially fuck over real people", is it not?

Take, for example how LLMs summarize information, specifically:

I just realised the situation is even worse. If I have 35 sentences of circumstance leading up to a single sentence of conclusion, the LLM mechanism will — simply because of how the attention mechanism works with the volume of those 35 — find the ’35’ less relevant sentences more important than the single key one. So, in a case like that it will actively suppress the key sentence.

I first tried to let ChatGPT one of my key posts (the one about the role convictions play in humans with an addendum about human ‘wetware’). ChatGPT made a total mess of it. What it said had little to do with the original post, and where it did, it said the opposite of what the post said.

For fun, I asked Gemini as well. Gemini didn’t make a mistake and actually produced something that is a very short summary of the post, but it is extremely short so it leaves most out. So, I asked Gemini to expand a little, but as soon as I did that, it fabricated something that is not in the original article (quite the opposite), i.e.: “It discusses the importance of advisors having strong convictions and being able to communicate them clearly.” Nope. Not there.

To millions of people the answer to the question "Can LLMs accurately summarise data" is "yes, I can get a useful summary out of an LLM", but taken in context of the quotes and article above, the reverse seems to be the case — why should I think of this technology as useful for this task? It seems more like a liability — a tool that makes you think it's useful, rather than being or doing something useful, and this is what I see with most LLM reviews. The author will mention spending hours trying to get the LLM to do a thing, or "it made xyz, but it was so buggy that I found it difficult to edit it after, and contained lots of redundant parts", or "it incorrectly did xyz", and they usually come away with the conclusion "I'm so excited to see how it does in the future", but every time I read an article like that I think — wow, if a junior dev did that the number of times the AI did, they'd be fired on the spot.

See also, something like this article, "Roko's dancing basilisk" where the author tries to use an LLM to deal with a codebase, and comes away with a semi-positive conclusion. However, taking a look at some of the points mentioned:
1. I do not have “Unsupported markdown: blockquote” or “Unsupported markdown: list” unary operators.
2. Oh my God! I can't say how bad this backend matrix table is. It's all sorts of wrong. It's not that it got the supported/non-supported markers backwards, it appears to have just made up the results! [...]
3. The example of writing an instruction to the various formats is wrong for the RS-DOS version—the type and length should be two bytes each, not one.
4. The output format for -t is incorrect—it doesn't show a trace of the code being run unless the TRON directives are in use.
5. Every example of the .ASSERT directive is just wrong as it did not use the proper register references, and memory dereferences need a @ (8-bit) or @@ (16-bit) prefix.
6. Where you can use the .TRON direcive is wrong—it can be used anywhere; it's .OPT TEST TRON that can only be used inside a .TEST directive.
[...]

Overall, this was less obnoxious than having the LLMs write code, but I feel it's still too inaccurate to be let loose on unfamiliar codebases, which I suspect is the selling point.
... most of these things are something that any person would get fired for, and are things that are not positive for industrial software engineering and design, where reliability is important. LLMs appear to do a "lot", but still confabulate and repeat incessantly (and there is some degree of evidence showing that hallucinations are inherent to LLMs as a whole), making it worthless to depend on for practical purposes unless you want to risk spending hours chasing your own tail over something it hallucinated, or behaviour that you relied upon being lost in an update (as I previously argued here). What's confusing here, is that I thought the 2010s push towards static verification and compiler assurance was because we were trying to reduce the error rate in professional software development, not increase it.
- Blintk
  
  When I see someone crapping on the ability of these things to do work it's usually when these tools are fairly unconstrained and without prior documentation guiding it on what to do and where stuff is and why things exist. That kind of documentation helps humans as well. Tell it where to focus, tell it to break down the problem and call the subsets of data or code or the codebase in new contexts to build up from there. Lastly, check the output with tests and let it correct itself. Its effectiveness depends heavily on the context it's managing.
  
  I have checked the output and pair programmed in parallel repeatedly. I have created an LLM friendly environment myself with tools that claude code/codex/gemini know about, hard hooks to ensure that the checks I need to run actually happen with scripts and prompts to let this thing code and utilize all of the stuff I know about. Knowing that it makes mistakes and gets lost and minimizing that with TDD + other techniques reduces the amount of work and number of piecemeal inspections I have to do. I know what the current tools are capable of and I can commit clean code and make PRs that I feel absolutely no shame about with regards to performance and maintainability. It's impossible to simply let these things vomit out code and receive a good product but working with it together makes me faster. Much faster.
  
  I am not the only one. If other people don't realize the same gains, oh well, but I know what I can do with it. This is why I don't find those kinds of discussions useful. It typically turns into me telling people that it's a skill issue and that I'm going to use LLMs anyway and so are tons of other people with varying degrees of realized usefulness.
  
  making it worthless to depend on for practical purposes
  
  I just can't agree. Outputting slop or good code is a developer choice, we just have another tool that can make either easier
  - alexandria
    
    I am not the only one. If other people don't realize the same gains, oh well, but I know what I can do with it. This is why I don't find those kinds of discussions useful. It typically turns into me telling people that it's a skill issue and that I'm going to use LLMs anyway and so are tons of other people with varying degrees of realized usefulness.
    
    Right, so what you're saying is that you're well aware that for you it boils down to "git gud", and the fact that the work of hundreds of thousands of people to try and make computing predictable, verifiable, and sane is going down the drain, nor the experiences of people in a myriad of fields now encountering false positives, hallucinations, random behaviour, and inaccuracies, let alone the experiences of the people whose friends have been driven to borderline-psychosis from these tools — all of that, doesn't matter, because they haven't got good enough to use it?
    
    It's impossible to simply let these things vomit out code and receive a good product but working with it together makes me faster. Much faster.
    
    Have you actually verified this assumption? I'd love to see any hard data that you have.
    
    Blintk
    
    ll of that, doesn't matter, because they haven't got good enough to use it?
    
    No. All of that doesn't matter because I have or at least I feel I have. Their inability doesn't bother me. Their being loud about what goes wrong when they use it they way they do is helpful. Other people talking about curtailing their usage is also helpful because it validates my previous experiences when I tried to use it without proper checks.
    
    Have you actually verified this assumption? I'd love to see any hard data that you have.
    
    Zero hard data other than more completed projects, feel free to assume I'm lying or wrong. The thing is that I'm not arguing for more people to use it or that I'm a better programmer or heck - even better at corralling LLMs than everyone. Rather what I'm trying to say is that its use regardless of what people are seeing from various failures or bad outcomes is rising and will continue for the foreseeable future, good or bad. Arguing against its use is pointless. Use it or don't, but when you interact with the code it outputs that's been put in front of you from other programmers you won't always know. Code is code. That's what I'm trying to say.
    
    alexandria
    
    The thing is that I'm not arguing for more people to use it or that I'm a better programmer or heck - even better at corralling LLMs than everyone.
    
    The thing is that you are, inadvertently, arguing that. Everyone else experiences hallucinations (that are, at this point, considered generally considered to be a fundamental aspect of LLMs) that they don't catch and crop up at runtime (or that take hours to resolve), or an LLM "lying", or producing behaviour that changes across versions and causes critical issues, except you, who is somehow able to produce code without difficult to spot bugs that violate your assumptions, and able to review said validity of the code so fast that it doesn't slow you down. This is despite LLM coding being majority code review (an especially difficult one, given that the LLMs do not have human-level internal models of how the code works and struggle with basic logic), and code review being almost universally considered as being longer and more annoying than just programming from scratch (I had an ACM link and workplace SE ready, but I'm sure you can ask an LLM to hallucinate up those links :) ). Furthermore, you have no testing to back up your claims, so there is no way to differentiate the actual fact versus your experience — which does (in this case) matter, given how LLMs have been shown to make people think they're faster, while actually providing the opposite, reduced quality of output to boot.
    
    Rather what I'm trying to say is that its use regardless of what people are seeing from various failures or bad outcomes is rising and will continue for the foreseeable future, good or bad.
    
    But these things do matter, the experiences of people using LLMs are not just externalities to be pushed aside into a flippant comment — furthermore, LLM use really isn't increasing, and this is trivially provable by pointing at how the winds have shifted in the last half a year. I do genuinely believe that Microsoft scaling down their branch of LLMs after they've seen very little use is only the tip of the iceburg of things to come, at this point.
    
    Use it or don't, but when you interact with the code it outputs that's been put in front of you from other programmers you won't always know. Code is code.
    
    The reason most FLOSS projects are moving to ban LLMs is not because they are good at the tasks that they do, it is because they are poor quality, a waste of maintainer time, and the developer involved cannot reason about the code because they do not have a mental model of the code to explain. The bulk of time invested in "good code production' for developers is almost never "writing code", the externalities here matter as well — code itself is communication, between yourself, the computer, the end user, and any developers that
    
    A point I dropped but is worth paying attention to, is
    
    Outputting slop or good code is a developer choice, we just have another tool that can make either easier
    
    I do not believe this to be the case, a tool that constantly shifts in form and function and provides unreliable outputs depending on dice rolls is not a tool, it is a liability. If you had to make a dice roll every time you used a hammer, that dice roll paid no attention to skill or experience, and you missed over half of the hits, why would you use such a hammer?
    
    simonw
    
    LLM use really isn't increasing, and this is trivially provable by pointing at how the winds have shifted in the last half a year.
    
    If it's trivially provable I challenge you to prove it. Every indication I've seen is that the use of LLMs for software development has exploded over the past year.
    
    (I'm treating that as a separate topic from wider LLM use by regular people.)
    
    Just one number: Claude Code did not exist at the start of 2025. By the end of 2025 Anthropic credit it for over $1bn of annual recurring revenue.
    
    bsder
    
    The problem, as that thread points out, is the amount of "false problems flagged"/"correct problem mis-solved falsely" vs "genuine problems solved correctly". As things stand right now, LLMs create enough false positives that human beings can't possibly triage them all to get to the valid ones--this is effectively a slow motion DDoS.
    
    If you use AI correctly, nobody will be able to tell you used AI--all they will see is a genuine problem solved correctly. If people can tell you used AI, well, that means you either solved a false problem or created a shitty solution to a real problem and your reputation should get penalized appropriately. The only reason I can see for opposition to this is if you want to advertise that you solved stuff with AI which would imply that you are somehow commercially involved with AI and you should probably be banned from the project since your incentives are heavily misaligned.
    
    Do I use modern LLMs for search? Sadly, yes. The LLM companies have access to and been trained on proprietary corpuses(corpii?) to which I, as a mere plebeian individual, have no access to. Consequently, LLMs often do a better job of coughing up sample code that normal search engines have no ability to unearth. This pains me, but refusing to use a much larger search engine than my peers would place me at a disadvantage.
    
    lightspot21
    
    Just to help you since you seem to look for the word: the plural of corpus is corpora.
    
    I also use LLMs for search, in fact that's where they see the heaviest use for me. I'm forced to use it because with the amount of slop out there, normal keyword based search is almost unusably broken, and the LLM companies have scraped away all the signal leaving back mostly generated noise.
    
    I agree with all other points. LLMs are very hard to steer effectively (as in predictably), and I've tried. I've grown to heavily detest, if not outright hate, nondeterminism/unpredictability in software. Why are we doing this? Can we please make the web grepable again? Or am I just an old man yelling at clouds?
    
    k749gtnc9l3w
    
    I've grown to heavily detest, if not outright hate, nondeterminism/unpredictability in software. Why are we doing this?
    
    Because derandomisation is indeed hard!
    
    There are some theoretical bounds when derandomisation means a lot of loss of efficiency; and for some other well-defined algorithmic problems it is often an open problem whether derandomisation without loss of efficiency is possible.
    
    And there are well-defined problems where we have working randomised heuristics but not an understanding of their scope of applicabuility.
    
    So for not-that-well-defined problems people grab whatever heuristic they can reach and try to sell that…
    
    But yes, it is frustrating when this crowds out solutions that were deterministic and working…