Your LLM Doesn't Write Correct Code. It Writes Plausible Code
66 points by kitschysynq
This is also true of humans.
Yet, I know humans who don't just write plausible code.
The post goes over various issues with this AI-generated SQL implementation. To me, these issues feel like something I'd expect from an LLM but not from a human (well, except maybe from people working at horrible software factories that only care about selling man-hours and delivering... something, fast).
Do you think these issues apply equally to software developed by people as well? Maybe the line is fuzzier than I think it is (basically, the line between giving a shit, and not).
Which is curiously anachronistic, since it is only humans who can determine correctness.
Humans can define correctness (sometimes). They don't have a monopoly on determining correctness, though - it's hard enough that we have special semi-automated provers for this.
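For instance, here's a minimal sketch in Lean 4, one such semi-automated prover; the theorem and proof here are illustrative, not from the post:

    -- A human states the correctness property; the machine mechanically
    -- checks every step of the proof. (Illustrative example only.)
    theorem zero_add' (n : Nat) : 0 + n = n := by
      induction n with
      | zero => rfl                       -- 0 + 0 reduces to 0 by definition
      | succ k ih => rw [Nat.add_succ, ih]  -- push the successor out, apply the hypothesis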
This is a pre-Gödelian view of provability. Today, we know that for any particular automated proof environment, either it is too weak to prove many common theorems or it is so strong that Lawvere's fixed-point theorem applies.
Whether or not humans can determine correctness, I think it's plainly obvious that they very often don't in practice.
Humans are the only ones who can make this determination, however. So if some humans don't agree over correctness, that's still a human problem.
Humans are the only ones who can make this determination, however.
We can't say this with any confidence; it's totally subjective. I personally validate LLM-generated code by being explicit about correctness (mostly via verifiable tests). This is because the software I work on is used by humans, and I currently think I'm the best person to validate what they're looking for.
But I see no reason why correctness can't be achieved in a future world where AI is producing code to meet requirements of other AI. Or even for other humans one day. What makes you so sure?
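To make the "verifiable tests" point concrete, here's a minimal sketch of what I mean, where slugify stands in for some hypothetical LLM-generated function (the name and the cases are invented for illustration):

    # Minimal sketch: gate LLM-generated code behind tests the human wrote.
    # `slugify` is a stand-in for the LLM's output; the cases below encode
    # what *I* mean by correct, independent of how the code was produced.
    import re

    def slugify(text: str) -> str:  # hypothetical LLM-generated function
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    def test_slugify():
        assert slugify("Hello, World!") == "hello-world"
        assert slugify("  spaces  ") == "spaces"
        assert slugify("") == ""  # the edge case LLMs often miss

    test_slugify()
    print("all checks passed")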
The entire notion of 'correctness' is an abstract concept born only in the living universe - i.e. it is only life itself which formulates this concept. Rocks don't do it, the sun doesn't do it - and as far as we can tell, animals have their own standards of correctness which mostly orient around survival.
AI is merely a reflection of life - it will never, ever be anything more than a reflection.
Wake me up when any other living species proffers the notion of correctness - whether for evaluation by an AI or by an actual human ...
I think the problem with code - any code, no matter who writes it, human or program - is that the solution space is very large, and there are exponentially more bad, convoluted solutions than good, simple ones.
"If I had more time, I'd write a shorter letter" and all that.
AI models, at least right now and I think for the foreseeable future, don't really have the ability to grasp the bigger picture of anything they're working on, unless it's some trivial greenfield widget. This makes them extremely prone to append-only coding.
You can coax them into producing good code by pointing out all the mistakes they are making and all the inefficiencies they are introducing, but this is a fairly tedious process that often takes orders of magnitude longer than the actual implementation.
Whenever we've improved programmer tooling, with higher level languages or IDEs or whatnot, we've only seen projects get larger and more complex. This trend seems to continue unabated into the Claude age.
My rule of thumb is that hammering code into shape takes around 5x the amount of time as doing the initial implementation with Claude Code. Which is closer to an order of magnitude than not, but also not prohibitive, and still produces notably better results than doing it all by hand.
The other advantage of an LLM is I can make sweeping changes more easily. Previously, if I got something wrong in the domain model (things like a field being in the wrong spot in a large graph of product and sum types), I would either have to spend a tedious couple of hours fixing up all the code sites, or accept the tech debt and move on. Now I can ask Opus to update the types, then follow all the compiler errors until completion. It's nice being able to resolve long-standing tech debt by uttering a few words.
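To sketch the kind of change I mean (the domain model here is invented, and I'm using Python with mypy as a stand-in for the compiler):

    # Invented domain model: a sum type (Shape) over product types.
    # Moving `stroke_width` into the shared Style type is the kind of
    # mechanical, error-driven refactor an LLM can chase to completion:
    # run the type checker, hand it the errors, repeat until clean.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Style:
        color: str
        stroke_width: int  # moved here; every old access site breaks

    @dataclass
    class Circle:
        radius: float
        style: Style

    @dataclass
    class Rect:
        width: float
        height: float
        style: Style

    Shape = Union[Circle, Rect]  # the sum type

    def outline_cost(shape: Shape) -> float:
        # Before the move this read `shape.stroke_width`; the checker
        # flagged every such site, and each fix looks like this:
        return shape.style.stroke_width * 0.1

    print(outline_cost(Circle(radius=1.0, style=Style("red", 3))))  # 0.3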
I have fond memories of the engineering manager meetings we'd have at previous companies, where at minimum twice a year we'd attempt to rationalize with execs the need for a "code freeze" so we could spend an entire quarter cleaning up technical debt.
I am of the growing belief that although LLMs can easily create much technical debt, they can also wipe so much out in an instant. Much of our technical debt in these prior scenarios was things like "Upgrade to React vx.xx", "Migrate off legacy API", etc. - tasks which were indeed tedious but also well-documented pathways (CHANGELOGs etc). We sometimes solved these by writing code generators to get through the process, although looking back, writing the code migration tool was a fun exercise but likely not any faster than grinding through the changes.
LLMs solve this slice of the technical debt pie, which is nice. But my worry is that the debt will accrue at a more significant rate, at least for some companies.
Absolutely, yeah. It is too easy to use LLMs to produce terrible results.
To draw on an analogy, and acknowledging that analogies are never perfect, we're in the C era of LLM use today -- great power but also many sharp edges. There will probably be the Rust of LLM tooling at some point.
We're in the C era of LLMs in terms of dealing with unending minefields of unforced errors.
We're in the Rust era of LLMs in terms of delusional, religious belief in them as a solution to all problems and the imperative to proselytize.
We're in the Smalltalk era of LLMs in terms of making a ton of money for manifesto-wielding consultants who never seem to ship much of anything but are enthralled at the new tools they're using.
We're in the INTERCAL era of LLMs in terms of bargaining, threatening, and pleading with tools to work correctly.
We're in the Malbolge era of LLMs in terms of my sincere desire to never use them for any reason if I can possibly avoid doing so.
The original project makes no claims about performance that I can see, just scalability and correctness. None of the frankensqlite things in the post are what I'd classify as bugs in that context. Absolute performance doesn't even seem to be an explicit goal at this point.
There may be lots of things to criticise there, but "this in-progress reimplementation aiming for extended scope doesn't match the performance of the original" doesn't really tell us anything interesting. If some time is spent on performance optimisations and it's still thousands of times slower for trivial things, then that's different.
It’s important to remember that when you ask an LLM to answer a question, you are effectively asking:
What does the text response to a question like this usually look like?
Apparently, that question is statistically relevant enough for us to use this technology to answer real questions, but it's important to remember that the output is a statistical prediction of a likely response, not an actual response.
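A toy illustration of what that means, using a hand-made bigram table rather than a real model (the corpus and counts are invented):

    # Toy bigram "model": predicts the most likely next word given the
    # previous one, from hand-made counts. A real LLM is vastly bigger,
    # but the operation is the same: score continuations, pick a likely one.
    from collections import Counter

    corpus = "the query plans the query runs the query fails".split()
    bigrams = Counter(zip(corpus, corpus[1:]))

    def most_plausible_next(word: str) -> str:
        candidates = {b: n for (a, b), n in bigrams.items() if a == word}
        return max(candidates, key=candidates.get)

    print(most_plausible_next("query"))  # "plans" - plausible, not necessarily true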
If you've ever talked to an LLM about a topic you know well, you'll realize that quickly.
Just as most people in bigger corporations have seen presentations where incorrect statements sound plausible.
Or how "mass media" is very often simply wrong about scientific topics (e.g. ones where there isn't enough grey area to give the benefit of the doubt on wording).
So clearly humans do that too.
However, for individual humans, creating correct code is a different topic.
Some sci-fi and fantasy movies - and actually movies at large - can seem plausible until you think about it. The goal of these movies is rarely to make absolute sense, because in many situations it's not relevant to the story and might inhibit it. Of course this is not always the case, and it can be laziness, time pressure, or an honest mistake.
However, LLM training seems to be largely on the "my goal is to be plausible" side, and correctness stems, to put it bluntly, largely from copyright violations. You can see this by tweaking the "creativity"-related settings, which will make them drift off toward clearly incorrect code and responses.
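Concretely, the "creativity" settings usually mean a sampling temperature; a rough sketch of the mechanism, with invented next-token scores rather than a real model:

    # Rough sketch of temperature sampling with toy next-token scores.
    # Low temperature sharpens the distribution toward the single most
    # plausible token; high temperature flattens it, so less likely
    # (often incorrect) continuations get sampled - the "drift" above.
    import math, random

    scores = {"SELECT": 3.0, "DELETE": 1.0, "banana": -2.0}  # invented logits

    def sample(temperature: float) -> str:
        weights = [math.exp(s / temperature) for s in scores.values()]
        total = sum(weights)
        return random.choices(list(scores), [w / total for w in weights])[0]

    print(sample(0.2))  # almost always "SELECT"
    print(sample(5.0))  # "banana" starts showing up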
A human programmer usually makes a decision regarding the aim. LLMs at this point in time seem to be unable to, because they lack a grasp of concepts - which, again, is easy to test and verify, again and again. This is a lot easier to see with text-to-image models. For LLMs this was seen with the seahorse emoji.
Honestly, it baffles me when people make wild claims about LLMs, as if these problems somehow magically go away for specific tasks like programming. There are clear examples involving very mundane concepts, ones the training data easily contains, that nevertheless verify completely and very easily that at least the current models don't grasp those concepts.
Given how much money, time, and natural resources go into the development, every day that passes without breaking that wall suggests that current methods are unfit to get there.
That's something that of course might change every day as well. Or never.
I think sometimes the "humans do that too" argument is overused. Yes, humans do that too. Yes, humans even trick themselves with things. But I think the grasping of concepts alone makes a difference.
The whole "thinking" and "being creative" and a lot of such things are a bit too anthropomorphizing.
And just to be clear: I also don't think that humans are somehow magical in the sense that grasping something is impossible for anyone or anything else. I just think that e.g. many animal species today are a lot better at it (or capable of it at all) than LLMs.
And one last thing: I also don't think grasping something is necessary in many jobs/tasks/companies. Pasting an error into Google and blindly copying the first Stack Overflow answer in a loop until things seem to work is fine. Just like restarting a service or reloading a website, it's a common and often good-enough practice, it seems.