AI should help us produce better code
47 points by simonw
Little bit of a personal manifesto, this one. Writing bad code with AI is a choice, and we can and should choose to do better.
So someone starting out now with LLM programming and only prompting, how is that person expected to learn the code? By osmosis? At what point in the career of a "LLM-based programmer" are they to turn the work over to an LLM? And what's the point of micromanaging an LLM? At what point would it be easier to just, you know, write the code themselves?
I remain optimistic that people who want to learn to program will find ways to assist their learning with LLMs without failing to learn anything themselves, similar to how calculators didn't stop motivated people from learning mathematics.
LLMs massively accelerate my own learning, but I had 20 years of pre-LLM experience to lean on so I'm pretty far removed from anyone starting to learn to build software today.
But how will someone know that the code is bad without learning how to program first? Or even that there is technical debt in the first place?
My wife is an English teacher, and this is pretty much her point: how can they trust an AI to write well when they themselves don't know how?
This is like the difference between NP-complete solution checking and solution finding.
It's easier to check a result sometimes than it is to find the result.
If you are able to verify that the LLM came up with something cogent, that passes basic grammar rules, and contains no spelling mistakes, you can use it to create something in a style that you're worse than average at producing yourself.
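The verify-vs-find asymmetry the parent mentions can be made concrete with a toy subset-sum sketch (an NP-complete problem, names and numbers invented for illustration): checking a proposed answer is a single linear pass, while finding one may require searching exponentially many candidates.

```typescript
// Checking a certificate: one pass over the proposed subset. Cheap.
function verifies(nums: number[], target: number, subsetIdx: number[]): boolean {
  const sum = subsetIdx.reduce((acc, i) => acc + nums[i], 0);
  return sum === target;
}

// Finding a certificate: brute force over all 2^n subsets. Expensive.
function find(nums: number[], target: number): number[] | null {
  const n = nums.length;
  for (let mask = 0; mask < 1 << n; mask++) {
    const idx: number[] = [];
    for (let i = 0; i < n; i++) if (mask & (1 << i)) idx.push(i);
    if (verifies(nums, target, idx)) return idx;
  }
  return null;
}
```

The analogy to LLM output: reading a draft for grammar and coherence is the `verifies` step; producing that draft from scratch is the `find` step.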
They should learn to program if they want a great career.
I expect companies will eventually conclude that software engineers who know how to program are worth more than software engineers who don't.
But the problem is that for a software engineer a bit in the future, knowing how to program isn't weighed against how good other engineers are; it's weighed against "buy a subscription or hire an engineer". You could be superb at programming and still lose out to bean counters comparing relative costs rather than merits. The question "is it worth learning to program?" then isn't about competing with other engineers, but about competing against the corporate idea of how much work the AI will do for peanuts.
"Bad" and "good" code aren't well defined. These are aesthetic qualities. Is Arthur Whitney's buddy allocator well-written?
I will say that LLMs tend towards syntactically standard and fairly idiomatic code (probably much more tightly clustered than human code, as someone who has spent a good deal of my life reviewing academic code that runs and by all accounts shouldn't).
Spending less time wrangling setup/config and syntax means people can actually dedicate more time to understanding the craft of software. As people with several decades under our belts, it can be easy to forget the pain of not understanding why a certain library is producing an error when it's entirely orthogonal to the problem at hand. Leaky abstractions are distracting.
I'd encourage anyone skeptical of the learning value of LLMs to try building a feature on a complex open-source project of their choice, in a language and environment they are unfamiliar with. Be curious and open-minded, ask it questions, prompt it well, give it access to the tools and context it needs (MCP for documentation, an isolated build environment where it can run shell commands, etc.). Use the frontier models. Do not bother with non-agentic flows unless you are carefully prompting (and ultimately human-in-the-looping every shell invocation and file read to and from the chat buffer).
Good code is code that is easy to understand, navigate, extend, and delete. To a certain extent these are aesthetic, subjective qualities, but they're also qualities that can be measured. In the same way that we can talk about objective UX qualities of a particular design (acknowledging that that design will also always be a subjective experience), we could talk about the objective UX qualities of a particular implementation (acknowledging that many details of the implementation are entirely subjective).
Unfortunately, there's a huge dearth of high-quality research on what good code actually looks like, which is why most people perceive it as mainly a subjective thing. There are some different attempts to try and define good code, and I think most experienced developers grow a kind of intuition for it, but the field is still very much in its infancy.
My own experience is that LLMs produce poor-to-mediocre code, at least by my own standards. That is, the code they produce works, but is usually more complicated than it needs to be, and LLMs struggle to simplify code down. When fixing things, they mostly fix by adding new code rather than fixing by deleting or cleaning up old code. They can do very basic refactorings like factoring shared code out, but they can't make higher-level decisions about how to structure modules. They remind me a lot of Uncle Bob's refactoring process: each individual step kind of makes sense, the end result, however, is insanity.
And I don't think that's going to improve, because, like I said, we don't really have clear answers about what good code looks like. We don't have good measures of navigability or extensibility, and without objective measures, it's difficult to train the underlying models.
That said, how much is good code really worth? Like I said, good code is quick to understand, navigate, extend, and delete. When you onboard a new developer into a codebase, those qualities are all very important — they reduce how long it takes that developer to figure out what's going on and be able to make their own changes. But LLMs can churn through mountains of code far quicker than a human can. And while good code can also make LLMs quicker (at least in my experience), the amount of time that it saves is far less than the amount of time that it takes to convince an LLM to write good code.
I agree to an extent with the idea that LLMs enable a kind of higher-level approach, where the developer can think about the overall goals of the code, and the LLM can worry about the details of what variables get named or how function calls get broken up into parts. But to a certain extent that makes me sad, because one of the things I found most interesting about programming was this question of how to write good code, and now it feels like that's just not a relevant question any more.
Appreciate the detailed reply.
My own experience is that LLMs produce poor-to-mediocre code, at least by my own standards.
I think it's important to specify the workflow and the models used in this day & age. LLMs span everything from ChatGPT 3.5 (unusable in my experience for any serious programming work) to GPT 5.4/Opus 4.5 (serious RLHF to amplify programming ability, can zero-shot most simple problems even without agentic flows and tool calls).
easy to understand, navigate, extend, and delete
This feels like a natural definition, since these are desirable properties of the produced code, but I find it obfuscates some of the tradeoffs inherent therein. Easy to understand usually entails simplicity and a low abstraction overhead. Extensibility implies low coupling & proper separation of concerns, as well as judiciously chosen abstractions.
An illustration of this tradeoff is the code that is written to be extensible in every imaginable direction and ends up significantly more complex than a simple prototype.
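That tradeoff can be sketched in a few lines (a contrived pricing example, all names invented): the "extensible in every direction" version and the straightforward version compute the same thing, but one is readable at a glance and the other pays an abstraction tax up front for flexibility it may never need.

```typescript
// Extensible version: pluggable rules, open for extension, heavier to read.
interface PriceRule {
  applies(total: number): boolean;
  adjust(total: number): number;
}

class PricingEngine {
  private rules: PriceRule[] = [];
  register(rule: PriceRule): void {
    this.rules.push(rule);
  }
  price(total: number): number {
    // Apply each matching rule in registration order.
    return this.rules.reduce((t, r) => (r.applies(t) ? r.adjust(t) : t), total);
  }
}

// Simple version: the same behavior today, trivially understandable,
// but it must be rewritten the day a second rule shows up.
function priceWithBulkDiscount(total: number): number {
  return total >= 100 ? total * 0.9 : total;
}
```

Neither is "good code" in the abstract; which one is good depends on whether that second rule ever actually arrives, which is Carmack's point below.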
Quoting Carmack on YAGNI:
It is hard for less experienced developers to appreciate how rarely architecting for future requirements / applications turns out net-positive
Or similarly, premature abstraction is the root of all evil. Never really been an adherent of XP or other software "methodologies", but continuous refactoring does segue nicely into my main point:
The best software is rarely written in a single pass. One of my first programming mentors, upon my dismay at the quality of some code I had written, just laughed and told me to rewrite it using what I had learned. "You're never going to get right the first time. Programming is functional exploration of a design space, and every wrong path in the tree helps you shrink the world of possibilities. Once you have eliminated most of the ugly and wrong solutions, what remains should be beautiful, or not too far from it."
I have rarely seen the platonic ideal, the divine revelation of the perfect solution to a problem arriving instantly. LLMs let you explore the solution space significantly faster. It's fine if you toss away the code; maybe it doesn't eliminate an interesting portion of the design space. But even the most steadfast skeptics I know have started adopting these tools, because iteration is illuminating. Sure, they will not outperform an expert who has spent thousands of hours carefully crafting a project. They do not replace expertise (yet). But everyone needs to touch an environment or code they themselves have not written, at least once in a while. And going from zero to a working prototype has never been quicker.
We don't have good measures of navigability or extensibility, and without objective measures, it's difficult to train the underlying models.
RLHF can work with preferences (beyond just instruction prompting). There's a reason these workflows are being forced down a lot of peoples' throats: companies want people with taste and discernment to generate data for fine-tuning these models, after they have been nudged, prodded and cajoled into a desirable state.
But to a certain extent that makes me sad, because one of the things I found most interesting about programming was this question of how to write good code, and now it feels like that's just not a relevant question any more.
This attitude saddens me a bit. It's more relevant than ever! Now that boilerplate/syntax issues are essentially handled (macros can finally be more about readability than writing convenience), iterating on code to improve it is drastically cheaper in terms of time. I can get a working prototype faster, debug issues with it faster, rewrite it from scratch using what I learned faster.
We've fallen back into the temporary trap of "more lines of code written = better", but I think very quickly we will shift to "more lines of code rewritten/polished = better". Reimplement an entire abstraction to be more flexible, simplify unused/redundant features, these are all activities that are incrementally but systematically getting easier to do with these tools.
I think it's important to specify the workflow and the models used in this day & age. LLMs span everything from ChatGPT 3.5 (unusable in my experience for any serious programming work) to GPT 5.4/Opus 4.5 (serious RLHF to amplify programming ability, can zero-shot most simple problems even without agentic flows and tool calls).
My workflow currently is largely Opus 4.6 with occasional uses of GPT 5.4 with the OpenCode harness. I agree that this setup can usually solve most problems; my contention is that the code it uses to solve those problems is usually poor-to-middling.
This feels like a natural definition, since these are desirable properties of the produced code, but I find it obfuscates some of the tradeoffs inherent therein. Easy to understand usually entails simplicity and a low abstraction overhead. Extensibility implies low coupling & proper separation of concerns, as well as judiciously chosen abstractions.
In my experience, the best abstractions improve both understandability and extensibility. If you find the right boundary between two chunks of code that meaningfully do different things and have separate concerns, you free the reader from having to maintain both chunks in their head at once, and they can concentrate only on the area that they need to. This is hard, and it isn't always possible, but it can work.
That said, in general I agree with you to a certain extent. It's the same way that it's difficult to build a user interface that is both simple enough that a beginner can get started with little explanation, and complex enough that it efficiently supports all the workflows needed by an expert user. There are some places where those two desires work well together, there are others where they naturally oppose each other. You probably can't build a perfect UI, just like you probably can't write perfect code.
But again, my contention is that the code of most agents is poor-to-middling, and I think even if we can't achieve perfection, most developers should at least strive to write good code.
This attitude saddens me a bit. It's more relevant than ever! Now that boilerplate/syntax issues are essentially handled (macros can finally be more about readability than writing convenience), iterating on code to improve it is drastically cheaper in terms of time. I can get a working prototype faster, debug issues with it faster, rewrite it from scratch using what I learned faster.
This doesn't really match my experiences. The writing forms the understanding, and offers the space to see how the code can be broken down and understood. Going faster helps if I already understand everything there is to know about the code, but by letting the code be generated, I no longer understand the code, at which point I need to go slower to be able to take my time to understand the problem.
My solution at the moment is to accept lower standards — I still ensure quality in terms of the end result, and I'm confident enough in my testing strategies and validation that I don't think there are problems there. But I simply cannot maintain the same level of quality of code and also gain any sort of speedups by having LLMs generate code for me. Those two things are simply incompatible.
I've also tried the approach where I use the LLM more as a reviewer and rubber duck, and I think that makes me slower, but does improve the quality of the code and my understanding of it. But unfortunately that doesn't appear to be the direction that the industry is going, so right now I'm experimenting in the other direction.
We've fallen back into the temporary trap of "more lines of code written = better", but I think very quickly we will shift to "more lines of code rewritten/polished = better". Reimplement an entire abstraction to be more flexible, simplify unused/redundant features, these are all activities that are incrementally but systematically getting easier to do with these tools.
In my experience, these are specifically the tasks that I find hardest with LLMs, and the ones they seem least suited for. LLM code seems to me to be mainly additive — techniques like shredding, for example, which to me are vital in terms of maintaining code quality, are really hard to coax out of an LLM. I can do it occasionally, but it's usually a fight. Maybe this will improve, maybe I need to learn new techniques, but right now that's the sort of thing that makes me feel like I'm quicker doing this stuff by hand.
I attempted to define "good code" here, I should cross link that! https://simonwillison.net/guides/agentic-engineering-patterns/code-is-cheap/#good-code
Lucky 10000: The philosopher Robert Pirsig tackled this question in his book, Zen and the Art of Motorcycle Maintenance. Pirsig's conclusion was that "good" means "what you like"; in more words, "this is good" means "I like how I feel when I observe this". He built up an entire metaphysics of qualities and insisted that Quality, that which precedes all value judgements, must be undefinable.
I expect your kneejerk response might be to ask what Pirsig might think of Whitney's code. Pirsig actually did experiments with university students, asking them to rank each other's essays and determine which essays were Good. These experiments were the basis for his later books.
the calculator comparison is interesting; basic calculators illustrate your point well, but thinking back to my high school and college math, there were plenty of times where i could have used more advanced features of my graphing calculator to skip work entirely, had i not been motivated to learn the concepts. the more advanced calculator was capable of things the simple calculator is not, and thus it has markedly different capability for interrupting my own learning: impossible with a simple one, a matter of choice and will with the graphing calculator.
and so, now that we have pseudointelligent language and text calculators that can similarly “save” people from doing different kinds of work that may or may not benefit them to know, i think the question becomes: where is the balance? are the AI firms sufficiently incentivized to give us calculators that encourage people to learn, or will the market drive them to make calculators that are eager to encourage people to take a load off and not worry about it?
I too am optimistic about people figuring out new trajectories to learn.
Oral tradition values memorization, and the ability to remember probably declined when writing was introduced. But it also opened up many new possibilities.
There is apparently some evidence in education that writing things down by hand has better learning outcomes than using a computer.
People will figure out a way to learn about programming in this new world, but their path will look very different than the one we took. There are likely some things they'll be a lot worse at, but it wouldn't surprise me if they become better at other things.
Besides learning, LLMs are also a bit like abstraction. Abstraction allows me to reuse capabilities I don't quite understand myself - a sorting algorithm for instance. LLMs also let you do that, in a weird new way. Just like using abstractions has risks, LLM reuse brings its whole new set of risks, but definitely also opportunities.
If your learning is so massively accelerated now surely that 20 years can't count for much considering that everything you learned during that time you had to learn sooooo sllloooooowwwwwwllllyyyy
The thing I value most from those previous 20 years is the accumulated experience.
Learning "YAGNI means You aren't gonna need it" is one thing. Accumulating dozens of personal experiences where you ignored that rule and build software that was harder to maintain and those future-facing predictive design choices turned out not to matter is something else entirely.
Likewise, I built so many things in those 20 years - which means on a new project I can think "back in 2011 I tried using Redis for something similar and it worked really well, let's dig up that old code and have a look".
Just realized I wrote a whole thing about that the other week, Hoard things you know how to do.
I broadly agree that LLMs aren't necessarily a barrier to learning to code, but they definitely can be. Calculators still require you to understand the question and approach it in an analytical way. However, it's very easy for someone to completely offload that to an LLM and end up learning nothing.
And what's the point of micromanaging an LLM? At what point would it be easier to just, you know, write the code themselves?
I cannot speak for others but in my personal experience, I have found that LLMs let me write code faster than in the past. This is particularly because they:
You'll notice all of this is with respect to coding at the class or "small feature" level. And that's exactly what I think I can trust LLMs with as of today. For anything requiring high level design work or thinking at "scale" you still need to do much/most of the work yourself.
This has been my experience. I write all the "thinky" code, and I throw LLMs at the boring but checkable tasks like "do this mechanical API refactor" or "rewrite these tests from this old framework to this new one".
So someone starting out now with LLM programming and only prompting, how is that person expected to learn the code? By osmosis?
"Osmosis" sounds like a situation where they're prompting an LLM to write all the code/application/features/bug-fixes/etc. and either accepting the results (e.g. for personal use), or throwing it into a code review for someone else to deal with, etc.
That's not the only way to use LLMs though. I've found myself asking why it wrote certain things; what particular options are for (which I wasn't familiar with before); why it chose to do X instead of Y; etc. That's helped me learn new things (although I've been programming for decades by now). Getting an LLM to review code can be useful too, even if the code was LLM generated: it can give hints about what sort of questions to ask, e.g. giving keywords like "refactoring", "tech debt", etc. that can encourage further learning.
(Of course, this requires some faith in the LLMs not to lead people down delusional rabbit holes!)
This is the natural progression of the industry. When learning to program, how much assembly did you personally write?
I wrote a lot, because I was compelled to learn it. But, most did not, and that's fine.
Just wanted to slip in a quick thanks for continuing to contribute here. The atmosphere on a lot of threads turns caustic, but I appreciate your posts and comments on these topics.
There's a lot of heat and not a lot of light, and I think clear heads are important to help guide younger and less experienced engineers through strange times.
Yeah, I've been meaning to say the same thing for a while. It takes a lot of resilience to continuously show up in hostile spaces and @simonw has shown a lot of patience.
This is not a hostile space. This is a neutral space. There are AI-hostile spaces adjacent to this one, like Awful Systems, where he doesn't post. That said, there are also AI-boosting spaces, like Less Wrong, which are also adjacent and where he also doesn't post. Maybe the sort of boosts that Simon posts are only well-suited to neutral spaces.
Damn you! I have a draft waiting to be finished titled "AI makes me care more about code quality".
This reminds me of C++/C developers who say that writing good code is a matter of discipline. Sure, it is theoretically possible to write correct code in any language. However, since you're talking about 'us' in the aggregate, our work is shaped by the tools we use and the ecosystem around those tools.
The industrial revolution resulted in workers losing control over the quality of the goods they produced. Goods became less durable but cheaper to produce, because that's what profits industrialists. We're seeing a similar thing happen today with software. Is it possible for an individual developer to become an expert at building robust software? Of course, just as someone today can work hard to become a world-class weaver. But the incentive structures are making it harder and harder to do this. A blog post insisting that we can build robust software if we adjust how we use AI increasingly feels like it's missing the forest for the trees.
It does feel like a distinction between "hand made" artisanally crafted code vs mass-produced slop.
However, unlike durable goods, code is infinitely copyable. AI can be trained on, adapt and incorporate code, which is a degree of flexibility mass production didn't have. So the effects may be quite different in the long term.
I'm still not sure what software is going to look like in the AI era. The incentives change a lot.
Our notion of tech debt and maintainability completely changes. Does code need to be maintainable if you can tell AI to rewrite the whole thing on a whim?
This time though, we might see a different kind of scaling of possible damage with the number of qualified people deciding they no longer have anything to lose. (W.r.t. quality drop in general, which was already scary before LLM use for code)
I agree with the premise, but disagree on some nuance.
Shipping worse code with agents is a choice.
Not shipping worse code with agents is a choice. Worse code is the default; making it better requires deliberate action. Most of the AI users I know (probably 10-20 personally at this point) are in the "source code doesn't matter" camp: they don't read it at all. Quality is irrelevant to them, so it suffers. By that I mean their projects often have unintentional emergent properties, run to tens of thousands of lines for relatively simple feature sets, and are impossible to comprehend with a human-sized context window. But it doesn't matter to them, because they never look at the code. We'll see if any of these projects survive into the "has users" phase.
Another thing I've noticed (which I also noticed in real people before AI) is that technical debt isn't usually downstream of decisions we made and realized were mistakes, but rather of the decisions we didn't make in the design phase at all. A sort of Brownian motion by implementors with different intuitions.
I haven't really figured out how to get AI (or humans) to do this reflection to even find where opportunities to improve exist, I've heard tales of workflows that do it, but I've yet to see a completed project by the folks advocating them in order to judge the code for myself.
So for now, I just audit the codebase every few features and do some refactors myself and suggest specific ones to it when I'm feeling lazy. It also helps me not lose the plot, architecture wise. I know a couple other people who do it this way, and they get good results. As for mine, we'll see, the project is far from complete.
There's an interesting article "Forget Technical Debt" that I think is saying something similar to what you're saying — technical debt isn't just the be-all and end-all of bad code, but it's just one node in a complicated web of issues that affect code quality. I don't know that I fully agree with all the specific examples the author gives, but the concept of this dependency graph of causes of complexity seems very useful to me.
EDIT: And your experience with LLMs seems to match mine. That reflection step, plus a lack of higher-level thinking seems to really be lacking, and that's what feels really key to writing good code. My process at the moment involves using agents to review changes, and it occasionally helps, but mostly it feels like this just isn't something LLMs are good at.
I haven't really figured out how to get AI (or humans) to do this reflection to even find where opportunities to improve exist
In the real world, at scale, all of these things suffer from a lack of attention and energy. AI is sucking both of these in every context I've seen so far. People are flooding the zone with slop. Engineers are posting AI generated documents that other engineers pretend to have read.
Similar to social media, most people need to shield themselves from it because they're not able to use it deliberately and consciously.
I've heard this argument in the Oxide episode about rigor. I didn't buy it there and here too the word "should" is doing a lot of work.
One factor is the individual and whether they're able to produce better code and maintain this attitude. We have to ignore that factor because we can never assume to have a good individual. Our assumptions have to work for the mediocre person.
The other part is whether this technology has an intrinsic drive towards one practice of software engineering over another. As of now (as could also be seen in the Amazon backpedal on LLM-written code that was covered in the FT), the tendency is for more volume and less deliberation. There is nothing in the technology that promotes less volume and more deliberation.
There is nothing in the technology that promotes less volume and more deliberation.
That's entirely true, which is why it's up to us as professionals who use this software to push the way we use it in that direction.
It is odd that a lot of people seem to assume the point of coding agents is that you just become a feature factory and never pay down technical debt. Even though we all know what that does to codebases.
I found this article about agent PR review helpful – TLDR, have multiple agents with multiple models vote on bugs, then have another one confirm and fix the bugs. You can repeat this as many times as you like. It finds a lot of bugs that I don't think I would have found even if I had authored the PR myself.
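The exact workflow in that article isn't reproduced here, but the aggregation step it describes (multiple models vote, a judge confirms) might be sketched like this. Everything below is a hypothetical shape, not any tool's real API: each reviewer returns a list of suspected bug identifiers, and only findings flagged by a majority survive to the confirmation pass.

```typescript
// Hypothetical reviewer output: one model's list of suspected bug IDs.
type Review = { model: string; suspectedBugs: string[] };

// Keep only findings flagged by a strict majority of reviewers; these
// would then go to a separate "judge" agent to confirm and fix.
function majorityFindings(reviews: Review[]): string[] {
  const counts = new Map<string, number>();
  for (const r of reviews) {
    // De-duplicate within one review so a model can't vote twice.
    for (const bug of new Set(r.suspectedBugs)) {
      counts.set(bug, (counts.get(bug) ?? 0) + 1);
    }
  }
  const threshold = Math.floor(reviews.length / 2) + 1;
  return [...counts.entries()]
    .filter(([, c]) => c >= threshold)
    .map(([bug]) => bug);
}
```

The voting filters out single-model hallucinated findings cheaply, which is presumably why the confirm-then-fix pass can afford to take the survivors seriously.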
I look forward to when it's just standard practice to have agents periodically reviewing your code for architectural issues, perf problems, accessibility bugs, whatever.
Generally agree with the principles in this article but I think "don't take technical debt" is as idealistic as it was pre agents. Sometimes in service of other goals (whether business or technical), you have to take on the debt temporarily.
Where I will absolutely agree though is that you can pay down this debt so quickly with agents. I make cross-codebase changes (e.g. refactor function to struct, rename all functions belonging to a certain idea) at a scale I would have psychologically balked at in the past. I think there's no excuse anymore for not cleaning up technical debt that is bothering you.
Also slightly orthogonally, read the Compounded Engineering article mentioned in the post and found it very idealistic in places especially when it says:
often actually pretty close to what you envisioned, especially if your plan was well-written.
On anything more than a prototype, I will work hard on writing and iterating on a plan, get an agent to implement it but slowly throw away 70%+ of the code as I understand the problem better and improve the solution more and more.
I think making assertions like this sets up unrealistic expectations and is actively harmful: people try agents and are disappointed when they don't live up to those expectations.
My favorite example of this is adding TypeScript types to a JavaScript project. I sometimes prefer prototyping without types, but I like having them later when a project gets bigger. In the past I have put off the somewhat tedious step of going back and adding types everywhere. Now it's easy. LLMs are pretty good at it (given linters / etc to enforce things like "no use of the any escape hatch") as long as the types aren't too complicated, and I enjoy figuring out a good representation for complex types myself anyway.
(Concretely, here's a commit which was mostly claude. My contribution was setting up strict linting and saying "go", and of course reviewing after. Could I have done this myself? Yes, trivially, but it would have taken me half an hour which I was happier spending doing something else.)
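For readers who haven't done this migration, the kind of change involved looks roughly like the following (a made-up example, not from the commit above): an untyped JS function gets an explicit interface, and with `"strict": true` in tsconfig plus a lint rule such as `@typescript-eslint/no-explicit-any`, the `any` escape hatch is closed off.

```typescript
// Before (plain JS, shown as a comment): shapes are implicit, and a typo
// like `user.nmae` only fails at runtime.
//   function greet(user) { return `Hello, ${user.name}`; }

// After: the shape is explicit, so the compiler catches misuse.
interface User {
  name: string;
  email?: string; // optional fields are declared, not discovered in production
}

function greet(user: User): string {
  return `Hello, ${user.name}`;
}
```

Each individual change is this mechanical, which is exactly why it's checkable: the type checker and linter verify the agent's work for you.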
Writing bad code with AI is a choice
Only if you know what you’re doing, both with code and with the agent. The truth is that agents are a huge enabler when you barely know what you’re doing.
@simonw In the referenced Compound Engineering post, it's mentioned that you should ask the AI to retrospective and document. While it's easy to see the short-term payoff of having the AI write docs it can pull up later, I worry about the longer-term state where documentation also needs continual maintenance (a la wiki gardening), and it's not obvious to me that a continuously expanding cloud of LLM slop documents is better than a high-fidelity historical record that naturally "forgets" things as they become older. Do you have any opinion on this?
Yeah, I'm not yet convinced by the idea of having the LLMs entirely responsible for their own documentation. I worry about the quality of that slipping over time in a way that neither LLM nor human notices - plus as new models come out notes that used to work well might start to produce worse results than if those notes were not there at all.
I like to think about shipping better code in terms of technical debt. We take on technical debt as the result of trade-offs: doing things "the right way" would take too long, so we work within the time constraints we are under and cross our fingers that our project will survive long enough to pay down the debt later on. ... In my experience, a common category of technical debt fixes is changes that are simple but time-consuming.
I want to call out that the reason "the right way" is time consuming is not because of the time it takes to generate the code. It's because managing the risk involved requires building scaffolding between the old and the new way, ensuring adequate test coverage and monitoring of that scaffolding, confirming through human code review that the outcomes meet our expectations, and then dealing with multiple deploys to cross the gap.
AIs might be able to help with the scaffolding and test coverage, but again generating those things is not the time consuming part. In all the companies I've worked at, refactoring takes time because you can't yeet code into production as fast as you can write it (and with good reason).
Any software development task comes with a wealth of options for approaching the problem. Some of the most significant technical debt comes from making poor choices at the planning step - missing out on an obvious simple solution, or picking a technology that later turns out not to be exactly the right fit.
I agree that most technical debt stems from poor choices during system design, but I disagree that it's due to missing out on obvious solutions or picking the wrong technology. No solution is inherently obvious. Nor do I see any supporting evidence in the post about why those are common sources of technical debt. They seem like cherry-picked problems to advance the argument about AI support.
With both of those points called out, I'm not seeing much of an argument that AI "should" improve code quality. As another comment pointed out, "Our assumptions have to work for the mediocre person."