Verifiability is the Limit
6 points by hwayne
My read of this is that high-level specification is an absolutely integral piece of the puzzle if we really want to rely on an order of magnitude more generated code. We do have a moderately similar frame of reference: the jump from writing assembly to high-level languages. When that happened, it did hinge on having a precise specification: the high-level program.
The difference between then and now is that the distance between a high-level language and assembly is not that large. The distance between a short, informal prompt and a working full-stack application is comparatively much greater: it feels at least an order of magnitude larger. But is a prompt a specification? It doesn’t feel like one.
So it certainly seems that what’s missing before we can trust these tools (i.e. before we have verified their output) is a specification language, along with automated checking of the output against that specification.
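To make that concrete, here’s a minimal, hypothetical sketch of what I mean, in Python: the “specification” is a handful of executable properties, and the “automated checking” is just running them against whatever implementation gets generated.

    # Hypothetical sketch: an executable spec for a dedupe(xs) function,
    # checked on random inputs against any candidate implementation.
    import random

    def check_dedupe(candidate, trials=1000):
        for _ in range(trials):
            xs = [random.randint(0, 9) for _ in range(random.randint(0, 20))]
            out = candidate(xs)
            assert len(out) == len(set(out)), "output has no duplicates"
            assert set(out) == set(xs), "output has exactly the input's elements"
            it = iter(xs)  # subsequence check: output preserves input order
            assert all(x in it for x in out), "output preserves input order"

    check_dedupe(lambda xs: list(dict.fromkeys(xs)))  # e.g. a generated implementation

Writing that for a toy function is easy; writing the equivalent for a full-stack application is exactly where the order-of-magnitude gap sits.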
I think the key thrust of this argument is misplaced. Verifiability is not the limit: it’s quite possible that an LLM might soon be better at spotting problems and verifying solutions than a well-trained human. It might already be true, at least for some domains.
Accountability is the limit.
When you’re hired to write high-assurance code, you’re being hired not just because you’re a root of verification, but because you’re a root of accountability. When something goes wrong, you can be dragged over the coals for it. When someone dies due to code you wrote or checked, it’s you that ends up being called as a witness at trial. LLMs will never be in a position to replace a human in this regard, because they do not operate under the same umbrella of accountability that applies to a human. They cannot be fired, sanctioned, arrested, imprisoned, bankrupted, made to pay a fine, or otherwise legally constrained any more than a hammer or a wrench. Without that accountability, they will always sit in the domain of ‘tools’ and not ‘authors’, no matter how capable they become in practice.
This is such an important argument! I say things like this pretty often, but I need to step it up. Too many of us clearly don’t think in these terms, and that’s a serious social problem with computing.
This is part of the problem with “software engineering” not being professionalized. Doctors, lawyers, bankers, professors, and “real” engineers all have extensive social systems of regulation. It’s probably still premature for our field, and I don’t want to claim it’s inevitable, but I do think we need to make some progress down that road.
I think the thing that most convinced me to shift my thinking on this was the Horizon Post Office scandal here in the UK. It suddenly became extremely clear to a lot of people that seemingly innocuous bugs in software have the capacity to ruin lives.
Traceability and accountability are so important, and they will only become more important as we ditch paper backups and place more and more faith in automated systems. LLMs feel like a massive step backward in that regard: they are explicitly intended as obfuscation machines, distributing accountability across the work of millions of developers who didn’t even realise they were contributing to critical infrastructure by virtue of their code appearing on GitHub. Their lack of traceability is a feature, not a bug, at least to those peddling them.
I now believe, although tentatively, that LLMs can indeed succeed in all domains with perfect oracles.
I’m no fan of this technology, and eschew it for ethical reasons as well as (broadly) what you might call “health and safety”. I have quibbles with the quality of the generated code I’ve seen, but I don’t have any good reason to think it won’t continue to improve, at least in domains with large amounts of high-quality training data. But I think I’ve believed this proposition for a couple of years now. My experience with formal software verification (and more broadly with assurance methodologies) makes me think that perfect oracles are exceedingly rare. Like, if you think you have one, you’re probably fooling yourself.
I’m curious about progress in formally verified mathematics, which is still far from mainstream mathematical practice. That’s about as close as I know of to a domain with perfect oracles. Have LLMs proved any new theorems? Have they improved on existing proofs? I’ve fallen out of touch with that world, but I think I would have noticed if there were any big results along those lines.
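(To make “perfect oracle” concrete: in that domain the oracle is just the proof checker. A minimal, hypothetical Lean 4 sketch: however a proof was produced, by a human or by an LLM, if the kernel accepts it, the theorem is proved.)

    -- Hypothetical sketch: the Lean kernel acts as the oracle.
    -- If this file compiles, the stated theorem is proved.
    theorem add_comm' (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b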
For a summary of LLM progress in formalised mathematics, this is a good read: https://xenaproject.wordpress.com/2025/03/16/think-of-a-number-an-update/
Quick search yields… nothing, but maybe I’m using the wrong terms. But somewhat tangential to the LLM question, here’s Stephen Wolfram (warning! caution advised!), recently: https://writings.stephenwolfram.com/2025/01/who-can-understand-the-proof-a-window-on-formalized-mathematics/
I like this article, and I think it adds a useful angle, but I just don’t understand the argument that verification can’t be pushed onto LLMs.
If I ask an LLM how to test a given piece of code, it will give me suggestions, and it can implement them as unit tests (or property-based tests). It can also generate clarifying questions about how an informal specification should behave, the way a human would.
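For instance (a hypothetical sketch, assuming Python and the hypothesis library), the suggestions tend to look something like this:

    # Hypothetical sketch of LLM-suggested tests for a sort_users(users) helper:
    # one example-based unit test, one property-based test. Run with pytest.
    from hypothesis import given, strategies as st

    def sort_users(users):  # stand-in for the code under test
        return sorted(users, key=lambda u: u["name"])

    def test_sorts_by_name():  # example-based
        users = [{"name": "bob"}, {"name": "alice"}]
        assert [u["name"] for u in sort_users(users)] == ["alice", "bob"]

    @given(st.lists(st.fixed_dictionaries({"name": st.text()})))
    def test_permutation_and_ordered(users):  # property-based
        out = sort_users(users)
        assert sorted(u["name"] for u in out) == sorted(u["name"] for u in users)
        assert all(out[i]["name"] <= out[i + 1]["name"] for i in range(len(out) - 1))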
Now, I don’t trust those tests: they’re not up to my standards yet. But the code the LLM generates also isn’t up to my standards (in my experience; I’m not using the absolute best models, so I may be behind). What I don’t understand is why, if LLMs continue to improve, they wouldn’t get better at both code and verification.
I suppose I share the hunch that verification is harder, but I don’t think this article gave a fundamental reason why verification must lag indefinitely far behind code generation.