When AI Writes the World's Software, Who Verifies It?
20 points by regulator
This feels AI generated, in part because it makes a mistake I'd expect from an AI but not from someone as accomplished in FM as Leonardo de Moura:
Once verified components are cheap, you compose them with confidence. Today, integration is where most bugs live: the boundaries between components, the assumptions that do not quite match. Integration testing exists precisely because component tests do not compose: passing unit tests on two components tells you nothing about whether they work together. Verified interfaces are fundamentally different. When each component carries a proof against a shared specification, the composition is guaranteed correct by construction. You can verify each piece independently and know the whole is sound.
This is actually one of the most common ways formally verified code fails in production! You can have two proven-correct pieces of code that each fully satisfy their own local properties, but interact in a way that violates a global property. Pamela Zave has a bunch of examples here.
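A minimal sketch of that failure mode (all names and numbers invented for illustration): two components, each correct against its own local spec, whose composition still violates the global property, because the two local specs disagree about the meaning of the shared interface.

```python
# Each function below satisfies its OWN local specification.
# The bug lives only in the composition: the interface value is
# produced in meters but consumed as feet.

FT_PER_M = 3.28084

def altitude_sensor() -> float:
    """Local spec (satisfied): returns the current altitude in METERS."""
    return 2000.0

def should_deploy(altitude_ft: float) -> bool:
    """Local spec (satisfied): deploy iff the altitude in FEET is below 5000."""
    return altitude_ft < 5000.0

# Global property we actually want: deploy only below 5000 ft (~1524 m).
reading_m = altitude_sensor()        # 2000 m, i.e. about 6562 ft: too high
deploys = should_deploy(reading_m)   # but 2000 is interpreted as feet here
print(deploys)                       # True: the global property is violated
```

Both pieces would pass verification against their local contracts; only a spec that pins down the unit of the shared interface catches this.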
I suspect this was written by AI. It feels to me that it was generated and then modified. Good spot on the error.
I wonder if he's talking about some kind of shared global specification that each component references / can reference. He says "When each component carries a proof against a shared specification..."
I can't figure out what exactly he's referencing here, but the wording sounds to me like there's something else other than verifying components in isolation.
“See, that’s the clever part. Come winter, the gorillas freeze to death!”
I've regularly seen people talk about formally verifying the output of LLMs, and I've seen a few people in my circles start learning Lean for this purpose, but I don't have a good perspective on this. Do any of you Lobsters have insight or experience with this?
From vibed TLA+ specs I've seen, vibecoders are really bad at confirming that the theorems they prove are actually the theorems that correspond to the behaviors they need. They'll have a spec with fifty predicates, and one predicate will be something like IsValid = P || Q, and the theorems they check will be things like !P && !Q => !IsValid. The only people I've seen dodge this trap are people who already know how to write TLA+.
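The trap can be shown in a few lines of Lean (hypothetical names, standing in for the vibed spec's predicates): the checked theorem is just the contrapositive of the predicate's own definition, so it holds no matter what `P` and `Q` mean, and proving it gives no evidence that the predicate matches the behavior the system actually needs.

```lean
-- Hypothetical stand-in for the spec's IsValid = P || Q.
def IsValid (P Q : Prop) : Prop := P ∨ Q

-- The "theorem" !P && !Q => !IsValid follows from the definition
-- alone; it is true for ANY choice of P and Q, so it cannot
-- distinguish a correct validity predicate from a wrong one.
theorem trivially_true (P Q : Prop) : ¬P ∧ ¬Q → ¬IsValid P Q :=
  fun ⟨hp, hq⟩ h => h.elim hp hq
```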
Yep, my first foray into getting the agent to do it basically bottomed out in “I have no idea what I’m looking at,” with the agent ultimately using TLA+ to tell me about the problems I’d already told it about.
I’ve been circling this and poking at it a little. In short, there’s no really good way to verify that traditional code meets a model or spec; you have to have the code written in a language that supports that (current state of the art, could change). Hence why Lean and Dafny look sexy right now. Also Rust.
Writing proofs is hard; verifying them is computationally expensive.
I love Lean, and maybe it's just me, but I find refinement types (like Dafny) or some hybrid (like F*) better suited for large-scale formal development. I like the split between code and contracts.
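In that spirit, here is a minimal Python sketch (invented example) of keeping the contract separate from the body, loosely mimicking Dafny's `requires`/`ensures` split with runtime assertions:

```python
# The contract (pre/postconditions) is stated apart from the algorithm,
# in the style of Dafny's `requires` / `ensures` clauses. Here the
# contract is only checked at runtime, not proven statically.

def int_sqrt(n: int) -> int:
    # requires: n >= 0
    assert n >= 0, "precondition violated: n >= 0"
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    # ensures: r*r <= n < (r+1)*(r+1)
    assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
    return r

print(int_sqrt(10))  # 3
```

A refinement-type checker discharges those assertions at compile time; the appeal is that the code stays ordinary while the contract carries the specification.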
I know some people don’t like AI.
But code written by humans contains bugs as well; there are probably fewer bugs if we use AI.
I think the solution for AI-generated code is a stronger testing process. It cannot solve all problems, but it eliminates a lot of bugs.