Lean proved this program was correct; then I found a bug
37 points by kirancodes
This is genuinely one of the most memory-safe codebases I've analyzed.
🙄 Yeah, sure Slopus. Like you could build memories outside of your context.
Anyway...
I'm not sure what the takeaway here is supposed to be. Someone built a formally verified implementation of a relatively tiny piece of algorithmic functionality (zlib), and then fuzzing turned out to be a more robust way of catching actual bugs, yet the author goes on to call it: "positive result here is actually the remarkable one" and "overall verification resulted in a remarkably robust and rigorous codebase".
"Remarkably robust" compared to what exactly? Even such a trivialized and idealized case shows how formal verification fails to be applied "to the whole thing", where the "whole thing" is remarkably small.
AFAICT, it's just another case of formal verification enthusiasts being excited about it, while it doesn't really translate well to the practical software realities on the ground.
Just to be clear - I like the idea of formal verification; when it's feasible it's great, and it's also great if it's becoming feasible in more places. But you've got to be pragmatic and realistic.
"Remarkably robust" compared to what exactly?
compared to other unverified implementations of zlib?
then fuzzing turned out to be a more robust way of catching actual bugs
You seem to have misunderstood the post. The fuzzing campaign failed to produce any memory vulnerabilities in the application code, and the only place an error was surfaced was in the lean runtime itself, demonstrating that the verification was effective in reducing bugs.
compared to other unverified implementations of zlib?
Which one exactly? Some idealized buggy one? :D If I were to implement zlib in Rust, then fuzz it overnight, I wonder how well it would hold up?
"Remarkably robust" sounds like either writing memory-safe zlib is really, really hard and rarely achieved, or like a formally verified implementation not having many bugs is somehow actually surprising.
demonstrating that the verification was effective in reducing bugs.
It had better. Decades of research, all the extra hassle. Even as someone relatively pragmatic and skeptical, I have no doubt that it reduces bugs, and by a lot.
I guess we're in agreement here.
Some idealized buggy one? :D If I were to implement zlib in Rust, then fuzz it overnight, I wonder how well it would hold up?
Yeah, the point is that the zlib implementation was vibe-coded over the last month. If you vibe-coded a zlib in Rust last month, I would expect it to not hold up very well, and would expect to find at least one memory vulnerability if I really tried.
It had better. Decades of research, all the extra hassle. Even as someone relatively pragmatic and skeptical, I have no doubt that it reduces bugs, and by a lot.
Sure, I'm in agreement, but to the wider public the guarantees of formal verification are much less accepted. My experience with talking to non-experts is that most people would not assume this outcome as pre-determined, hence why it's emphasised at the end.
If you vibe-coded a zlib in Rust last month, I would expect it to not hold up very well, and would expect to find at least one memory vulnerability if I really tried.
If it wasn't using unsafe, it technically couldn't have memory unsafety issues. Though it could still have panics. My bet would be that vibe-coded no-unsafe Rust + fuzzing would produce a perfectly working implementation, though. You're tempting me to try. :D I'm not even sure how fair that would be, because I bet Slopus was trained on a bunch of zlib implementations, so it probably isn't even working things out, just reciting from memory. Fuzzing would do all the heavy lifting anyway.
My experience with talking to non-experts is that most people would not assume this outcome as pre-determined
People don't trust maths? My problem with formal verification has always been that making a small piece of self-contained code perfectly correct is cute and even very useful sometimes (e.g. cryptography), but the reality of SWE is that codebases are vast, always changing on short notice according to business whims, often full of unreviewed 3rd-party code, and in general everything is on fire all the time. So if something was e.g. tested with a fuzzer, that's good enough "proof" for me, and faster and easier to get done.
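For concreteness, here's a minimal sketch of that "fuzzer as good-enough proof" workflow, using Python's stdlib `zlib` purely as a stand-in for whatever codec is under test (the function name and loop shape are made up for illustration):

```python
import random
import zlib

# Toy fuzz loop: throw random byte blobs at the decompressor and check
# that it either succeeds or fails with a clean, expected error --
# never anything worse (crashes, hangs, silent corruption).
def fuzz_decompressor(iterations=10_000, seed=0):
    rng = random.Random(seed)  # seeded, so failures are reproducible
    for _ in range(iterations):
        blob = rng.randbytes(rng.randint(0, 256))
        try:
            zlib.decompress(blob)
        except zlib.error:
            pass  # malformed input rejected cleanly: expected
    return True  # survived every input without an unexpected failure

fuzz_decompressor()
```

Real fuzzers (AFL, libFuzzer) add coverage guidance and mutation on top of this, but the basic contract being checked is the same.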
I just realized you posted that it's a "0-day in lean" on X. How does that work? Would code verified in Lean actually execute this runtime when running in production, etc.? That would be ironic: in the process of achieving formal verification, the code picks up a 0-day from the runtime it had to add to get it done.
I had this exact same idea when I read about Claude Mythos recently. Formal verification from the start seems to be the only reasonable way to survive a world in which security vulnerabilities are found at a rate that is orders of magnitude greater than the rate at which people can fix them.
Formal verification of what, though? If you specify every single property a system must uphold, you have created the system itself (see Borges and many others). So you have to pick and choose high value properties. That requires judgment. (For example, if you've specified properties, what does formal verification get you over property-based testing? PBT has many advantages over theorem provers and model checkers, and gets you 99% of the way there in most cases.)
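To illustrate the PBT point: a high-value property like the compression roundtrip can be checked with a few lines of hand-rolled property-based testing, here against Python's stdlib `zlib` rather than any particular verified implementation (no Hypothesis or proptest, just a seeded generator):

```python
import random
import zlib

# Minimal property-based test: for arbitrary generated inputs, the
# single high-value property decompress(compress(data)) == data
# must hold. No proof, but cheap and it exercises real code.
def check_roundtrip(cases=1_000, seed=42):
    rng = random.Random(seed)
    for _ in range(cases):
        data = rng.randbytes(rng.randint(0, 4096))
        assert zlib.decompress(zlib.compress(data)) == data
    return True

check_roundtrip()
```

The trade-off is exactly the one debated above: the theorem prover covers all inputs, this covers only the sampled ones, but the sampled version took minutes to write.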
Mind that lean-zip is possible because its spec (or at least, an important part of its spec) can be very concisely written by saying that decompress (compress data) == data. (The other part of the spec, not displayed in the post and I don't know if lean-zip actually proves it, is that the compressed format is compatible with traditional zlib.)
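In Lean terms, the core of such a spec really is about one line - sketched here with stand-in names (`compress`/`decompress` are not lean-zip's actual API, and the proof body is elided):

```lean
-- Hypothetical signatures, for illustration only:
axiom compress : ByteArray → ByteArray
axiom decompress : ByteArray → ByteArray

-- The whole functional-correctness spec in one statement:
theorem roundtrip (data : ByteArray) :
    decompress (compress data) = data := by
  sorry -- the actual proof is where all the work goes
```

The concision of the statement is the point; it says nothing about the (much larger) effort of discharging the `sorry`, nor about format compatibility with traditional zlib.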
Most software has a sprawling, hugely complicated spec, where writing down the spec itself is more work than the software is worth. What you can do is verify a few properties of the system, perhaps, but then that is also much more complicated than zlib because of things like dependence on external services, interaction of multiple features, statefulness, and sheer code size.
For all software? There is a reason it is rarely applied outside of specific areas like aerospace and hardware design. It is very expensive (in multiple senses of the word) and I am not quite sure how you would scale that up to a much broader scale. At least not without involving LLMs, which brings with it another set of problems and questions.
Also, formal verification isn't a silver bullet, as it is also very difficult to do right.
Yes, for all software. I know it's impossible, but it seems like it would be the only way to deal with the grim future that seems to be coming, which is why I'm feeling kinda hopeless for computer security in general in the coming years.
I guess that reinforces my general feeling that we should have less, simpler, and more robust software, ala permacomputing. But we live in a world where basically everything runs in computers, we need absurdly complex computer systems to make civilization run properly, and the culture for decades that created most of these systems valued delivering products more than having robust software.
I think full mathematical verification is too heavyweight for this at the moment. I've been exploring automatic deterministic simulation testing as an alternative to full mathematical verification for my language instead and so far I think it strikes a nice balance. I always think of that Knuth quote: "Beware of bugs in the above code; I have only proved it correct, not tried it."
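A toy illustration of the deterministic-simulation idea (not the commenter's actual tooling - the bounded queue and all names here are invented for the sketch): drive the system under test and a trivially-correct model with the same seeded event stream, so any divergence is reproducible from its seed alone.

```python
import random

class BoundedQueue:
    """Stand-in 'system under test': a FIFO queue with a capacity."""
    def __init__(self, cap):
        self.cap, self.items = cap, []
    def push(self, x):
        if len(self.items) < self.cap:
            self.items.append(x)
            return True
        return False
    def pop(self):
        return self.items.pop(0) if self.items else None

def simulate(seed, steps=500, cap=8):
    # All nondeterminism flows through one seeded RNG, so a failing
    # run is fully reproducible by re-running with the same seed.
    rng = random.Random(seed)
    sut, model = BoundedQueue(cap), []
    for _ in range(steps):
        if rng.random() < 0.6:
            x = rng.randint(0, 99)
            ok = sut.push(x)
            assert ok == (len(model) < cap), f"push divergence, seed={seed}"
            if ok:
                model.append(x)
        else:
            got = sut.pop()
            want = model.pop(0) if model else None
            assert got == want, f"pop divergence, seed={seed}"
    return True

# Every seed is a self-contained, replayable test case:
all(simulate(seed) for seed in range(20))
```

Systems like FoundationDB's simulation testing apply the same principle to whole distributed clusters, with simulated networks and clocks standing in for the real ones.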
I'm woefully clueless about the state of Lean, but I was surprised to learn that Lean was not used to prove its own TCB correct! Seems like sort of a built-in example of the limitations of formal verification.
Proving a theorem prover's TCB correct is a notoriously difficult problem (and impossible in the most general sense, thanks to Gödel) - at best you can prove that it is consistent w.r.t. itself. Mario Carneiro has an effort towards this (https://arxiv.org/abs/2403.14064), but it is very much in the preliminary stages.
I mean, sure, in theory I see why that's true. But then there's "my runtime has a buffer overflow in the array allocator", which seems like exactly what formal verification should be verifying.