From Intent to Proof: Dafny Verification for Web Apps
17 points by nextos
17 points by nextos
I’m really happy that other people who know what they’re doing (or appear to) are thinking along these lines.
Even to the extent that we can LLMs to transform natural language requirements into specifications, as a user I think I need to understand a bunch of stuff about the formal methods in play. Which I don’t. I suspect that reviewing changes to specifications is going to be faster than reviewing code changes if you know what you’re looking at with the specs, and with libraries of specs things might get very productive.
Thanks for the comment. Not a lot of people are familiar with formal methods. The cool thing about this is that you hang out in natural language spec-land and the LLM structures your requirements as formal specs. If you want to review the formal spec, it's actually not too hard to understand because it's separate from the proofs. Eventually, for some use cases, we'd like to try to hide the dafny stuff entirely and only surface specs and UI code.
Thanks for posting this! I am one of the authors of this project. Happy to answer any questions!
It's very interesting work @namin & @heyyfernanda, this is why I posted your article. I was toying with a similar but much less polished prototype last year, hence this really caught my attention.
I come from the tiny formal methods community in the EU and I also think that, just as stated in your conclusion, there might be a paradigm shift coming. Formal methods have not gained wide adoption, except in some critical domains, due to cost. LLMs could change this and, simultaneously, benefit from some correctness guarantees.
I am particularly interested in hearing your thoughts about the specification loop depicted in your first figure. How do you plan to evolve this loop as you scale to larger applications? I think this particular area could become of major importance in future development workflows.
Probably, there are lots of lessons to be borrowed from the formal development methods that were created in the 1990s, when this area was quite active. VDM or RAISE, just to name two of them, put a lot of thought on this problem, i.e. how to iteratively refine an abstract specification while preserving correctness guarantees.
Thanks for the kind words!
In terms of loops, at the moment, there are really multiple loops going on: between intent and spec, between implementation and verification, and these two feeding back on each other. I think the hard part of the specification loop is really capturing the intent without inadvertently introducing spec bugs that have bad consequences downstream. The specs are often concise. But I’ve been down rabbit holes where the logic in the spec was wrong, and I ended up proving something useless that wasn’t working well when tested interactively. On the other hand, the specs are much more inspectable than full-blown implementation code, and this is their value. I think there could be a lot of ways to strengthen the specification loop: for example, there could be intent debugging tools that show the consequences of a spec – e.g., did you mean this to be true of your system, etc.?
True, there are lots of interesting problems ahead in terms of verifying well-formedness, satisfiability, and refinement of specifications. Once specifications become a bit complex, at least in my experience, it is easy to make subtle mistakes that effectively turn the whole thing into e.g. something unsatisfiable.
Ah, indeed -- unsatisfiability lurking is a good point. That hasn't been an issue yet for the scales of apps and correctness properties we've tried in this proof of concept so far.
This is great! I have been thinking about this a lot, lately. It really feels like the kernel idea (I’ve been thinking of this like integrated circuits, which are tested and proven, etc… but I’ve never been able to argue how this is different from libm.so …), is the right model, and then LLMs plug well known kernels together, testing the boundaries with integration tests provided by the developer.
I do fear that formal methods are too abstract for the glue (but, the “kernel” makes a ton of sense), and we need something more like gerhkin / cucumber as the high level verification compiler… which is unfortunately not a strong guarantee.
But because we can’t generate the kernels from specifications, generally, have you considered contracts + property based testing + 100% coverage as an alternative?
The Kernel’s interface must be “proven”, and this alternative is a bit of a compromise, but probably generally doable today with LLMs?
Thanks :) Yes, property-based testing makes a lot of sense and is usually a faster way to get some assurance, which can then at some point (maybe later in development?) be complemented by verification (Dafny also uses contracts). At the end, I’d still prefer verification to make sure all the cases are covered. I am also betting that verification can guide development of the specs and implementation by providing rationale for issues and repairs.
If the road to verification, is paved with contracts, you might as well leverage them with practical tools (fuzzing, PBT) until formal verification can more practically be applied to generate target code (which will be tied to the Dafny runtime, if I understand correctly?). Can you generate Dafny + the target language and sample the model states from Dafny’s verifier to ensure no contracts are violated in the target? (Not exactly sure how Dafny’s verifier works)
Dafny is now higher on my list of things to explore—any knowledge I have right now is from skimming through the manual.
I know incremental progress like this is typically not the way research happens—figure out, instead, if the ideal state is possible (which you’ve done!) and work backwards from there. But the practicality of a weaker mathematically, but foundationally improvable with incremental advances, seems like a very strong position to be in with Lemon (LLM) based development.
Yes, I agree that contracts can be leveraged with both testing and verification tools, and even in our setup we'd want to quickly find counterexamples as much as we'd want to slowly find proofs.
Yes, currently, the setup ties the target to a runtime of Dafny data structures.
Fair. I do think there’s a practical future for verification, especially since, even now, tools like Claude Code can handle many of the proofs.
Fair. I do think there’s a practical future for verification, especially since, even now, tools like Claude Code can handle many of the proofs.
Have you tried to use Claude to generate code from a Dafny spec instead of using Dafny’s translation? It might be interesting to see how close it can get to the verified translation. Of course, it wouldn’t be rooted in Dafny’s verification so it’d lack the definitive verification you seek for kernels. But the exercise might reveal some new thought / consideration / pathway.
I've also been experimented with using LLMs backed by very strong assurances. In my case by simulation testing. Ive built a toy system that is fully simulable (a more adhoc antithesis kind of thing) and have been modifying it with prompting adking the LLM to run a lot of simulations as verification. I set a series of properties, create a new spec and let the LLM do its job on that foundation. the results are fairly nice. Having the Llm work over a system that is built around simulation testing is also quite nice dince there is a clear separation of side effects and logic already.
Cool. Coverage with testing is certainly better than no testing. I've run into two pitfalls with testing in an LLM setting: 1) if the LLM is being tasked with doing the testing, it will often modify the test to pass the code rather than modify the code to fit the test. This is of course where we also have to be careful with independent spec modification (ie, changing the spec to complete the proof). 2) Test coverage can never be exhaustive. You can catch many bugs, but the point of proof is that the guarantee is 100% - full trust.
That's why I went with deterministic simulation testing, it's kind of a whole-system property based testing where you test properties of your whole system as it interacts. I added a way to add temporal properties (similar to TLA+). There's no way to cheat that, or at least it is a lot easier to catch in code review. I can write the properties myself and let it sort itself out. It is heavyweight to setup but fairly easy to extend afterwards.
That’s neat. So is it an exhaustive technique (although finite like model checking)? Or does the simulation coverage give high enough confidence that it suffices for your domain?
It's not exhaustive, since the whole state is a too large (several processes interacting with each other, network in the middle, etc etc), it is more similar to fuzzing and property-based testing, where with randomization and some way to find "interesting" states and explore them and by throwing CPU time at it you can be reasonably certain that most issues are gone. The idea is to speed up the amount of states the system is in and to have a way to observe the state holistically and check properties.
Here are some links around this idea:
In my case, to reduce the amount of work dedicated at simulating what I did was:
With this I can spawn arbitrary Rust async processes and write some temporal properties and it tests for a long time with many different orderings to check if those properties hold. If they don't it gives me a report of the order of operations needed to get to that point.