Building a React App with Formally Verified State
16 points by heyyfernanda
First of all, generically, this is awesome. I am going to upvote any formal verification of a “real” app, which excites me way more than simple examples.
Specifically, I like the approach of compiling the verified “kernel” and wrapping it in a small binding layer to make it callable from the application. I’ve played around a lot with architectures for verification and this is what I end up with often.
I like the example as well because the invariant is interesting enough and shows the power of invariants: preserving the moods and harmonies is nontrivial, but defining their validity is straightforward. This is exactly where invariants are most useful, because they're decently simple to write down, and they catch a practically infinite number of bugs for you.
Next, let’s verify an app with network requests!!
Collaborator of @heyyfernanda here. Thanks for the kind words!
We do have a multi-collaboration kernel that checks reconciliation of actions across one server and possibly offline clients. https://github.com/metareflection/dafny-replay?tab=readme-ov-file#the-multi-collaboration-kernel
I also integrated that multi-collaboration kernel with Supabase in an ad hoc way at the moment, so that the kernel takes care of the project model and server reconciliation, and Supabase handles user management, database persistence, and realtime.
https://github.com/metareflection/dafny-replay/tree/main/kanban-supabase
This is interesting to me because it presents evidence that better formal methods can result in better code generation (using LLMs).
I had some questions after reading your post and glancing through the project and the replay blog post. Please forgive my inexperience with Dafny.
Preliminaries

First, let me make sure that I understood the basic process and the framework you're using, since the rest of my questions use this assumption.
What you call the "Domain" is what updates the app's state (the "Model") based on (user) inputs ("Actions"). And what you refer to as the "Spec" is an invariant ("Inv") which basically holds at every point in time (though after an Action is applied the Model may first need to be normalized for it to hold).
This is then compiled to JS and exposed as an API which you can hook into with the UI code.
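For concreteness, here's roughly the shape I have in mind as a Dafny sketch (all names here are hypothetical, not the actual ColorWheelSpec.dfy):

    // Illustrative sketch only; names are hypothetical, not ColorWheelSpec.dfy.
    datatype Mode = Coordinated | Custom
    datatype Model = Model(palette: seq<nat>, mode: Mode)
    datatype Action = Randomize(seed: nat) | AdjustColor(i: nat, hue: nat)

    // "Inv": what it means for a Model to be valid.
    predicate Inv(m: Model) {
      |m.palette| == 5 &&
      (forall i :: 0 <= i < |m.palette| ==> m.palette[i] < 360)
    }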
My questions

Onto my questions: from what I gather, for the Domain your job as the human was to specify three things: the Model, the Actions, and the Inv that must be true for the Model. Claude then handled everything else (which seems to mostly be the proofs). Is this correct?
If you had to specify these for a different setting, do you think you would be able to do so without Claude's help? How much longer do you think it would take you?
The errors you report seem like they all happened at the boundary where Claude integrated the actual UI with the Domain compiled from Dafny. You wrote that these weren't silent errors, but I'm curious whether this would always be true. My understanding from what I read elsewhere (I didn't check the generated JS because it was a large file) is that you could for example pass to an action randomSeeds which are not in the expected range (or an array of the wrong length); would this still fail "loudly"? Either way, do you have any ideas on how this boundary could be better enforced? For example, by lifting preconditions into assertions in the emitted JS (I understand there may be efficiency concerns here, but I'm more interested in correctness).
Finally, I have a question regarding the design of the Domain. Why is the Normalize function necessary? To me it feels like Inv should always be true of the Model and Apply should preserve this. Maybe this example was too simple to illustrate such a case where the model may be in an "invalid" state that needs to be normalized first.
First off, thank you for your thoughtful comments and questions. Your assumptions are all correct, so I will dive into answering your questions.
My job as the human was to specify everything you find in the spec (ColorWheelSpec.dfy) but I did it entirely with natural language. The three things you mentioned are in the spec: Model, Actions, and Inv. I wrote zero Dafny code. I did glance through the Dafny spec file for a sanity check. Once the spec was verified, Claude handled writing the Domain file (ColorWheelDomain.dfy - which is required when using the Replay kernel) and all of the proofs.
I personally would not be able to do any of this without Claude. A verification expert would be able to write the spec and domain (and maybe the proofs), but it would take a long time and would be entirely superfluous for something like a trivial color generator. The point is that we want proofs (and ideally the rest of the verified code) to be automated, so that verification is just part of the engine.
Regarding errors: the story at the boundaries is still unclear. To use your example, in this app, if the randomSeed is bad, the Apply is a no-op, and the invariant is preserved. We could handle this in different ways to avoid a no-op, like specifying fallbacks or adding runtime errors. For now, we are focused on ensuring that the contained logic written in Dafny is sound, but, yes, there are snags to be addressed at the integration point. To your point about lifting preconditions into assertions: we can make AppCore (the API layer) very defensive.
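To make the no-op behavior concrete, here's a sketch reusing the illustrative names from your comment above (hypothetical, not our actual ColorWheelDomain.dfy):

    // Hypothetical sketch, reusing the illustrative Model/Action/Inv from upthread.
    function GeneratePalette(seed: nat): seq<nat>  // real generation logic elided
      ensures |GeneratePalette(seed)| == 5
      ensures forall i :: 0 <= i < 5 ==> GeneratePalette(seed)[i] < 360

    function Apply(m: Model, a: Action): Model
      ensures Inv(m) ==> Inv(Apply(m, a))
    {
      match a
      case Randomize(seed) =>
        if seed < 0x1_0000_0000                  // e.g. only 32-bit seeds are valid
        then Model(GeneratePalette(seed), Coordinated)
        else m                                   // out-of-range seed: explicit no-op
      case AdjustColor(i, hue) =>
        if i < |m.palette| && hue < 360
        then Model(m.palette[i := hue], Custom)  // direct edit drops to Custom mode
        else m                                   // invalid index or hue: no-op
    }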
Regarding Normalize: this goes back to avoiding no-ops and specifying fallbacks. The Normalize function is technically not necessary and could be fused with Apply, but that would add complexity. We use Normalize to allow custom actions by the user that do violate the invariant but are in line with their intentions. For example, if they generate a coordinated palette and then they change a single color in the palette, that palette falls back to custom mode.
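In sketch form (again with hypothetical names and an elided harmony predicate), the fallback looks like:

    // Hypothetical: the invariant says coordinated palettes must be harmonious.
    predicate Harmonious(p: seq<nat>)  // harmony rules elided in this sketch

    predicate ModeInv(m: Model) {
      m.mode == Coordinated ==> Harmonious(m.palette)
    }

    function Normalize(m: Model): Model
      ensures ModeInv(Normalize(m))
    {
      if m.mode == Coordinated && !Harmonious(m.palette)
      then Model(m.palette, Custom)  // a direct edit broke the harmony: fall back
      else m
    }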
Thanks again. This gave us some food for thought.
Thank you for answering.
A verification expert would be able to write the spec and domain (and maybe the proofs), but it would take a long time and would be entirely superfluous for something like a trivial color generator.
Superfluous because Claude can reliably translate natural language to spec?
Regarding errors: To use your example, in this app, if the randomSeed is bad, the Apply is a no-op, and the invariant is preserved.
Oh, I see now. It makes sense upon reflection that you'd need to check the Action arguments for validity (and like you said, the choice to no-op could also be an error, etc.) before you handle the action. I was thinking that the Action itself had ghost predicates that ensured wellformedness (e.g. randomSeed having the right number and range of args), but instead it seems that by keeping the type without any notion of wellformedness, you must explicitly handle all possible inputs (including invalid ones) in the Dafny code.
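For contrast, the version I had imagined looks something like this (hypothetical, building on the illustrative names above):

    // Hypothetical alternative: wellformedness as a precondition rather than
    // handling invalid inputs inside Apply.
    predicate ValidAction(m: Model, a: Action) {
      match a
      case Randomize(seed) => seed < 0x1_0000_0000
      case AdjustColor(i, hue) => i < |m.palette| && hue < 360
    }

    function ApplyWf(m: Model, a: Action): Model  // body elided in this sketch
      requires ValidAction(m, a)
    // The unverified JS caller must now establish ValidAction, so the same
    // checks reappear at the boundary instead of inside the kernel.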
For example, if they generate a coordinated palette and then they change a single color in the palette, that palette falls back to custom mode.
But this logic could be folded into Apply, right? I asked this question because, like with Actions, I was wondering whether there could be a version where the model itself had the invariant Inv as a ghost parameter, which I thought might simplify the proofs (you wouldn't, e.g., need to prove separately that the initial state satisfies the invariant, or prove inductively that all states satisfy it; you would require that whenever you have an instance of the model, it satisfies Inv). And in this version you couldn't have Normalize, because Apply must always produce a valid model. Well, you could still have it, but it wouldn't be necessary.
I know I'm getting really into the weeds here (and again, I don't know Dafny well), but I'm mostly just interested in seeing how I might adapt this technique to similar formal verification tools I work with. In particular, I'm interested in knowing whether you explored the "invariant-tied-to-model" approach I alluded to above and, if so, why you opted for a different approach. This approach would be much more natural in my work.
Regarding having Apply require the invariant, that is indeed nicer, and we switched to that: https://github.com/metareflection/dafny-replay/commit/4cb0d71491f66224b44dfd385c6c0f8bde0a209e
And for compositionality, it might indeed be nicer to have Apply ensure the invariant too. To be continued...
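In signatures, the change is roughly this (illustrative, not the exact code in the commit):

    // Before: invariant preservation proved via a separate lemma.
    function Apply(m: Model, a: Action): Model
    lemma ApplyPreservesInv(m: Model, a: Action)
      requires Inv(m)
      ensures Inv(Apply(m, a))

    // After: the invariant becomes part of Apply's own contract.
    function Apply'(m: Model, a: Action): Model
      requires Inv(m)
      ensures Inv(Apply'(m, a))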
Superfluous because it’s a bit extreme to invest that much time and energy into manually writing proofs for a non-mission-critical web app. If Claude couldn’t write proofs or Dafny, we wouldn’t even be considering this use case.
Also, a clarification from @namin:
I was wondering whether there could be a version where the model itself had the invariant Inv as a ghost parameter, which I thought might simplify the proofs
That might be possible with a class invariant, but it doesn't seem possible with a datatype. The standard way for datatypes is to have a separate invariant.
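To illustrate the contrast with a toy sketch (not a recommendation for the framework):

    // Datatype: the invariant lives outside and is threaded through contracts.
    datatype M = M(x: nat)
    predicate InvM(m: M) { m.x < 360 }

    // Class: a Valid() predicate can be tied to the object itself,
    // established by the constructor and maintained by every method.
    class C {
      var x: nat
      predicate Valid() reads this { x < 360 }
      constructor () ensures Valid() { x := 0; }
      method Set(v: nat) requires v < 360 modifies this ensures Valid() {
        x := v;
      }
    }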
The weeds is a fun place to be! :) Curious to know if you adapt this to your tools.
Makes sense.
That might be possible with a class invariant, but it doesn't seem possible with a datatype. The standard way for datatypes is to have a separate invariant.
Ah OK, that makes sense (my lack of Dafny experience is showing). So if you were to change Model to be an abstract class, you could then require the Inv to exist on it, but in the present framework it needs to live in the Domain.
Curious to know if you adapt this to your tools.
I'll be in touch if so; I mostly was thinking about it from the perspective of curiosity.
Cool stuff and thanks for taking the time to answer my questions.
I’d be interested in an even lower-level explanation: how the spec was written, and where the content of the files Claude was referred to came from.
The hue slider on Adjust Palette breaks once you change a single color.
What do you mean, what breaks about it?
It no longer does anything. This is true even after randomizing the colors, which feels like it should remove all trace that a color was manually chosen.
That’s a good catch! My guess is that this is related to the Apply function defaulting to no-op when the state is unexpected. We may want to call out no-ops in the design of the system, so they are caught instead of experienced.
Fixed here: https://github.com/metareflection/dafny-replay/pull/15
The problem was that the logic for the custom case was wrong for shifting hue.
Claude was easily able to find the issue and fix it, and also add a proof that the action is now not a no-op.
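For the curious, the "not a no-op" obligation has roughly this shape, reusing the illustrative Apply from upthread (hypothetical, not our actual lemma):

    // Hypothetical shape of a "not a no-op" proof, for the illustrative Apply.
    lemma AdjustIsNotNoOp(m: Model, a: Action)
      requires Inv(m)
      requires a.AdjustColor?
      requires a.i < |m.palette| && a.hue < 360
      requires m.palette[a.i] != a.hue  // the edit actually changes something
      ensures Apply(m, a) != m
    {
      // The updated slot differs from the old one, so the palettes differ.
      assert Apply(m, a).palette[a.i] == a.hue;
      assert Apply(m, a).palette != m.palette;
    }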
I see what you mean. It looks to me like clicking an individual color sets the “adjustment mode,” and cancelling the editing doesn’t set the mode back: https://github.com/metareflection/dafny-replay/blob/98b3a31254f480add1e44b740a39e9ce1b954a6f/colorwheel/src/components/Palette.jsx#L34.
This is in the “binding” layer of the application, the thin layer that’s responsible for forwarding user input to the proper “kernel” action application. The verification approach here leaves this layer unverified, so this is just a tradeoff of the approach.
The level of effort needed to catch a bug like this greatly increases, because you basically have to verify the React code end to end. I do see the concern since we’re talking about verification here, but all verification efforts have a scope that has to end somewhere.
Per my fix above, the error was entirely in the logic in the spec. Claude was able to reason about it without thinking of the React layer.
Oh nice (I’m browsing the codebase; this looked like a likely place to me).
Shouldn’t the adjustment mode be set back after the color is done being directly edited either way though?
Thanks for digging into this, that's some top-tier code review. The mode adjustment is dead code and can be cleaned up. Looking back at my edits, the implementation was a bit clunky initially, dispatching a SetAdjustmentMode(Linked) before each AdjustColor action. It was replaced with a single AdjustPalette Action, which always uses linked behavior, making that mode obsolete. This is a good case for improving the system so that the spec and the implementation always match. Software evolution is hard :)
Thanks for this! I sent a PR with a cleanup here: https://github.com/metareflection/dafny-replay/pull/16