The Coming Need for Formal Specification
9 points by carlana
1 There is https://github.com/tlaplus/TLAiBench and https://github.com/specula-org/Specula, so far without scalability numbers etc. Proof effort scales quadratically with code size, as shown by a paper related to the seL4 project, so there will be fundamental constraints on this approach.
2 Personally, I hold the opinion that we have no excellent, flexible graphical editors to sketch and iterate on such formal specifications. On the hardware-planning side there are, for example, SystemC/TLM, SystemC/AMS, and Simulink (control); on the software side, SysML and the not-nice-to-use UML. However, something to make simple graphs and refine them, or to search existing graph structures for copy-and-paste approaches to hardware and software planning, specification, simulation, etc., does not exist.
3 The biggest drawback of TLA+ is that checking that the formal specification makes sense usually works, for humans, via tests of the running system, and I am unaware of how to lift or lower TLA+ to small-step code semantics (to execute traces etc.). In practice, for bigger systems, you want the formal specification of the system to be validated against edge cases, and to see how the running system's behavior correlates with the specification, to ensure that the spec makes sense.
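For what it's worth, the closest workaround I know of (outside TLA+ tooling proper) is to re-encode the spec's state machine in an ordinary programming language and check recorded implementation traces against it. A minimal Python sketch with an entirely made-up toy mutex spec (none of these names come from any real tool):

    from typing import Iterable

    State = dict  # e.g. {"lock": "free"} or {"lock": "held", "owner": "t1"}

    def next_states(s: State) -> list[State]:
        # Allowed transitions of the toy mutex spec.
        if s["lock"] == "free":
            return [{"lock": "held", "owner": t} for t in ("t1", "t2")]
        return [{"lock": "free"}]

    def check_trace(trace: Iterable[State]) -> bool:
        # A recorded trace is valid if every step matches some allowed transition.
        steps = list(trace)
        return all(b in next_states(a) for a, b in zip(steps, steps[1:]))

    # Example: a trace recorded from the running system.
    print(check_trace([{"lock": "free"},
                       {"lock": "held", "owner": "t1"},
                       {"lock": "free"}]))  # True

This only checks conformance of observed traces against the (re-encoded) spec; it says nothing about unexplored behaviors, which is exactly the gap described above.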
1 Citation? All I could find was this, which I believe just showcases how poorly Isabelle/HOL fares in verifying seL4 compared to other efforts. The Isabelle/HOL verification code is roughly 20:1 relative to the original C/ASM, which is at least 2x higher than all of the other comparable formalizations. I'm a big fan of Isabelle, but I'd argue it's better for math than it is for verifying imperative code. I don't believe we can point to seL4 as convincing evidence that proof effort grows quadratically.
2 As far as graphical interfaces to formal specifications go, I can imagine someone may eventually make a KiCad backend to enable analyzing the stability of a circuit in Lean, but I am not sure that mathlib has formalized the necessary mathematical machinery (i.e., the Laplace transform) to do this yet. But perhaps you had something else in mind?
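(Purely as a sketch of what that machinery amounts to: for a linear circuit with impulse response h(t), one forms the transfer function via the Laplace transform,

    H(s) = \int_0^\infty h(t) \, e^{-s t} \, dt

and stability then comes down to every pole of H(s) lying in the open left half-plane, Re s < 0. A KiCad-to-Lean backend would have to rely on that whole chain being formalized.)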
3 In Rocq/Lean/Isabelle, I can build high confidence in my specification by proving various lemmas, even if they are secondary to the central correctness theorem I am usually trying to establish. Program traces are hard, though. Sadly, while Lean does support gdb/lldb, those debuggers can't print the values of Lean variables yet, so the debugging UX isn't great. But hopefully someday this will not be the case.
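As a toy illustration of the "secondary lemmas" point, a minimal Lean 4 sketch (the definition and lemma names are made up; assumes a recent toolchain where the omega tactic is available):

    -- A toy "specification": absolute difference on Nat.
    def dist (a b : Nat) : Nat := (a - b) + (b - a)

    -- Secondary sanity lemmas: not the main theorem of interest,
    -- but they build confidence that the definition says what we think it says.
    theorem dist_self (a : Nat) : dist a a = 0 := by
      unfold dist; omega

    theorem dist_comm (a b : Nat) : dist a b = dist b a := by
      unfold dist; omega

If either lemma failed, that would be a strong hint that the specification itself is off, which is the kind of feedback tests give you in a TLA+ workflow.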
Re 1: Yep, "Empirical Study Towards a Leading Indicator for Cost of Formal Software Verification" by Matichuk et al. What do you suggest would be better solutions? Is there work indicating or showing that one can do better than "a consistent quadratic relationship between the size of the formal statement of a property, and the final size of its formal proof in the interactive theorem prover Isabelle"? Unfortunately, I don't yet understand the trade-offs between different (interactive) proof systems and what reusable parts for imperative code exist.
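Just to make that quoted relationship concrete (a back-of-the-envelope reading, not the paper's own model): if proof size grows roughly as

    p(s) \approx c \cdot s^2

for a statement of size s, then doubling the statement size roughly quadruples the proof, since p(2s) ≈ c·(2s)^2 = 4·c·s^2 = 4·p(s). That is the sense in which it looks like a fundamental constraint at scale.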
Re 2: Yes. So far we have (2.1) hard distinctions between different modeling tools, spanning hardware and software, and (2.2) no standard, as far as I am aware, for user-specified semantic graph annotations to attach data of different formats and compose different models. Think of it as a way to stitch different tools together and express inputs and outputs across different tools/models, with the eventual goal of lifting/lowering/mapping subcomponents, kind of like what SystemC/TLM adapters are doing. Does this make sense to you?
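To make 2.2 slightly more concrete, here is a purely hypothetical Python sketch (not an existing standard or tool) of what user-specified semantic graph annotations could look like:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        # Per-tool annotations in whatever format that tool uses,
        # e.g. {"simulink": {...}, "systemc_tlm": {...}, "tla": {...}}.
        annotations: dict[str, object] = field(default_factory=dict)

    @dataclass
    class Edge:
        src: str
        dst: str
        # How an output of the src model maps to an input of the dst model.
        mapping: dict[str, str] = field(default_factory=dict)

    controller = Node("controller", {"simulink": {"block": "PID"}})
    bus = Node("bus", {"systemc_tlm": {"socket": "initiator"}})
    link = Edge("controller", "bus", {"u": "payload.data"})

The point is only the shape: a shared graph whose nodes and edges carry tool-specific data, so that adapters (in the SystemC/TLM sense) can lift/lower between models.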
Re 3: Oh, wow. I was assuming those had a much better UX, including validation, recording, scheduling (of choices?), and reversible computing. I was trying, rather, to express the shortcomings of TLA+ here: a spec has no direct 1:1 relation to the code, which can lead to a wrong specification, code changes not tracking the TLA+ spec, etc. However, I clearly failed at that.
Re Re 1: According to my understanding, Coq and Isabelle are the relevant provers for formal verification of code, while Lean is not tailored to that domain and offers no tooling for it. The best comparison I've found so far between Coq and Isabelle is "Comparison of Two Theorem Provers: Isabelle/HOL and Coq" by Yushkovskiy. To me, the biggest difference between Isabelle and Coq sounds like the implementation of Coq being less trustworthy. From "Correct and Complete Type Checking and Certified Erasure for Coq, in Coq" by Sozeau et al.: "Therefore, on average, one critical bug has been found every year in Coq."
In reality, AI-written tests were one of the first things I felt comfortable delegating. Unit tests are squarely in-distribution for what the models have seen across public open source code. There are a lot of unit tests in open source code, and they follow predictable patterns. I'd expect that the variance of implementation code, and its requirement for out-of-distribution patterns, is much higher than that of test code. The result is that models are now quite good at translating English descriptions into quite crisp test cases.
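To illustrate what "English description to crisp test case" means here, a hedged example (the function under test, slugify, and its module are hypothetical):

    import pytest
    from myproject.text import slugify  # hypothetical module

    # English description: "slugify lowercases, collapses runs of whitespace
    # into single hyphens, and drops punctuation."
    @pytest.mark.parametrize("raw, expected", [
        ("Hello World", "hello-world"),
        ("  spaces   everywhere ", "spaces-everywhere"),
        ("Punctuation, removed!", "punctuation-removed"),
    ])
    def test_slugify(raw, expected):
        assert slugify(raw) == expected

This is exactly the in-distribution, pattern-heavy kind of code described above.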
I'm curious what kind of code the author is writing and how closely they are examining the written tests.
IME, tests written by Sonnet 4.5 as well as Codex 5.1 are pretty poor at exercising interesting edge cases, and they are excessively verbose.