Lies, Damned Lies and Proofs: Formal Methods Are Not Slopless(lesswrong.com)
68 points by OgsyedIE 3 days ago | 36 comments
enum 5 hours ago | parent | next [-]

The post says this in other words: in Lean, Rocq, or any other theorem prover, you get a formally verified proof, but you do NOT get a formally verified theorem statement.

So, even if the proof is correct, you need to determine if the theorem is what you want. Making that determination requires expertise. Since you cannot "run the theorem", you cannot vibe-code your way through it. E.g., there is no equivalent of "web app seems to be working!" You have to actually understand what the theorems are saying in a deep way.
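
To make the gap concrete, here is a minimal hypothetical Lean sketch (the names are made up): the proof below is machine-checked, but the statement is far weaker than "mySort sorts its input", so a useless definition satisfies it.

    -- The spec never says the output is sorted, or a permutation of the
    -- input, so a function that throws everything away still "verifies".
    def mySort (_xs : List Nat) : List Nat := []

    theorem mySort_shrinks (xs : List Nat) : (mySort xs).length ≤ xs.length :=
      Nat.zero_le _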

SkiFire13 4 hours ago | parent | next [-]

Even then, this seems much more promising to me compared to other areas. Writing theorem statements is much, much easier than coming up with proofs, so it's not a big deal if a human has to do that. And once that's done, getting a correct proof out of an LLM/AI model can be fully automatic (assuming you do get a proof out of it, though!)

enum 2 hours ago | parent [-]

I'm not sure this is true. Encoding theorems in dependent types takes a lot of expertise.

Even without the Lean technical details, a lot of math theorems just don't mean anything to most people. For example, I have no idea what the Navier-Stokes theorem is saying. So, I would not be able to tell you if a Lean encoding of the theorem is correct. (Unless, of course, it is trivially broken, such as assuming False.)

loglog 2 hours ago | parent [-]

There is no "Navier-Stokes theorem". There is a famous class of open problems whether Navier-Stokes equations are well-posed (have solutions that don't explode in finite time) for various initial data, but that type of question is completely irrelevant for any practical purposes. I do share the feeling though. As a non-expert, I have no idea what the existing, allegedly practically relevant, formalizations of distributed algorithms actually guarantee.

enum 2 hours ago | parent [-]

Thanks. As I said, I have no idea. :)

sethev an hour ago | parent | prev | next [-]

Yeah, if you have a formal proof of something but you don't know exactly what, then you might as well vibe-code the system itself. I've been wondering if it would be possible to formally define a property (like deadlock freedom or linearizability) and then vibe-code the spec (and associated implementation).

I don't know if that's possible but it seems like that's where the value would be.

dgacmu an hour ago | parent [-]

It is, but we're not there yet. You can use a refinement-style system like Verus, in which your spec is a very high-level statement of system behavior, like "this system provides the properties that Paxos is supposed to guarantee" (1), and then be guaranteed that the things that implement it preserve that spec.

But this is still double-black-diamond expert territory: it remains challenging to identify ways to state and refine those properties so the proof goes through. It is getting easier by the year, though, and the LLMs are getting more useful as part of it.

https://github.com/verus-lang/verus

(1) You would have to state the consistency and conditional liveness properties directly, of course, as Verus doesn't understand "what Paxos does". Those would look like TLA+-style statements: "at some time in the future, a command submitted now will be executed", etc.
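
In temporal-logic notation, one such conditional liveness statement has roughly this shape (the predicate names are made up for illustration):

    □( Submitted(c) ∧ StableLeader  ⇒  ◇ Executed(c) )

i.e., whenever a command c has been submitted and the system has a stable leader, c is eventually executed; the "conditional" part is that Paxos only guarantees progress under such stability assumptions.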

koito17 4 hours ago | parent | prev [-]

In the case of Lean, propositions are encoded as (dependent) types, and the user typically has to write the statement themselves and then make use of, e.g., tactics to derive a term of said type.

Writing a statement you don't understand and then later getting a proof from an LLM doesn't seem all that useless to me; in my mind, it could still be useful as exploration. Worst-case scenario: you encoded a tautology and the LLM gave you a trivial proof. Best-case scenario: the proposition ends up being a lemma for something you want to prove.

I do think there is a kernel of truth in what you've stated: if the user does not actually understand the statement of a proposition, then the proof is not very useful to them, since they don't know what the statement's truth implies. As someone who used to do mathematics, I still find immense value in vibe-coding away mundane computations, similar to what Lean's `simp` tactic already does, but much broader.
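
For a sense of what "mundane" means here, a minimal example of the kind of routine goal that `simp` already discharges on its own:

    -- simp closes routine rewriting goals like this automatically
    example (n : Nat) : n + 0 + 0 = n := by simp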

enum 3 hours ago | parent [-]

The worst case is that you vibe code a theorem that reads:

False => P

Then you vibe code a proof of this theorem. Then you get excited that you’ve proven P.
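
Concretely, a minimal hypothetical Lean sketch of that failure mode: this compiles and is machine-checked, yet it says nothing about P on its own.

    -- A vacuous "theorem": provable for any P whatsoever.
    theorem excited (P : Prop) : False → P :=
      fun h => h.elim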

Some of the X discussion that prompted the OP was quite close to this. There are screenshots on X of Lean code that doesn’t compile, with Lean being blamed.

koito17 an hour ago | parent [-]

Right, the risk is encoding a vacuously true statement. I consider "false implies Q" and tautologies to be vacuously true statements, since both are truths that fail to convey any meaningful information.

In any case, the worst-case scenario is having a vacuously true statement. I picked tautologies as an example since those and statements about the empty set are the first things that come to mind.

lmeyerov an hour ago | parent | prev | next [-]

I've had an interesting comparative experience so far using Alloy, which is a 'simpler' formal-methods approach that you can think of as smarter exhaustive fuzzing for your properties of interest.

We have been building GFQL, an open source graph query language that can run at the compute tier, separate from whatever existing storage system (Splunk, SQL, cs, ...), making graphs easy for interactive notebooks/dashboards/analytics/pipelines/agents without needing graph DBs, even at billion scale. Part of the trick is that it is a bit unusual inside, too: we are making Cypher-style queries run with columnar operations (purely vectorized for both CPU and GPU via a multi-target dataframe abstraction layer). That is taking a few innovations in the interpreter algorithm, so we worry about subtle semantic bugs. Most of our bugs can surface with small (graph, query) combos, so this sounded perfect for Alloy's small-scope hypothesis!

So we are running a natural experiment on which approach does more bug-finding:

- asking Claude Code to analyze new code or recently found issues and using that for unit-test amplification

- asking Claude Code to use the Alloy model checker to prove out our space of (query, graph) pairs for key methods

- asking Claude Code to port a large Cypher conformance test suite

So far, Claude Code as a test amplifier is doing so much better than the others that it's our default preference. The real gem is when we make it do a '5 whys' root-cause analysis of why some bug got through, and then it does test amplification around several categories of potential error.

We found the conformance suite to be 'class imbalanced', in that it focuses on a couple of features we are adding next, so the jury is still out on that one. And finally... Alloy hasn't really found anything. We suspect we need to change how we use Alloy to be more reflective of the kinds of bugs we find, but zero was disappointing.

If anyone is into this kind of thing, happy to discuss/collab!

Paracompact 5 hours ago | parent | prev | next [-]

> Third, you need to decide how far “down the stack” you want to go. That is to say, the software you want to verify operates over some kind of more complex system, for instance, maybe it’s C code which gets compiled down to X86 and runs on a particular chip, or maybe it’s a controller for a nuclear reactor and part of the system is the actual physical dynamics of the reactor. Do you really want your proof to involve specifying the semantics of the C compiler and the chip, or the way that the temperature and other variables fluctuate in the reactor?

I can appreciate what he's getting at, but my utopian vision for the future is that we won't need to reinvent the wheel like this each time we want verified software! E.g., for high-consequence systems, the hard part of compiler correctness is already handled by the efforts of CompCert, and SystemVerilog assertions for the design guarantees of processors are becoming more commonplace.

markusde 4 hours ago | parent [-]

Yeah, but the problem is that programming languages and compilers change all the time, making it hard to maintain a formal model of them. Exceptions exist (CompCert C and WebAssembly are two good examples), but, for example, the semantics of raw pointers in Rust is intentionally under-defined because the compiler writers want to keep changing it.

utopiah an hour ago | parent | prev | next [-]

Reminds me of Z notation, good times... Sure, it's nice to have a harness, or a series of tests, but they only ensure that what's done within them is done within them. If it's poorly done, the result is still within the harness, but it's not magically correct.

Rochus 3 days ago | parent | prev | next [-]

Interesting article, thanks. There is indeed a "semantic gap". However, there is also a practical solution: bidirectional LLM translation. You can verify the formal specification by back-translating it to natural language with another LLM session, allowing human review at the intent level rather than requiring expertise in e.g. Event-B syntax (see https://rochuskeller.substack.com/p/why-rust-solves-a-proble...). This addresses the concern about "mis-defining concepts" without requiring the human to be a formal methods expert. The human can review intent and invariants in natural language, not proof obligations. The AI handles the mathematical tedium while the human focuses on domain correctness, which is exactly where human expertise belongs.

viraptor 4 hours ago | parent | next [-]

> there is also a practical solution: bidirectional LLM translation

It's a solution only if the translation is proven correct. If not, you're in the same place as you started.

ratmice 6 hours ago | parent | prev [-]

Why do we invent these formal languages, except to be more semantically precise than natural language? What does one gain, besides familiarity, by translating back into a more ambiguous language?

Mis-defining concepts can be extremely subtle. If you look at the allsome quantifier https://dwheeler.com/essays/allsome.html you'll see that these problems predate AI, and I struggle to see how natural language is going to help in cases like the "All Martians" one, where the confusion may be over whether Martians exist at all, something relatively implicit.
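
A minimal hypothetical Lean sketch of that "All Martians" pitfall: with an empty domain, the universal statement is provable for free, whatever "green" means.

    -- No constructors: there are no Martians.
    inductive Martian : Type

    -- Vacuously true for any property at all.
    example (green : Martian → Prop) : ∀ m : Martian, green m :=
      fun m => nomatch m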

Rochus 5 hours ago | parent | next [-]

We build pretty complex systems based only on "natural language" specifications. I think you are conflating specification ambiguity with verification accessibility.

> What does one gain besides familiarity by translation back into a more ambiguous language?

You gain intent verification. Formal languages are precise about implementation, but they are often opaque about intent. A formal specification can be "precisely wrong". E.g. you can write a perfectly precise Event-B spec that says "When the pedestrian button is pressed, the traffic light turns Green for cars"; the formalism is unambiguous, the logic is sound, the proof holds, but the intent is fatally flawed. Translating this back to natural language ("The system ensures that pressing the button turns the car light green") allows a human to instantly spot the error.

> All Martians are green

Modern LLMs are actually excellent at explicating these edge cases during back-translation if prompted correctly. If the formal spec allows vacuous truth, the back-translation agent can be instructed to explicitly flag existential assumptions. E.g. "For every Martian (assuming at least one exists), the color is Green", or "If there are no Martians, this rule is automatically satisfied". You are not translating back to casual speech; you are translating back to structured, explicit natural language that highlights exactly these kinds of edge cases.

ratmice 4 hours ago | parent [-]

Maybe it can be done, but I struggle to believe that adding in that branch for every forall quantifier (of which there may be plenty in a proof) is going to make a proof more understandable. Rather, I feel like it'll just balloon the number of words necessary to explain the proof. It feels like it's going to fall on the bad side of verbosity, as the sibling comment said.

Rochus 3 hours ago | parent [-]

I think there is a misunderstanding about what is being back-translated.

We don't back-translate the proof steps (the thousands of intermediate logical derivations). That would indeed be verbose and useless.

We back-translate the specification: the Invariants, Guards, and Events.

For a traffic light system, we don't need the LLM to explain the 50 steps of predicate logic that prove inv3 holds. We just need it to translate inv3 itself:

    Formal: inv3: light_NS = Green ⇒ light_EW = Red

    Back-translation: 'Invariant: If the North-South light is Green, the East-West light MUST be Red.'

This isn't verbose; it's the exact concise summary of the system's safety rules. The 'verbosity' of handling edge cases (like the 'Allsome' example) only applies when the specification itself relies on subtle edge cases, in which case, being verbose is exactly what you want to prevent a hidden bug.

ratmice 2 hours ago | parent [-]

Definitions are built up layer upon layer like an onion too, with each step adding its own invariants, reducing the problem space.

I just feel like the traffic light example is an extremely small, free-standing example. Most things that I feel are worth the effort of proving end up huge. Formal verification languages have forever been denigrated for being overly rigid and too verbose; I feel like translations into natural language can only increase that if they are accurate.

One thing I wish is that this whole discussion were less intertwined with AI. The semantic gap existed before AI and will be run into again without AI. People have been accidentally proving the wrong thing true or false forever, and they will never stop, with or without AI's help.

At the very least we can agree that the problem exists, and while I'm skeptical of natural language as being anything but the problem we ran away from, at least you're trying something and exploring the problem space, and that can only be cheered.

Rochus 2 hours ago | parent [-]

My bet is that AI changes the economics of that verbosity, making it cheap to generate and check those 'huge' definitions layer by layer. The next four years will show.

lindenr 5 hours ago | parent | prev | next [-]

I agree, if AI (or humans) have mistranslated a natural language statement to a formal statement, we should not rely on AI to correctly translate the formal statement back into natural language.

For many statements I expect it's not possible to retain the exact meaning of the formal-language sentence without the natural language becoming at least as complex, and if you don't retain meaning exactly then you're vulnerable to the kind of thing the article warns about.

guenthert 3 hours ago | parent [-]

> if AI (or humans) have mistranslated a natural language statement to a formal statement, we should not rely on AI to correctly translate the formal statement back into natural language.

Perhaps we must not rely on it and should find a way to make sure that it cannot fail, but I'd like to point out that these are two different problems, and it seems to me that the current crop of so-called AIs is pretty good at distilling excerpts. Perhaps that's the easier problem to solve?

smarx007 5 hours ago | parent | prev [-]

> why do we invent these formal languages except to be more semantically precise than natural language

To be... more precise?

On a more serious note, I cannot recommend enough "Exactly: How Precision Engineers Created the Modern World" by Winchester. While the book talks mostly about precision in mechanical engineering, it made me appreciate _precision_ itself to a greater degree.

ratmice 5 hours ago | parent [-]

Rhetorical question? My point is that back-translation into natural language is translating into a less precise form. How is that going to help? No number of additional abstraction layers is going to solve human confusion.

smarx007 5 hours ago | parent [-]

Oh well, that flew over my head. You are right.

crvdgc 5 hours ago | parent | prev | next [-]

Some valid points, but I wish the authors had developed them more.

On the semantic gap between the original software and its representation in the ITP, program extraction as in Rocq probably deserves some discussion: there, the software is written natively in the ITP and you have to prove the extraction itself sound. For example, MetaRocq did this for Rocq.

For the "how far down the stack" problem, there are some efforts from https://deepspec.org/, but it's inherently a difficult problem and often gets less love than lab-environment projects.

paulajohnson 5 hours ago | parent | prev | next [-]

A formal specification language is a programming language that we don't know how to compile.

If we can use AI to automatically implement a formal spec, then that formal specification language has just become a programming language.

MaxBarraclough 2 hours ago | parent | next [-]

> A formal specification language is a programming language that we don't know how to compile.

Not really, on both counts.

Firstly, they're not really programming languages in the usual sense, in that they don't describe the sequence of instructions that the computer must follow. Functional programming languages are considered 'declarative', but they're still explicit about the computational work to be done. A formal spec doesn't do this; it just expresses the intended constraints on the correspondence between input and output (very roughly speaking).

Secondly, regarding the 'we don't know how to compile it' aspect: 'constraint programming' and SMT solvers essentially do this, although they're not a practical way to build most software.

auggierose 4 hours ago | parent | prev [-]

It is more general than that: A programming language is a formal specification language that we know how to compile.

There are plenty of formal specifications that cannot be compiled, not even by an AI. If you use AI, how do you make sure that the AI compiler compiles correctly?

MaxBarraclough 2 hours ago | parent [-]

> A programming language is a formal specification language that we know how to compile

Plenty of real programming languages are ambiguous in ways that surely disqualify them as formal specification languages. A trivial example in C: decrementing an unsigned int variable that holds 0. The subtraction is guaranteed to wrap around, but the value you get depends on the platform, per the C standard.

> There are plenty of formal specifications that cannot be compiled, even not by an AI. If you use AI, how do you make sure that the AI compiler compiles correctly?

By proving that the code satisfies the formal spec. Getting from a formal spec to a program (in an imperative programming language, say) can be broken down into several stages of 'refinement'. A snippet from [0]:

> properties that are proved at the abstract level are maintained through refinement, hence are guaranteed to be satisfied also by later refinements.

[0] https://www.southampton.ac.uk/~tsh2n14/publications/chapters...

psuedobrain 5 hours ago | parent | prev | next [-]

Slight nitpick, but isn't Agda based on the MLTT family instead of the CoC family of languages?

rafaelbeirigo 4 hours ago | parent | prev [-]

For the case of Propositional Logic, ChatGPT reflects the current epistemological crisis. When asked for help on a question, it could not properly apply the Law of the Excluded Middle [1].

1. https://chatgpt.com/share/696b7f8a-9760-8006-a1b5-89ffd7c5d2...

throwaway17_17 3 hours ago | parent [-]

I would love for this to turn out to be some internal constraint where the LLM cannot 'reason' about LEM and will always fall back to an understanding based in constructive logic. However, I am more ready to accept that LLMs aren't actually 'reasoning' about anything, and that it's an inherent flaw in how we talk about the algorithms as though they were actually thinking 'minds' instead of very fancy syntax-completion machines.
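
For reference, the distinction is explicit in Lean itself: excluded middle is not part of the constructive core, but is available via the classical axioms.

    -- Classical.em is derived from the choice axiom; #print axioms on a
    -- theorem using it reports the dependency on Classical.choice.
    example (P : Prop) : P ∨ ¬P := Classical.em P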

AnimalMuppet 29 minutes ago | parent [-]

The problem is that both constructive logic and "normal" logic are part of the training data. You might be able to say "using constructive logic, prove X". But even that depends on none of the non-constructive training data "leaking" into the part of the model that it uses for answering such a query. I don't think LLMs have hard partitions like that, so you may not get a purely constructive proof even if that's what you ask for. Worse, the non-constructive part may not be obvious.