"Running leading LLMs across the eleven systems shows that LLMs are great at producing correct TLA+ syntax but struggle to ensure conformance and appropriate invariants."

▲

zihotki 8 hours ago | parent | next [-]

I think that's what author implies by this sentence in the intro:

> It’s still your responsibility to understand your system and define what “correctness” means, and you need a high-level understanding of temporal logic.

▲

nyrikki 5 hours ago | parent [-]

As TLA+ uses state machines that can define infinite state spaces, checking arbitrary temporal logic formulas is undecidable in the general case.

Using a TLC model checker to verify invariants or properties, finding a valid counterexample can scale non-elementarily.

In fact, in some situations it can end up being Ackermann-complete[0], but even recursively enumerable problems can be Tower-complete.

I would say that unless you focus on more detailed temporal logic pitfalls you may have issues.

So IMHO this will stay a use case specific solution, that you choose based on context.

Even a common solution like adding Circumscription causes counterintuitive changes [1][2].

IMHO, if you want to use TLA+ as a primary method, you will need some depth or be ready to abandon it by time boxing etc…

Remember that we know the open domain frame problem in [2] is equal to HALT, it will not universally apply.

It is just another tool that works well when it works well.

[0] https://arxiv.org/abs/2104.13866

[1] https://arxiv.org/abs/2407.20822

[2] https://en.wikipedia.org/wiki/Circumscription_(logic)

	▲	baq 3 hours ago \| parent [-]
		OTOH if you’re developing a service with a frontend and a backend you are firmly in the distributed system territory even if you only have one user, so it doesn’t have to work all the time; it’s enough if it works approximately more often than not to be an improvement.

▲

baq 3 hours ago | parent | prev [-]

This blog post made it more difficult than it should to find the actual tool (needed to click twice! abysmal, really): https://github.com/specula-org/Specula

Ran it on my tricky caching changes and it found bugs the standard agentic review flows missed on all frontier models. Good stuff.