I wonder suppose you are not relying on any tricks or funky shenanigans. is it possible to just throw random stuff at Lean and find interesting observations based if it approves? like use an automated system or an llm that tries all types of wild proofs/theories? and sees if it works? maybe im asking wrong questions though or not stating it well.. this is above my understanding, i could barely get my head around prolog.

▲

Paracompact 10 days ago | parent | next [-]

As someone who does certified programming professionally, I believe generative AI and formal methods are a match made in heaven. I might even go so far as to wager that the notion that human programmers will or will not be replaced by LLM-based AI is entirely dependent on whether these AI can get good at certified programming + compositional reasoning.

> is it possible to just throw random stuff at Lean and find interesting observations based if it approves?

Traditional AI has an easy time with checkers, because the search space is small. Chess is a little harder. Go still cannot be reasonably tackled by non-machine learning AI. Meanwhile, the search space (number of available moves together with the diversity of explorable states) for a sophisticated formal language becomes unfathomably large.

When the problem is known to be of a certain nature, often you can time-efficiently brute force it via SMT solvers. Traditionally SMT solvers and proof assistants have been separate branches of formal methods, but they're finally learning to play off each other's strengths (cf. Sledgehammer, Lean-SMT).

> like use an automated system or an llm that tries all types of wild proofs/theories? and sees if it works?

Research in this vein needs to be made more mainstream. I'm excited, though, that there have been big funders behind these ideas for years now, even before LLMs became big. Cf. "Learning to Find Proofs and Theorems by Learning to Refine Search Strategies" for earlier work, or DeepSeek-Prover for more recent attempts. I'm no machine learning expert, but it seems it's still a very open question how to best train these things and what their future potential is.

All in all mainstream LLMs are still rather mediocre at languages like Rocq and Lean. And when they're wrong, their proof scripts are extremely tedious to try to troubleshoot and correct. But I have hope AI tooling in formal methods will mature greatly over time.

▲

benreesman 10 days ago | parent [-]

I'm already doing almost all my LLM-assisted programming in Haskell, all the devops in check-heavy Nix, and starting to move everything into Dhall.

There's no way that Python written by LLMs will survive contact with Haskell written by LLMs in a standup competition.

One imagines the scope for this sort of thing goes very high in formality.

▲

yncytt 9 days ago | parent | next [-]

> There's no way that Python written by LLMs will survive contact with Haskell written by LLMs in a standup competition.

That's likely true for formal correctness.

But what about the disparity between python's large and mature ecosystem compared the skimpiness of Haskell's libraries? How is an LLM going to compensate for Haskell's lack of math, AI/ML, graphics, web, and other key infrastructures?

An LLM writing Haskell must produce new code where production-grade libraries don't exist, and that Haskell app will be bigger than python's. That Haskell app will be relatively unverified compared to an LLM-written python app, simply because all that missing library code is brand new and hasn't proven itself the real world.

What about the disparity in LLM training data across many application domains? Given little Haskell training data, an LLM will be more likely to hallucinate and will require more human iterations before producing a production-ready app.

> One imagines the scope for this sort of thing goes very high in formality.

Haskell's type system is great, but only as far as it goes.

	▲	benreesman 9 days ago \| parent [-]
		I've found Hackage to just completely shit on PyPI in terms of: - there's a good library - its relatively clear which 2 or 3 represent standard choices - its going to be well-tested including performance The Haskell library ecosystem is a lot like the Python one with the bottom 80-90℅ lowest quality stuff sliced off. There's one of two things and they work. I was needing an HTTP2 client the other day, which a good one? Long pull in any language. Haskell has it of course. You're right on data science stuff but that's everythinf: Julia and R get killed on libraries in data science. Lattner is working on that problem and he's not nailed it yet, its a "hard for Chris Lattner" problem. One thing that might be throwing you off (it misled me at first) is that when a Haskell library has had very little work for a few years? That usually means its done, it changes when the world of foobar changes and only when the world of foobar changes. There are exceptions to this, I wish effects and optics libraries had more consensus, but mostly there's a right one and it works great.

▲

Paracompact 10 days ago | parent | prev | next [-]

Yeah, it makes me sad that the skill of LLMs at a language is directly proportional to the popularity of a language, which is itself inversely proportional to the formal guardrails placed on the language by a type system.

▲

fragmede 10 days ago | parent | prev [-]

Serious question: Why not have it generate assembly?

	▲	benreesman 9 days ago \| parent [-]
		Because that is a movement in the opposite direction from the one yielding the improved results. Putting an LLM on formal system guardrails is the best of both worlds: high tempature stochastic generation gated by theory-laden checks. Having a high temperature model operate with no safety net more than a segfault sounds like the worst of both worlds.

▲

danabramov 10 days ago | parent | prev [-]

This is an active area of research and experimentation!

Much of Lean community are on Zulip (which is kind of like a forum?) and you can see many relevant threads here: https://leanprover.zulipchat.com/#narrow/channel/219941-Mach...