| ▲ | StilesCrisis 3 days ago |
| Buried about two-thirds of the way through the article: “I’m about 50% of the way through the formalization plan…” As soon as I saw this, I became a skeptic.

Anyone who has used AI tools has seen cases where the first 80% of a project came together like a bolt of lightning, but the last 20% is next to impossible for the AI to accomplish, even if it doesn’t seem more complex than the rest of the code. AI just struggles as the code gets larger and the requests get more specific.

Anyway, never trust a programmer to accurately judge the “50%” mark. Once they are at 90%, they will need another 90% to cross the finish line. |
|
| ▲ | danabramov 3 days ago | parent | next [-] |
| I actually want to push back a bit on this. I don’t have the context on what the author has done, but proof engineering is a bit different from software engineering.

In particular, if you get the definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress. You don’t have “last minute surprises” which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above). But there’s also a much clearer sense that if the “contours” of the proof plan are correct, filling them in is much more parallelizable and more straightforward than with programming.

I think this is partially due to the language being pure (so there are no surprising side effects, global state, or “time”) and partially due to the “if it typechecks, you’re done” nature of proofs. Also, mathematics is (unsurprisingly) composable — more so than run-of-the-mill software development. |
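A toy Lean sketch of the workflow described above (a hypothetical illustration, not taken from the article; the lemma names are invented): the intermediate lemmas are stated up front, each can be proved independently, and the top-level theorem only glues them together.

```lean
-- Hypothetical example: each lemma can be handed out and proved on its
-- own; the final theorem is just their composition.
theorem step1 (n : Nat) : n + 0 = n := Nat.add_zero n

theorem step2 (n : Nat) : 0 + n = n := Nat.zero_add n

-- Once this typechecks, the overall argument is closed.
theorem contour (n : Nat) : n + 0 = 0 + n := by
  rw [step1, step2]
```

If either lemma statement were wrong, the failure would surface at the `rw` line during checking rather than at runtime, which is the sense in which the surprises are front-loaded.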
| |
| ▲ | wenc 3 days ago | parent | next [-] | | > “if it typechecks, you’re done” nature of proofs.

This resonates with me. Software engineering is more art than math -- you have to deal with corner cases, exceptions, human interactions. So much of code encodes culture, preferences, storytelling, etc. It's kind of a "procedural epistemology": you are codifying -- without any guarantee of correctness -- the behaviors that you know.

Pure math doesn't work like that -- it is more declarative and is built on a structure of correct statements. (Numerical math is a different animal -- it's closer to engineering -- but proofs are pure math.) It's far more limited, restrictive, and rigorous than imperative code.

It's like writing in Haskell. If it compiles in Haskell, the result might not be useful in real life, but there's a degree of consistency that languages like Python cannot enforce. Even Haskell, for practical reasons, only aims at consistency rather than correctness. Proof assistants like Lean and Coq go all the way with the Curry-Howard correspondence. | |
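A minimal Lean illustration of the Curry-Howard point (a hypothetical example; the names `and_swap` and `prodSwap` are invented): the proposition is a type, a proof is a term of that type, and elaboration succeeding is the proof being checked.

```lean
-- The proposition `p ∧ q → q ∧ p` is a type; the term below is its proof.
theorem and_swap {p q : Prop} (h : p ∧ q) : q ∧ p :=
  ⟨h.2, h.1⟩

-- The same shape at the data level, to stress the proofs-as-programs analogy.
def prodSwap {α β : Type} (x : α × β) : β × α :=
  (x.2, x.1)
```

In Haskell, the second definition is roughly as far as the type system takes you; in Lean, the first one is an ordinary definition too, which is the gap the comment is pointing at.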
| ▲ | yencabulator 9 hours ago | parent | prev | next [-] | | The article contradicts this:

> There’s often a push-pull between these tasks. To give a recent example, suppose I want to prove that a function is associative (task 4). I realize that this theorem is false—oops—because of the underlying type that I've chosen (task 5). Maybe this is a problem I can avoid by decomposing other theorems differently (task 3). But in this case, I realize it’s a hard requirement, and I need to replace the current type with something else—for example, a finite map—and then refactor the whole formalization accordingly (task 2). I also need to check whether my theory is mathematically correct, or if I’ve made some conceptual mistake which renders important theorems unprovable (task 1).

> My most frustrating experiences with Lean have rarely involved proofs of individual theorems. More often, some previous decision has unexpectedly blocked me, and I’m facing a choice between several different complex refactors to fix it. This is what happened with the example above—in the end, I decided to rewrite everything with finite maps, which meant rewriting several definitions and proofs. | |
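A simplified Lean analogue of the representation problem quoted above (hypothetical code, and it uses commutativity rather than the article's associativity example): an association-list representation makes an intended equation false at the level of `=`, which is the kind of thing that forces a switch to a finite-map type.

```lean
-- Hypothetical representation: an environment as an association list,
-- with "union" as list append.
abbrev Env := List (String × Nat)

def union (a b : Env) : Env := a ++ b

-- Append is associative, but union of environments is not commutative
-- under `=`, because the list remembers an ordering the mathematical
-- object shouldn't have:
#eval union [("x", 1)] [("y", 2)]   -- [("x", 1), ("y", 2)]
#eval union [("y", 2)] [("x", 1)]   -- [("y", 2), ("x", 1)]
```

Switching `Env` to a genuine finite-map type with extensional equality is the kind of whole-formalization refactor the quoted passage describes.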
| ▲ | cole-k 2 days ago | parent | prev | next [-] | | I don't really see there being much of a difference between the modeling surprises and the last-minute surprises only uncovered when the program runs. If the gist of what you're saying is that an experienced proof engineer should have fewer modeling surprises than an experienced software engineer has last-minute code-execution surprises, then I can sort of agree. But the "contours" of a proof you're describing are basically analogous to decomposing an application into individual components (modules/functions) --- and if you do this and test your components, you should expect your software to not have any "last minute surprises" except for modeling failures as well.

I guess the point I mostly disagree with is the feasibility of what you describe here:

> In particular, if you get definitions right and if your plan actually makes sense mathematically (i.e. the proof is possible using your planned breakdown), you have a much more linear sense of progress.

This presupposes that you know how the proof should be written, which means that you not only have an understanding of the proof mathematically (whatever your definition of "mathematically" is), but you ALSO know how to warp that understanding into a machine proof. The author notes that the "getting the definitions right" part is slower than writing out the definitions themselves, for a variety of reasons which ultimately mean they had to carefully comb through the definitions to make sure there were no logic errors.

So I don't really disagree with your overall point (which is that the author can probably be trusted to say they are actually at 50%), but I think you oversell the assurances granted by formal verification when compared to software engineering. | |
| ▲ | danabramov 2 days ago | parent [-] | | It’s not necessarily smooth sailing, for sure. But in my (limited!) experience, there is a noticeable difference, so that’s what I wanted to note.

There indeed can be a gap between “it works in my head” and convincing the machine to do it, and sometimes the gap is wide. For example, the type-theoretic way of looking at things (as customary in Lean) might not match the idioms in the paper proof, or some crucial lemmas known to be true might have never been formalized. Or the tactics to prove some piece are getting too slow. Or you got the definitions wrong or they’re annoying to deal with. So that is its own set of difficulties.

Even with that, the nature of difficulties feels different from much of software work to me. In software, I find that “divide and conquer” rarely works and I need to “grow” the program from a primitive version to a more sophisticated one. But in proofs, “divide and conquer” actually sort of works — you gain some assurances from making a high-level proof with holes in the middle, then filling in those holes with more intermediate steps, and so on. You can actually work “top down” to a meaningful extent more than in software.

To put it another way, with proofs you can work top-down and bottom-up at the same time, making real progress while a bunch of steps are stubbed out. In programming, “stubbing out” non-trivial parts of the program is often annoying or impossible, and you can only make progress on a smaller (but complete) piece of it, or a less ambitious (but still complete) piece of it. This is because you can’t run an incomplete program, and there’s a set of knowledge you only gain from running it. But you absolutely can check an incomplete proof, with arbitrary granularity of incompleteness.

So I agree with you it’s not necessarily linear. But it’s also not exactly like software, and analogies from software work don’t apply 1:1. |
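A toy sketch of the "check an incomplete proof" point (a hypothetical example, not from the thread): the skeleton below typechecks while both holes are still stubbed out with `sorry`, so the top-down shape can be validated before any hole is filled.

```lean
-- The high-level argument is laid out first; Lean accepts it (with
-- warnings) while the intermediate steps are still `sorry`.
theorem skeleton (a b c : Nat) : (a + b) + c = c + (b + a) := by
  have h1 : (a + b) + c = c + (a + b) := by sorry  -- hole, to be filled later
  have h2 : a + b = b + a := by sorry              -- hole, to be filled later
  rw [h1, h2]
```

Here both holes happen to be instances of `Nat.add_comm`, and each can be discharged independently without touching the rest of the skeleton.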
| |
| ▲ | ndriscoll 2 days ago | parent | prev | next [-] | | Your comment basically perfectly describes why Scala is a wonderfully pleasant language to work with. I could review code and know that if it had the right high level steps and if it compiled, it was probably right. When writing code you figure out the high level steps and then write whatever you need to make it type check. If you don't go out of your way to do something perverse, it's probably correct. | |
| ▲ | t_mann 2 days ago | parent | prev | next [-] | | > You don’t have “last minute surprises”

Not sure about that. One of the steps in your plan may be surprisingly hard to prove. You're going to solve the easier steps in your plan first and then may get bogged down at 98%. | |
| ▲ | danabramov 2 days ago | parent [-] | | Sure, which is why the words you copied are followed by these in my comment:

> … which are only uncovered when the program runs. You do have surprises with modeling (hence my caveats above)

If you already have a paper proof (which is the case in the original post), and you know it’s correct, then you already know all steps are provable — but if they aren’t translated correctly into the theorem prover’s model, or if some details don’t translate well, or if you’re missing some existing result that hasn’t been formalized, you can get stuck. AI doesn’t make that worse though. | |
| ▲ | t_mann 2 days ago | parent [-] | | > If you already have a paper proof (which is the case in the original post), and you know it’s correct

You may think it's correct until you realize that your argument transitively depends on some obscure results from the 70s that nobody had ever questioned but that turned out to be flawed. Probably salvageable, but now you suddenly need to prove something completely different that you might not even understand very well. There's an interview with Kevin Buzzard somewhere where he talks about how that happened to them (couldn't find a reference rn). |
|
| |
| ▲ | theendisney 2 days ago | parent | prev [-] | | To formalize: After a few days 90% works, after a few more days 90% of the remaining 10% works, after a few more days.... long story short, it never works. |
|
|
| ▲ | williamstein 3 days ago | parent | prev | next [-] |
| AI is very fast at generating working, self-contained, complicated code, which makes it too easy to establish a difficult-to-finish architecture. |
| |
| ▲ | 3 days ago | parent | next [-] | | [deleted] | |
| ▲ | WalterSear 3 days ago | parent | prev [-] | | If you don't specify otherwise, yes. With respect, if vibe coding isn't giving you the results you want, the problem is no longer the AI. | | |
|
|
| ▲ | aunty_helen 3 days ago | parent | prev | next [-] |
| Seeing this in consulting/contract dev too. People come to me with “80% done” but in reality it’s the 80/120 rule: the last 20% takes 120% of the time the project should have taken. |
| |
| ▲ | bjackman 3 days ago | parent [-] | | Ah, the classic:

"That will be about 40 hours of work."

"Well, I've done 80% of the work already, so-"

"Oh, then that will be about 80 hours of work." |
|
|
| ▲ | lkey 3 days ago | parent | prev | next [-] |
| Christ, you are so correct.

> I’m about 50% of the way through the formalization plan

> I think formalizing using AI was way slower than if I’d done it by hand.

combined with their aversion to learning something hard alone:

> I found learning Lean annoying and difficult. So I started looking for shortcuts.

The author failed utterly according to their own definition of success. Moreover, how will they know they have completed the task if they don't want to fully understand what they've 'written'?

The paper presents a false trichotomy:

(1) go to grad school and learn to do it yourself
(2) hope one of the few theorem-proving experts got interested in your problem
(3) give up.

and the magical fourth:

(4) Look, Just Try Claude Code (at just $100 per month!)

"got interested" is so utterly passive, I've got a fifth option:

(5) Meet other people who share your interests on Discord and forums and chat about the problems you are trying to solve. Make them interested! Nerd snipe them! Make friends, meet colleagues, and learn new things while having a genuine human experience, for free! |
|
| ▲ | mmaunder 3 days ago | parent | prev | next [-] |
| You’re absolutely right about this. |
|
| ▲ | tj800x 2 days ago | parent | prev | next [-] |
| My experience is that until the last sorry is eliminated, the math might look like it will all fit together, yet ultimately not fit together. |
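A toy Lean example of the failure mode described above (hypothetical, not tj800x's code): the skeleton typechecks with the sorry in place, but the stubbed step is actually false, so this particular plan can never be completed even though the goal itself is provable.

```lean
-- This compiles (with a warning), yet the `have` is false for n ≥ 1,
-- so the last `sorry` can never be eliminated without a refactor.
theorem bad_skeleton (n : Nat) : n + 1 > 0 := by
  have h : n + 1 = 1 := by sorry  -- looks like a harmless hole, but it is unprovable
  rw [h]
  decide
```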
|
| ▲ | wenc 3 days ago | parent | prev | next [-] |
| > Anyone who has used AI tools has seen cases where the first 80% of a project came together like a bolt of lightning, but the last 20% is next to impossible for the AI to accomplish

Offtopic: doing the inverse works better. If you rely on AI to write 80% of your code, you end up with code you don't understand, so the last 20% is really difficult. You can't even guide the AI correctly at that point.

I actually do the opposite. I write 80% of my code, have an LLM read it, and it helps me iterate faster on the last 20% (which are mostly tedious things like unit tests, corner cases, optimizations, etc). This works surprisingly well because now the AI has a structure to start from, and it's only doing incremental code completion. As a human, I'm also not that great at detecting corner cases -- this is where an LLM really helps because it has seen patterns I've never seen.

In fact, I use coding agents to combine multiple sources of information (not just code) like design docs, LaTeX, etc., and I ask the LLM to detect if the code actually implements the math in the LaTeX correctly. Claude actually does a pretty decent job at that. |
|
| ▲ | catigula 3 days ago | parent | prev [-] |
| The problem is you're thinking of projects in the sense of 0-100. If you think of projects in the sense of 0-20, 20-40, 40-60, 60-80, 80-100 you realize AI is here to stay and is going to change everything. |
| |
| ▲ | lazide 3 days ago | parent | next [-] | | WYSIWYG GUI editing is also ‘here to stay’. | |
| ▲ | 0x1ceb00da 3 days ago | parent | prev [-] | | 80-100 will take orders of magnitude more time (might even be impossible) if you don't understand 0-80 thoroughly. | | |
| ▲ | catigula 3 days ago | parent [-] | | Nope. I regularly do 80-100 chunks in a mature codebase. Literally every day at work. I know it feels good to pretend this isn't happening but it's too dangerous to pretend. | | |
| ▲ | 0x1ceb00da 3 days ago | parent [-] | | How many LOC? | | |
| ▲ | catigula 3 days ago | parent [-] | | I don't really know how to answer that; it's a highly fragmented legacy codebase that is responsible for many millions in revenue.

I do know that, given my experiences, it's time to sound the alarms, as many others have. We're clearly in an emergency situation, and it's frustrating to see this ignored due to various mechanisms, some of which are huge geopolitical incentives to pump out propaganda that AI is safe and we shouldn't slam the brakes, which would cause the US to lose to China or vice versa. Every nation-state actor is trying tirelessly to attenuate panic, but it's here, it's really happening. | |
|
|
|
|