Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

flowerbreeze 4 hours ago | parent | next [-]

They haven't made the chart very clear, but it seems the number of passes is configurable: at 2 passes it's better than Haiku and Sonnet, and at 16 passes it starts closing in on Opus (though it's not quite there), while consistently being less expensive than Sonnet.

	ainch an hour ago | parent | next [-]
		pass@k means that you run the model k times and give it a pass if any of the answers is correct. I guess Lean is one of the few use cases where pass@k actually makes sense, since you can automatically validate correctness.
	andai 4 hours ago | parent | prev [-]
		Oh my bad. I'm not sure how that works in practice. Do you just keep running it until the tests pass? I guess with formal verification you can run it as many times as you need, right?
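
For what it's worth, the pass@k idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate` and `verify` are hypothetical stand-ins for "ask the model for a proof" and "run the proof checker", and the probability helper assumes each attempt succeeds independently with the same probability `p`, which real models won't satisfy exactly.

```python
from typing import Callable, Optional


def first_verified(generate: Callable[[], str],
                   verify: Callable[[str], bool],
                   k: int) -> Optional[str]:
    """Sample up to k candidates; return the first one that verifies, else None."""
    for _ in range(k):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None


def pass_at_k_probability(p: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumes (hypothetically) independent attempts, each succeeding with probability p.
    """
    return 1.0 - (1.0 - p) ** k


# Toy stand-ins (hypothetical): a "model" that walks through three guesses
# and a "checker" that accepts exactly one of them.
guesses = iter(["sorry", "simp", "rfl"])
result = first_verified(lambda: next(guesses), lambda proof: proof == "rfl", k=3)
print(result)                                     # rfl
print(round(pass_at_k_probability(0.2, 16), 3))   # 0.972
```

So yes, in this sketch you just keep sampling until the checker accepts or you run out of budget. That only works because the checker is automatic and trusted; with a weaker check (say, a test suite), extra attempts also buy extra chances at a false positive.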

DrewADesign 4 hours ago | parent | prev [-]

It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.

miacycle 3 hours ago | parent [-]

Assuming that what you're dealing with is assertable. I guess what I mean to say is that in some situations it is difficult to articulate what is correct and what isn't, depending upon the situation in which the software executes.

	DrewADesign 2 hours ago | parent [-]
		And Bob’s your uncle.