▲ dataviz1000 | 7 hours ago
lambench is single-attempt: one shot per problem. I don't think they understand how LLMs work. To truly benchmark a non-deterministic, probabilistic model, you need to run each problem on the order of 45 times. LLMs are distributions over outputs and behave accordingly. The more interesting story is how a model behaves on the same problem after 5 samples, 15 samples, and 45 samples. That said, lambda calculus is a brilliant subject for benchmarking: the models are reliably incorrect. [0]
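Reporting performance over 5, 15, or 45 samples is usually done with a pass@k-style metric. A minimal sketch, assuming the unbiased combinatorial estimator popularized by OpenAI's HumanEval evaluation (the function name and numbers here are illustrative, not from lambench):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k draws is correct),
    given n total samples of which c were correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 45 samples, 9 correct: estimate pass@5, pass@15, pass@45
for k in (5, 15, 45):
    print(k, pass_at_k(45, 9, k))
```

The estimator is monotonically non-decreasing in k, which is why single-attempt (pass@1) numbers understate what repeated sampling can recover.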
▲ yorwba | 4 hours ago | parent | next [-]
Why 45 times in particular? If you want 80% power to distinguish a model at 50% from a model at 51%, you need 39,440 samples per model, or 329 samples per question per model. But that would just give you a more precise estimate of how well the model does on those 120 questions in particular. If you want a more precise estimate of how well the model might do on future questions you come up with, you'll need to test more questions, not just test the same question more times. | ||||||||
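The sample-size figure comes from a standard power analysis for comparing two proportions. A rough sketch using the normal approximation (stdlib only; the exact number depends on the test variant, so this lands near, not exactly on, the 39,440 quoted):

```python
from statistics import NormalDist
from math import ceil

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Samples per group to detect p1 vs p2 with a two-sided
    two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n = n_per_group(0.50, 0.51)  # roughly 39,000+ samples per model
print(n, ceil(n / 120))      # per model, and per question for 120 questions
```

Shrinking the detectable gap by a factor of ten multiplies the required samples by a hundred, which is why tiny leaderboard differences are rarely statistically meaningful.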
▲ UltraSane | 4 hours ago | parent | prev [-]
Even people benefit from multiple tries over time. | ||||||||