winton 8 hours ago

So if I try to do it with Opus three or four times, I'll get it done? And probably in about 10 minutes? Awesome

stared 8 hours ago | parent | next [-]

Nope, these are not random dice rolls. Some tasks are solved on every run and a few only occasionally (for those it would be meaningful to try a few times, and the pass@1 and pass@3 metrics would differ), but most are never solved.

See e.g.: https://quesma.com/benchmarks/otel/models/claude-opus-4.5/

throwup238 8 hours ago | parent | prev [-]

That’s only if the failures are truly random and aren’t correlated
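
A rough sketch of why retrying helps less than "three or four tries" suggests. The per-task solve rates below are made up for illustration and are not taken from the benchmark; `pass_at_k` is a hypothetical helper that assumes attempts on the same task are independent. Both task sets have the same pass@1, but only the "dice roll" set gains much from extra attempts:

```python
def pass_at_k(per_task_p, k):
    """Expected fraction of tasks solved at least once in k attempts.

    Assumes attempts on a given task are independent with solve
    probability p (an illustrative assumption, not a measured fact).
    """
    return sum(1 - (1 - p) ** k for p in per_task_p) / len(per_task_p)


# Hypothetical per-task solve rates, 10 tasks each, same average of 0.3.
dice = [0.3] * 10                               # every task is a 30% dice roll
clustered = [1.0] * 2 + [0.5] * 2 + [0.0] * 6   # always / sometimes / never solved

for name, tasks in [("dice", dice), ("clustered", clustered)]:
    print(f"{name}: pass@1={pass_at_k(tasks, 1):.3f} pass@3={pass_at_k(tasks, 3):.3f}")
# dice: pass@1=0.300 pass@3=0.657
# clustered: pass@1=0.300 pass@3=0.375
```

In the clustered case the six never-solved tasks put a ceiling on pass@k at 0.4 no matter how many retries you spend.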