| ▲ | winton 8 hours ago | |
So if I try to do it with Opus three or four times, I'll get it done? And probably in about 10 minutes? Awesome | ||
| ▲ | stared 8 hours ago | |
Nope, these are not random dice rolls. Some tasks are solved on every run, a few only occasionally (for those it would be meaningful to try a few times, and the pass@1 and pass@3 metrics would differ), but most are never solved. See e.g.: https://quesma.com/benchmarks/otel/models/claude-opus-4.5/ | ||
| ▲ | throwup238 8 hours ago | |
That’s only if the failures are truly random and aren’t correlated | ||
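
To make the difference concrete, here is a rough sketch with made-up numbers (not taken from the benchmark). It compares two hypothetical task sets that share the same pass@1 but behave very differently on retries. The per-task probabilities and the `pass_at_k` helper are assumptions for illustration, and attempts on the same task are assumed independent.

```python
def pass_at_k(task_probs, k):
    """Expected fraction of tasks solved at least once in k attempts.

    Assumes attempts on the same task are independent, each succeeding
    with that task's probability p.
    """
    return sum(1 - (1 - p) ** k for p in task_probs) / len(task_probs)


# Hypothetical per-task success probabilities (illustrative only).
dice_rolls = [0.3] * 10            # every task is a 30% coin flip
all_or_nothing = [1.0] * 3 + [0.0] * 7  # 3 always solved, 7 never solved

for name, probs in [("dice_rolls", dice_rolls), ("all_or_nothing", all_or_nothing)]:
    print(name, round(pass_at_k(probs, 1), 3), round(pass_at_k(probs, 3), 3))
```

Both sets have pass@1 of 0.3, but only the dice-roll set improves with retries. When most tasks sit near probability 0 or 1, as described above, trying three or four times buys very little.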