| ▲ | cjsaltlake 5 hours ago | |
SWE-bench was created to replace olympiad coding benchmarks. I think past olympiad coding benchmarks were much less representative of real-world coding than something like SWE-bench, which is derived from real units of labor. Further, olympiad-style benchmarks are arguably easier to contaminate / memorize unless you refresh them regularly; but that goes for SWE-bench too. | ||
| ▲ | rustyhancock 4 hours ago | |
I was picturing a benchmark that only counts one-shot performance on novel real-world tasks, i.e. a score on the March Olympiad that you got in April isn't relevant. It should be simple enough that anyone could run it with a regular subscription. Really, the providers won't ditch the gameable benchmarks unless we make them. But industries love nothing more than a benchmark they can manipulate. | ||