rustyhancock 7 hours ago
I think an Olympiad format is better, but the financial incentive is such that it might be near impossible to stop leaks. The idea: a panel comes up with a series of problems, like Advent of Code or Project Euler but more complex and constrained. Benchmark outcomes could combine performance points with measures of cost and time to solution (well, token count really). It's run a couple of times per year, which avoids overfitting. Over time the tasks can become more complex if needed. And if labs benchmax models into being able to complete full products from spec with robust implementations, amazing.
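A minimal sketch of the kind of composite scoring described above (performance points discounted by cost and token count). Every field name and weight here is a made-up assumption for illustration, not part of any real benchmark:

```python
from dataclasses import dataclass

@dataclass
class Result:
    problems_solved: int   # problems solved in this run
    total_problems: int    # problems in the set
    usd_cost: float        # inference spend, in dollars (hypothetical field)
    tokens_used: int       # proxy for time-to-solution (hypothetical field)

def score(r: Result, cost_weight: float = 0.1, token_weight: float = 1e-6) -> float:
    """Performance points minus penalties for cost and token count.

    Weights are arbitrary placeholders; a real panel would tune them.
    """
    points = 100.0 * r.problems_solved / r.total_problems
    return points - cost_weight * r.usd_cost - token_weight * r.tokens_used

# 18/25 solved, $12.50 spent, 3.2M tokens used
print(round(score(Result(18, 25, 12.50, 3_200_000)), 2))  # → 67.55
```

The point of folding cost and tokens into the score is that a model can't climb the leaderboard just by brute-forcing with huge token budgets.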
cjsaltlake 5 hours ago | parent
SWE-bench was created to replace olympiad coding benchmarks. I think past olympiad coding benchmarks were much less representative of real-world coding than something like SWE-bench, which is derived from real units of labor. Further, olympiad-style benchmarks are arguably easier to contaminate / memorize unless you refresh them regularly; but that goes for SWE-bench too.