rustyhancock 7 hours ago
I think an Olympiad format is better, but the financial incentive is such that it might be near impossible to stop leaks. The idea: a panel comes up with a series of problems, like Advent of Code or Project Euler but more complex and constrained. Benchmark outcomes could combine performance points with measures of cost and time to solution (well, token count really). It's run a couple of times per year, which avoids overfitting. Over time the tasks can become more complex if needed. And if labs benchmax models into being able to complete full products from spec with robust implementations, amazing.
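A minimal sketch of the kind of composite scoring described above (performance points discounted by cost and token count). Every field name and weight here is a made-up assumption for illustration, not part of any real benchmark:

```python
from dataclasses import dataclass

@dataclass
class Result:
    problems_solved: int   # problems solved in this run
    total_problems: int    # problems in the set
    usd_cost: float        # inference spend, in dollars (hypothetical field)
    tokens_used: int       # proxy for time-to-solution (hypothetical field)

def score(r: Result, cost_weight: float = 0.1, token_weight: float = 1e-6) -> float:
    """Performance points minus penalties for cost and token count.

    Weights are arbitrary placeholders; a real panel would tune them.
    """
    points = 100.0 * r.problems_solved / r.total_problems
    return points - cost_weight * r.usd_cost - token_weight * r.tokens_used

# 18/25 solved, $12.50 spent, 3.2M tokens used
print(round(score(Result(18, 25, 12.50, 3_200_000)), 2))  # → 67.55
```

The point of folding cost and tokens into the score is that a model can't climb the leaderboard just by brute-forcing with huge token budgets.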
cjsaltlake 5 hours ago | parent
SWE-bench was created to replace olympiad coding benchmarks. I think past olympiad coding benchmarks were much less representative of real-world coding than something like SWE-bench, which is derived from real units of labor. Further, olympiad-style benchmarks are arguably easier to contaminate / memorize unless you refresh them regularly; but that goes for SWE-bench too.