ggillas 6 hours ago

This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.

From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.

SlinkyOnStairs 4 hours ago | parent | next [-]

> hopefully changes the way benchmarking is done

The purpose of a system is what it does.

AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"

tedsanders an hour ago | parent | next [-]

I work at OpenAI and I really don't find this to be the case.

We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.

There are a ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.

I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple-model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas, like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions, is that cheating, or is that patching the eval to better align it with user value?

Legend2440 43 minutes ago | parent | prev | next [-]

>The purpose of a system is what it does.

I am so tired of this saying.

It's not true, in general. Systems almost universally have unintended consequences and result in side effects their designers did not foresee.

Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench.

hrimfaxi 13 minutes ago | parent [-]

I think the point is that if the side effects become known and are accepted, or if they are known and rejected, then indeed the purpose of the system is what it does.

anon373839 2 hours ago | parent | prev [-]

That is Anthropic’s shtick to a tee.

operatingthetan 6 hours ago | parent | prev | next [-]

>hopefully changes the way benchmarking is done.

Yeah, the path forward is simple: check whether the submissions actually contain solutions. If they contain exploits, the entire result is discarded.
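A minimal sketch of that idea (the file layout and the specific check are assumptions, not from any real harness): before scoring, reject any submission whose diff touches the evaluation harness or test files, since that's where most of the exploits in the paper live.

```python
# Hypothetical protected paths -- real harnesses would need a fuller policy.
HARNESS_PATHS = ("tests/", "harness/", "grading/")

def touches_harness(changed_files):
    """Return True if any changed file lives under a protected path."""
    return any(f.startswith(HARNESS_PATHS) for f in changed_files)

def score(changed_files, passed, total):
    """Discard the entire result if the submission tampers with the harness;
    otherwise score it as the usual pass fraction."""
    if touches_harness(changed_files):
        return 0.0
    return passed / total
```

This only catches exploits that modify checked-in files; it wouldn't catch, say, a trojanized binary wrapper, which is exactly why the paper argues evaluation needs to be adversarially designed rather than patched check by check.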

siva7 6 hours ago | parent | next [-]

Could it really be that not only do we vibeslop all apps nowadays, but we also don't care to check how the AI actually solved a benchmark it claims to have solved?

SpicyLemonZest 5 hours ago | parent | next [-]

Frontier model developers do try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether the model didn't memorize, or whether your memorization check just failed to catch it?

retinaros 3 hours ago | parent | prev | next [-]

Every AI lab trains on the test set. That is a big part of why we see benchmark scores climbing from 1% to 30% after a few model iterations.

operatingthetan 6 hours ago | parent | prev [-]

Probably a more interesting benchmark is one that is scored based on the LLM finding exploits in the benchmark.

ZeroGravitas 6 hours ago | parent | prev | next [-]

In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.

lambda 5 hours ago | parent [-]

Unfortunately, very few LLM benchmarks do this. LLMs get such high scores on many benchmarks because there's no difference between answering "I don't know" and giving a made-up answer, and a made-up answer improves the score some of the time. So by chasing higher numbers on these kinds of benchmarks, the labs are prioritizing guessing over accuracy.

The Artificial Analysis Omniscience benchmark does penalize guessing, so it actually helps you determine which LLMs are likely to guess rather than tell you they don't know. Only a few of the frontier models score above 0 on it, where 0 means the model is as likely to return a hallucination as a correct answer on factual questions.
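The expected-value arithmetic behind negative marking is simple. A small sketch, using the common convention that a wrong answer costs 1/(n-1) marks (the penalty value is an illustration, not taken from any particular benchmark):

```python
def expected_guess_score(p_correct, n_choices):
    """Expected marks per question when guessing: +1 with probability
    p_correct, and a penalty of 1/(n_choices - 1) otherwise."""
    penalty = 1 / (n_choices - 1)
    return p_correct * 1 - (1 - p_correct) * penalty

# With 4 choices, a blind guess (p = 1/4) has expected value exactly 0:
# 1/4 * 1 - 3/4 * 1/3 = 0. Abstaining also scores 0, so guessing no
# longer inflates the total, and confident wrong answers cost you.
blind = expected_guess_score(1 / 4, 4)
```

The same idea transfers to LLM evals: scoring "I don't know" as 0 and a wrong answer as negative removes the incentive to hallucinate.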

Leynos 6 hours ago | parent | prev | next [-]

Also, fuzz your benchmarks

Aperocky 2 hours ago | parent | prev [-]

solution is simple:

if bug { dont }

/s

robot-wrangler 2 hours ago | parent | prev | next [-]

> evaluation was not designed to resist a system that optimizes for the score rather than the task.

Welcome to benchmarks in general, but especially reasoning benchmarks. Robustness and sensitivity research says nothing is robust and everything is sensitive; it feels like every paper says "yeah, we made a new benchmark that shuffles the order of multiple-choice options in the question set and found a 40% drop in model performance."

zer00eyz 6 hours ago | parent | prev [-]

2024: Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance https://www.tomshardware.com/pc-components/cpus/spec-invalid...

2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...

It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.

I like what LLMs are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so many of the hard lessons learned over the last 50 years of computing. It is doing itself a disservice.

bee_rider 5 hours ago | parent | next [-]

What was the cheat in the 2024 Intel situation? The TomsHardware article and the Phoronix article they linked were quite vague. (Not to say I have any doubts, just curious, hadn’t heard of this one).

BugsJustFindMe 2 hours ago | parent [-]

Intel basically benchmaxxed their compiler optimizations. They used detailed knowledge of the benchmark to make their compiler generate machine code to do better on the benchmark in a way that was not beneficial for non-benchmark scenarios.

irishcoffee 6 hours ago | parent | prev [-]

> It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.

I wonder if this is common? We should call it Goodhart's law while someone does the research on how common it is.

For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.