mzelling 2 days ago

I'm not sure if the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could in theory be gamed."

Whether benchmark results are misleading depends more on the reporting organization than on the benchmark itself. Integrity and competence play large roles here. When OpenAI reports a benchmark number, I trust it more than when the same number comes from a couple of Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.

jmye 2 days ago | parent | next [-]

I think that’s totally fair!

I guess I look at this less as an "ah ha! They're all cheating!" and more as a "were you guys even aware of what the benchmarks represented and how they checked them?"

mzelling 2 days ago | parent [-]

That's a great way to look at it. The paper is a reality check for anyone who thinks of benchmarks as these monolithic, oracular judges of performance. It highlights the soft underbelly of benchmarking.

lukev 2 days ago | parent | prev [-]

Did you read the article? There's a whole section on "this is already happening."

mzelling 2 days ago | parent [-]

Yes, I did see that section. We've known for a while that reward hacking, train/test data contamination, etc. must be taken seriously. Researchers are actively guarding against these problems. This paper explores what happens when researchers flip their stance and actively try to reward hack — how far can they push it? The answer is "very far."