NitpickLawyer 2 days ago

The problem with benchmarks is that they are really useful for honest researchers, but extremely toxic if used for marketing, clout, etc. Something something, every measure that becomes a target sucks.

It's really hard to trust anything public (for the obvious reason of dataset contamination), but also some private benchmarks (because providers do get most/all of the questions over time, and they can do sneaky things with them).

The only true tests are the ones you write yourself, never publish, and run only against open models. If you want to test commercial SotA models from time to time, you need to consider those tests "burned" and come up with new ones.

rachofsunshine 2 days ago | parent | next [-]

What makes Goodhart's Law so interesting is that you transition smoothly between two entirely different problems the more strongly people want to optimize for your metric.

One is a measurement problem, a statement about the world as it is: an engineer who can finish such-and-such many steps of this coding task in such-and-such time has such-and-such chance of getting hired. The thing you're measuring isn't running away from you or trying to hide itself, because facts aren't conscious agents with the goal of misleading you. Measurement problems are problems of statistics and optimization, and their goal is a function f: states -> predictions. Your problems are usually problems of inputs, not problems of mathematics.

But the larger you get, and the more valuable gaming your test is, the more you leave that measurement problem and find an adversarial problem. Adversarial problems are at least as difficult as your adversary is intelligent, and they can sometimes be even worse by making your adversary the invisible hand of the market. You don't live in the world of gradient descent anymore, because the landscape is no longer fixed. You now live in the world of game theory, and your goal is a function f: (state) x (time) x (adversarial capability) x (history of your function f) -> predictions.
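A toy sketch of that shift (the numbers and the "gaming" model below are invented purely for illustration, not anyone's actual metric): fit a noisy score to a fixed population and it predicts the underlying quality well; let the population react to the score round after round and the same score's predictive value decays.

    import random, statistics

    random.seed(0)

    def corr(xs, ys):
        # Pearson correlation, used as a crude "predictive value" of the metric
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

    skill = [random.gauss(0, 1) for _ in range(2000)]

    # Measurement regime: score = skill + noise, and nothing reacts to the score.
    score = [s + random.gauss(0, 0.5) for s in skill]
    print("measurement regime:", round(corr(skill, score), 2))

    # Adversarial regime: each round, candidates pour more effort into gaming the
    # score, so a growing share of it has nothing to do with skill.
    gaming = [0.0] * len(skill)
    for rnd in range(1, 6):
        gaming = [g + random.expovariate(1.0) * rnd for g in gaming]
        score = [s + random.gauss(0, 0.5) + g for s, g in zip(skill, gaming)]
        print(f"adversarial round {rnd}:", round(corr(skill, score), 2))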

It's that last, recursive bit that really makes adversarial problems brutal. Very simple functions can rapidly result in extremely deep chaotic dynamics once you allow even the slightest bit of recursion - even very nice functions like f(x) = 3.5x(1-x) become writhing ergodic masses of confusion.
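You can watch that happen in a few lines. One caveat on the specific coefficient: at 3.5 the logistic map actually settles into a period-4 cycle, so this sketch nudges it to 3.9 (an illustrative choice inside the chaotic regime) to show two nearly identical starting points tearing apart under pure iteration.

    def logistic(r: float, x: float) -> float:
        return r * x * (1.0 - x)

    r = 3.9
    a, b = 0.2, 0.2 + 1e-9   # starting points differing by one part in a billion
    for step in range(1, 51):
        a, b = logistic(r, a), logistic(r, b)
        if step % 10 == 0:
            print(f"step {step:2d}: x = {a:.6f}, divergence = {abs(a - b):.6f}")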

pixl97 a day ago | parent | next [-]

I would also assume Russell's paradox needs to be added in here too. Humans can and do hold sets of conflicting information; my theory is that those conflicts carry an informational/processing cost to manage. In benchmark gaming you can optimize for processing speed by removing the conflicting information, but you lose real-world reliability in the process.

visarga a day ago | parent | prev | next [-]

Well said. The problem with recursion is that it constructs its own context as it goes and rewrites its own rules; you cannot predict it statically, without forward execution. That's why we have the halting problem - recursion is irreducible. A benchmark is a static dataset; it does not capture the self-constructive nature of recursion.

bwfan123 a day ago | parent | prev [-]

Nice comment. This is one reason why ML approaches may struggle in trading markets, where other agents are competing with you, possibly using similar algos - or in self-driving, which involves other agents who could be adversarial. Just training on past data is not sufficient, since existing edges get competed away and new edges keep arising out of nowhere.

crocowhile a day ago | parent | prev | next [-]

There is also a social issue that has to do with accountability. If you claim your model is the best and then it turns out you overfitted the benchmarks and it's actually 68th, your reputation should suffer considerably for cheating. If it does not, we have a deeper problem than the benchmarks.

mmcnl 2 days ago | parent | prev | next [-]

Yes, I ignore every news article about LLM benchmarks. "GPT 7.3o first to reach >50% score in X2FGT AGI benchmark" - ok thanks for the info?

antupis 2 days ago | parent | prev | next [-]

Also, even if you want to be honest, at this point, probably every public or semipublic benchmark is part of CommonCrawl.
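For any particular benchmark page you can check this directly against the Common Crawl URL index (a rough sketch; the crawl label is just an example, pick a current one from index.commoncrawl.org, and the URL pattern below is hypothetical):

    import requests

    def in_common_crawl(url_pattern: str, crawl: str = "CC-MAIN-2024-10") -> bool:
        # The CDX index returns matching captures as JSON lines; a non-empty
        # 200 response means the pattern was captured in that crawl snapshot.
        resp = requests.get(
            f"https://index.commoncrawl.org/{crawl}-index",
            params={"url": url_pattern, "output": "json"},
            timeout=30,
        )
        return resp.status_code == 200 and bool(resp.text.strip())

    # hypothetical usage:
    # print(in_common_crawl("example.com/my-benchmark/*"))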

NitpickLawyer 2 days ago | parent [-]

True. And it's even worse than that, because each test probably gets "talked about" a lot in various places. And people come up with variants. And those variants get ingested. And then the whole thing becomes a mess.

This was noticeable with the early Phi models. They were originally trained fully on synthetic data (cool experiment tbh), but the downside was that GPT-3/4 was "distilling" benchmark "hacks" into them. It became apparent when new benchmarks were released after the models' publication date, and one of them measured "contamination" of about 20+%. Just from distillation.
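How such a "contamination" percentage gets measured varies; one common approach (and only a guess at what was used there) is n-gram overlap between the training text and the benchmark items - a minimal sketch:

    def ngrams(text: str, n: int = 8) -> set:
        # Overlapping word n-grams of a document, lowercased.
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def contamination_rate(train_docs: list, bench_items: list) -> float:
        # Fraction of benchmark items sharing at least one n-gram with training data.
        train_grams = set()
        for doc in train_docs:
            train_grams |= ngrams(doc)
        hits = sum(1 for item in bench_items if ngrams(item) & train_grams)
        return hits / len(bench_items) if bench_items else 0.0

    # hypothetical usage:
    # print(f"{contamination_rate(train_docs, bench_items):.1%} of items contaminated")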

ACCount36 a day ago | parent | prev | next [-]

Your options for evaluating AI performance are: benchmarks or vibes.

Benchmarks are a really good option to have.

klingon-3 2 days ago | parent | prev [-]

> It's really hard to trust anything public

Just feed it into an LLM, unintentionally hint at your bias, and voila, it will use research and the latest or generated metrics to prove whatever you’d like.

> The only true tests are the ones you write yourself, never publish, and only work 100% on open models.

This may be good enough, and that’s fine if it is.

But, if you do it in-house in a closet with open models, you will have your own biases.

No test is valid if all that ever mattered was the argument and, perhaps, some curated evidence.

All tests, private and public, have historically "proved" theories that turned out to be flawed.

Truth has always been elusive and under siege.

People will always just believe things. Data is just a foundation for pre-existing or fabricated beliefs. It's the best rationale for faith, because in the end, faith is everything. Without it, there is nothing.