fooker 4 hours ago

Yeah, these benchmarks are bogus.

Every new model overfits to the latest overhyped benchmark.

Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.

bunderbunder 2 hours ago | parent | next [-]

All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.
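That selection bias is easy to see in a tiny simulation (numbers here are illustrative, not from any real benchmark): evaluate 100 coin-flip "models" on one fixed test set and report the best score. Every model's true accuracy is exactly 50%, but the winner's measured score comes out well above that, purely from reusing the same test set.

```python
import random

random.seed(0)
n_test = 200
# Fixed test set of binary labels; every "model" below is a pure
# random guesser, so every model's TRUE accuracy is exactly 0.5.
labels = [random.randint(0, 1) for _ in range(n_test)]

def accuracy(preds):
    return sum(p == y for p, y in zip(preds, labels)) / n_test

# Evaluate 100 random models against the SAME test set and keep each score.
scores = [accuracy([random.randint(0, 1) for _ in range(n_test)])
          for _ in range(100)]

# The average score is an honest ~50%, but the best-of-100 score is
# inflated: that's the bias you get from model selection on a reused set.
print(f"mean: {sum(scores) / len(scores):.3f}, best: {max(scores):.3f}")
```

The inflation grows with the number of evaluations against the set and shrinks with test-set size, which is why large held-out (or frequently refreshed) benchmarks degrade more slowly.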

But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.

abustamam 2 hours ago | parent [-]

Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.

mrandish 3 hours ago | parent | prev | next [-]

> Yeah, these benchmarks are bogus.

It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, prompting, sampling settings, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.

fooker 3 hours ago | parent [-]

For the current state of AI, the harness is unfortunately part of the secret sauce.

scoring1774 2 hours ago | parent | prev [-]

This has been done: https://arxiv.org/abs/2510.04871v1