NitpickLawyer · 2 days ago
The problem with benchmarks is that they are really useful for honest researchers, but extremely toxic if used for marketing, clout, etc. Something something, every measure that becomes a target sucks.

It's really hard to trust anything public (for obvious reasons of dataset contamination), but also some private ones (for the obvious reason that providers do get most/all of the questions over time, and they can do sneaky things with them). The only true tests are the ones you write yourself, never publish, and only run on open models. If you want to test commercial SotA models from time to time, you need to consider those tests "burned" and come up with new ones.
rachofsunshine · 2 days ago
What makes Goodhart's Law so interesting is that you transition smoothly between two entirely different problems the more strongly people want to optimize for your metric.

One is a measurement problem, a statement about the world as it is: an engineer who can finish such-and-such many steps of this coding task in such-and-such time has such-and-such chance of getting hired. The thing you're measuring isn't running away from you or trying to hide itself, because facts aren't conscious agents with the goal of misleading you. Measurement problems are problems of statistics and optimization, and their goal is a function f: states -> predictions. Your problems are usually problems of inputs, not problems of mathematics.

But the larger you get, and the more valuable gaming your test becomes, the more you leave that measurement problem behind and find an adversarial problem. Adversarial problems are at least as difficult as your adversary is intelligent, and they can sometimes be even worse by making your adversary the invisible hand of the market. You don't live in the world of gradient descent anymore, because the landscape is no longer fixed. You now live in the world of game theory, and your goal is a function f: (state) x (time) x (adversarial capability) x (history of your function f) -> predictions.

It's that last, recursive bit that really makes adversarial problems brutal. Very simple functions can rapidly produce extremely deep chaotic dynamics once you allow even the slightest bit of recursion: even very nice functions like f(x) = 3.9x(1-x) become writhing ergodic masses of confusion under iteration.
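The recursion point is easy to demonstrate concretely. A minimal sketch (function names and the constant r = 3.9, which sits well inside the logistic map's chaotic regime, are my own illustration): two trajectories started a billionth apart separate to order one within a hundred iterations.

```python
def logistic(x, r=3.9):
    """One step of the logistic map x -> r*x*(1-x); r = 3.9 is chaotic."""
    return r * x * (1 - x)

def iterate(x0, n, r=3.9):
    """Apply the map n times starting from x0."""
    x = x0
    for _ in range(n):
        x = logistic(x, r)
    return x

def divergence(x0, eps=1e-9, n=100, r=3.9):
    """Largest gap observed between two trajectories started eps apart."""
    a, b = x0, x0 + eps
    worst = 0.0
    for _ in range(n):
        a, b = logistic(a, r), logistic(b, r)
        worst = max(worst, abs(a - b))
    return worst

# A 1e-9 perturbation is amplified roughly exponentially until it
# saturates at the size of the attractor itself.
print(divergence(0.2))
```

The same sensitivity is what makes the recursive f above so hard to work with: any model of your adversary's response to your metric is itself an input to their next response.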
crocowhile · a day ago
There is also a social issue that has to do with accountability. If you claim your model is the best and then it turns out you overfitted the benchmarks and it's actually 68th, your reputation should suffer considerably for cheating. If it does not, we have a deeper problem than the benchmarks.
mmcnl · 2 days ago
Yes, I ignore every news article about LLM benchmarks. "GPT 7.3o first to reach >50% score in X2FGT AGI benchmark" - ok, thanks for the info?
antupis · 2 days ago
Also, even if you want to be honest, at this point probably every public or semi-public benchmark is part of Common Crawl.
ACCount36 · a day ago
Your options for evaluating AI performance are: benchmarks or vibes. Benchmarks are a really good option to have.
klingon-3 · 2 days ago
> It's really hard to trust anything public

Just feed it into an LLM, unintentionally hint at your bias, and voilà: it will use research and the latest (or generated) metrics to prove whatever you'd like.

> The only true tests are the ones you write yourself, never publish, and only work 100% on open models.

This may be good enough, and that's fine if it is. But if you do it in-house, in a closet, with open models, you will have your own biases. No tests are valid if all that ever mattered was the argument and perhaps curated evidence.

All tests, private and public alike, have propped up flawed theories historically. Truth has always been elusive and under siege. People will always just believe things. Data is just a foundation for pre-existing or fabricated beliefs. It's the best rationale for faith, because in the end, faith is everything. Without it, there is nothing.