computerex 2 hours ago
The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real-world performance.
aspenmartin 2 hours ago
Again, correct, but it overstates the issue. I can say labs don't want this. It happened, arguably unintentionally, with Meta's Llama 4 release: it went horribly, heads rolled, something like several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed. Evals come from a million places, and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them are individually flawed, but taken together the aggregate signal is highly useful, since you more or less marginalize over a lot of different things. Not to mention that these companies have plenty of proprietary internal measurements: they build benchmarks themselves to probe their models, and they also have flywheel traffic and A/B tests. You are right to call out benchmarks, but to dismiss them or not take them seriously is a mistake.
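
A minimal sketch of the "aggregate signal" point, under loudly stated assumptions: two hypothetical models whose true quality differs by a small gap, and benchmarks whose errors are independent Gaussian noise. The names `ranks_correctly` and `accuracy`, the gap, and the noise level are all made up for illustration and are not taken from any real eval suite. It estimates how often one benchmark versus the mean of fifty ranks the two models correctly.

```python
import random
import statistics


def ranks_correctly(n_benchmarks, rng, true_gap=0.02, noise_sd=0.05):
    """True if the mean of n noisy score differences has the right sign.

    Assumption: each benchmark reports the true gap plus independent
    Gaussian noise. Real benchmark errors may be correlated.
    """
    diffs = [true_gap + rng.gauss(0.0, noise_sd) for _ in range(n_benchmarks)]
    return statistics.fmean(diffs) > 0


def accuracy(n_benchmarks, trials=2000, seed=0):
    """Fraction of simulated trials in which the better model is ranked first."""
    rng = random.Random(seed)
    hits = sum(ranks_correctly(n_benchmarks, rng) for _ in range(trials))
    return hits / trials


single = accuracy(1)       # one flawed benchmark on its own
aggregate = accuracy(50)   # mean over fifty flawed benchmarks
print(f"single benchmark: {single:.3f}")
print(f"mean of 50:       {aggregate:.3f}")
```

The improvement depends entirely on the independence assumption. If many benchmarks share the same bias (for example, because they were gamed in the same way), averaging does not remove it, which is the concern raised in the parent comment.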