amelius 3 hours ago:
Are you saying that the benchmarks are flawed? And could quantization maybe partially explain the worse-than-expected results?
TrainedMonkey 3 hours ago:
No, what he is saying is that benchmarks are static, and there is tremendous reputational and financial pressure to make the benchmark numbers go up. So you add specific problems to the training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure, there are problem sets designed to be kept secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.

I have two comments of my own to add. First, there is a problem-alignment issue at play: the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, whereas human tasks are open-ended, with messy prompts and a lot of steerage. Second, it would be interesting to test older models on brand-new benchmarks and see how they compare.
Aurornis 3 hours ago:
The models overperform on the benchmarks relative to general tasks. The benchmarks are public, so they're all but guaranteed to be in the training sets by now. That means the benchmarks are no longer an indicator of general performance, because the specific tasks have been seen before.

> And could quantization maybe explain the worse than expected results?

You can use the models cheaply through various providers on OpenRouter, without quantization.
girvo 3 hours ago:
Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal of the model's actual ability in practice. Quantisation doesn't help, but even running full-fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding use, at least in my experience.