amelius 3 hours ago

Are you saying that the benchmarks are flawed?

And could quantization maybe partially explain the worse than expected results?

TrainedMonkey 3 hours ago | parent | next [-]

No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up. So you add specific problems to training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure there are problem sets designed to be secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.

I have two of my own comments to add to that. First is that there is a problem-alignment issue at play. Specifically, the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open ended, with messy prompts and much steerage. Second is that it would be interesting to test older models on brand-new benchmarks to see how they compare.

Aurornis 3 hours ago | parent | next [-]

> No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up.

That's a much better way to say it than I did.

These models are known for being open weights, but they're still products that Alibaba Cloud is trying to sell. They have Product Managers and PR and marketing people under pressure to get people using them.

This Venture Beat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.

It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.

amelius 2 hours ago | parent | prev [-]

There should be a way to turn the questions we ask LLMs into benchmarks.

That way, we can have a benchmark that is always up to date.
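A minimal sketch of that idea, assuming you have a log of user prompts: keep a hash set of prompts already published in earlier benchmark rounds, and draw each new round only from prompts that have never been released (so a model trained on an old round can't have seen the new one). Function names and structure here are hypothetical, just to illustrate the rotation.

```python
import hashlib


def build_rolling_benchmark(logged_prompts, seen_hashes, max_items=100):
    """Select fresh, never-before-published prompts as a new benchmark round.

    Prompts used in any past round (tracked by hash in seen_hashes) are
    skipped, so each round stays out of training data that may have
    absorbed earlier rounds. Duplicated prompts collapse to one entry.
    """
    benchmark = []
    for prompt in logged_prompts:
        h = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
        if h in seen_hashes:
            continue  # already seen: either a duplicate or previously published
        seen_hashes.add(h)
        benchmark.append({"id": h[:12], "prompt": prompt})
        if len(benchmark) >= max_items:
            break
    return benchmark


# Usage: the duplicate collapses, and a second round from the same
# logs yields nothing because every prompt is now "burned".
logs = [
    "Write a binary search in Go.",
    "Write a binary search in Go.",
    "Summarize this RFC.",
]
seen = set()
round1 = build_rolling_benchmark(logs, seen)
round2 = build_rolling_benchmark(logs, seen)
```

The hard part this sketch ignores is grading: open-ended prompts don't come with well-defined answers, which is exactly the alignment gap mentioned above.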

Aurornis 3 hours ago | parent | prev | next [-]

The models outperform on the benchmarks relative to general tasks.

The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.
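One common way to probe for this kind of contamination is n-gram overlap: check what fraction of a benchmark item's word n-grams appear verbatim in the training text. A minimal sketch (thresholds and n are illustrative, not from any particular eval):

```python
def ngrams(text, n=8):
    """Set of n-word shingles from whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap_fraction(benchmark_item, training_docs, n=8):
    """Fraction of the item's n-grams found verbatim in the training docs.

    1.0 means the item appears word-for-word somewhere in training data;
    0.0 means no n-word run is shared.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

In practice labs can only run this against data they control; a leaked benchmark scraped into a third-party corpus is invisible to the check, which is the "keeping secrets is hard" point above.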

> And could quantization maybe explain the worse than expected results?

You can use the models through various providers on OpenRouter cheaply without quantization.

girvo 3 hours ago | parent | prev [-]

Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal as to the model's actual ability in practice.

Quantisation doesn't help, but even running full-fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding use, at least in my experience.