TrainedMonkey 3 hours ago
No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make the benchmark numbers go up. So you add specific problems to the training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure, there are problem sets designed to be kept secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.

I have two comments of my own to add. First, there is a problem-alignment issue at play: the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open-ended, with messy prompts and a lot of steering. Second, it would be interesting to test older models on brand-new benchmarks and see how they compare.
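The second idea — scoring older models only on benchmarks published after their training cutoff — can be sketched as a simple date filter. This is a minimal illustration with entirely hypothetical model names, cutoffs, and benchmark release dates, not data about any real model:

```python
from datetime import date

# Hypothetical records: each model has a training-data cutoff date,
# each benchmark a public release date. All names/dates are made up.
models = {"model-2023": date(2023, 4, 1), "model-2025": date(2025, 1, 1)}
benchmarks = {"bench-a": date(2022, 6, 1), "bench-b": date(2024, 9, 1)}

def uncontaminated_pairs(models, benchmarks):
    """Return (model, benchmark) pairs where the benchmark was released
    after the model's training cutoff, so its problems could not have
    leaked into that model's training data."""
    return [(m, b)
            for m, m_cutoff in models.items()
            for b, b_release in benchmarks.items()
            if b_release > m_cutoff]

print(uncontaminated_pairs(models, benchmarks))
# Only the 2023-cutoff model paired with the 2024 benchmark qualifies.
```

In practice the hard part is knowing the true cutoff (continual fine-tuning blurs it) and whether benchmark problems circulated publicly before the official release, but the filter above captures the basic experimental design.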
Aurornis 3 hours ago
> No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up.

That's a much better way to say it than I did. These models are known for being open weights, but they're still products that Alibaba Cloud is trying to sell. They have product managers and PR and marketing people under pressure to get people using them. This VentureBeat article is basically a PR piece for the models and Alibaba Cloud hosting; the pricing table is right in the article.

It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.
amelius 2 hours ago
There should be a way to turn the questions we ask LLMs into benchmarks. That way, we can have a benchmark that is always up to date.