Public benchmarks can be trivially faked. LMArena is a bit harder to fake and is human-evaluated.

I agree it's misleading for them to hyper-focus on one metric, but public benchmarks are far from the only thing that matters. I place more weight on LMArena scores and private benchmarks.

▲

nl a day ago | parent | next [-]

Concentrating on LMArena cost Meta many hundreds of billions of dollars, and cost lots of people their jobs, with the Llama 4 disaster.

▲

moffkalast 2 days ago | parent | prev [-]

LMArena is so easy to game that it ceased to be a relevant metric over a year ago. People are not usable validators beyond "yeah, that looks good to me"; nobody checks whether the facts are correct.

▲

culi 2 days ago | parent | next [-]

Alibaba maintains its own separate version of LMArena, where the prompts are fixed and you simply judge the outputs:

https://aiarena.alibaba-inc.com/corpora/arena/leaderboard

▲

jug 2 days ago | parent | prev | next [-]

I agree; LMArena died for me with the Llama 4 debacle. And not only the gamed scores, but seeing with shock and horror the answers people found good. It does test something though: the general "vibe" and how human/friendly and knowledgeable it _seems_ to be.

▲

nabakin 2 days ago | parent | prev [-]

It's easy to game, and human evaluation data has its trade-offs, but it's way easier to fake public benchmark results. I wish we had a source of high-quality private benchmark results covering as many models as LMArena does. Having high-quality human evaluation data would be a plus too.

▲

moffkalast 2 days ago | parent [-]

Well, there was this one [0], which is a black box but hasn't really been kept up to date with newer releases. Arguably we'd need lots of these, since each one could be biased towards some use case or sell its test set to someone with more VC money than sense.

[0] https://oobabooga.github.io/benchmark.html

▲

nabakin 2 days ago | parent [-]

I know Arc AGI 2 has a private test set and they have a good amount of results[0] but it's not a conventional benchmark. Looking around, SWE Rebench seems to have decent protection against training data leaks[1]. Kagi has one that is fully private[2]. One on HuggingFace that claims to be fully private[3]. SimpleBench[4]. HLE has a private test set apparently[5]. LiveBench[6]. Scale has some private benchmarks but not a lot of models tested[7]. vals.ai[8]. FrontierMath[9]. Terminal Bench Pro[10]. AA-Omniscience[11].

So I guess we do have some decent private benchmarks out there.

[0] https://arcprize.org/leaderboard
[1] https://swe-rebench.com/about
[2] https://help.kagi.com/kagi/ai/llm-benchmark.html
[3] https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
[4] https://simple-bench.com/
[5] https://agi.safe.ai/
[6] https://livebench.ai/
[7] https://labs.scale.com/leaderboard
[8] https://www.vals.ai/about
[9] https://epoch.ai/frontiermath/
[10] https://github.com/alibaba/terminal-bench-pro
[11] https://artificialanalysis.ai/articles/aa-omniscience-knowle...