▲ moffkalast 2 days ago
LM Arena is so easy to game that it ceased to be a relevant metric over a year ago. People aren't useful validators beyond "yeah, that looks good to me"; nobody checks whether the facts are actually correct.
▲ culi 2 days ago | parent | next
Alibaba maintains its own separate version of lm-arena where the prompts are fixed and you simply judge the outputs.
▲ jug 2 days ago | parent | prev | next
I agree; LM Arena died for me with the Llama 4 debacle. It wasn't only the gamed scores, but seeing with shock and horror which answers people rated as good. It does test something, though: the general "vibe", and how human, friendly, and knowledgeable a model _seems_ to be.
▲ nabakin 2 days ago | parent | prev
It's easy to game, and human evaluation data has its trade-offs, but it's far easier to overfit to public benchmarks. I wish we had a source of high-quality private benchmark results across a vast number of models, the way LM Arena covers many models. Having high-quality human evaluation data on top of that would be a plus too.