mattas 2 days ago

Are benchmarks even the right way to measure LLMs? I ask not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!

Sir_Twist 2 days ago | parent | next

Not an expert in LLM benchmarks, but I generally think of benchmarks as being particularly good for measuring usefulness for certain use cases. Even if measuring LLMs is not as straightforward as, say, comparing read/write speeds between SSDs, if a certain model's responses are consistently measured as higher quality / more useful, surely that means something, right?
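
To make that concrete, here's a toy sketch of what such a quantitative comparison could look like. The prompts, reference answers, toy "models", and the crude containment scorer are all made-up placeholders, not any real benchmark:

    # Made-up eval set: (prompt, reference answer) pairs.
    PROMPTS_AND_REFS = [
        ("Capital of France?", "paris"),
        ("2 + 2 = ?", "4"),
        ("Opposite of hot?", "cold"),
    ]

    def score(response, reference):
        """Crude stand-in for a real metric: 1.0 if the reference appears in the response."""
        return 1.0 if reference.lower() in response.lower() else 0.0

    def benchmark(model_fn):
        """Average score of one model over the fixed prompt set."""
        scores = [score(model_fn(p), ref) for p, ref in PROMPTS_AND_REFS]
        return sum(scores) / len(scores)

    # Two toy "models" so the comparison runs end to end.
    model_a = lambda p: {"Capital of France?": "Paris", "2 + 2 = ?": "4"}.get(p, "no idea")
    model_b = lambda p: "no idea"

    print(benchmark(model_a))  # ~0.67
    print(benchmark(model_b))  # 0.0

If one model consistently comes out on top across prompt sets and runs, that gap is the "means something" I'm pointing at, even though any single score is a rough proxy for quality.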

olliepro 2 days ago | parent | prev

Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.

Wowfunhappy 2 days ago | parent

I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.

olliepro 21 hours ago | parent

It feels like this should work, but the breadth of knowledge in these models is so vast. Everyone knows how to taste, but not everyone knows physics, biology, math, every language… poetry, etc. Enumerating the breadth of valuable human tasks is hard, so both approaches suffer from the scale of the models’ surface area.

An interesting problem, since the creators of OLMO have mentioned that throughout training they use 1/3 of their compute just doing evaluations.

Edit:

One nice thing about the “critic” approach is that the restaurant (or model provider) doesn’t have access to the benchmark to quasi-directly optimize against.
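
A rough sketch of that idea, with a hypothetical split_eval helper: publish part of an eval set, hold the rest back, and report scores only on the private half, so a provider can't tune against it any more than a restaurant can prep for a critic who shows up unannounced:

    import random

    def split_eval(items, private_fraction=0.5, seed=0):
        """Shuffle eval items and hold some back; only the public half gets published."""
        rng = random.Random(seed)
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * private_fraction)
        return shuffled[cut:], shuffled[:cut]  # (public, private)

    public_set, private_set = split_eval(range(100))
    # Providers only ever see public_set; reported numbers come from private_set.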