olliepro 2 days ago

Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.

Wowfunhappy 2 days ago

I don’t have a good way to measure them, but I think they should be evaluated more like movies or restaurants: experienced critics try them and write reviews.

olliepro 21 hours ago

It feels like this should work, but the breadth of knowledge in these models is so vast. Everyone knows how to taste food, but not everyone knows physics, biology, math, every language, poetry, etc. Enumerating the breadth of valuable human tasks is hard, so both approaches suffer from the scale of the models’ surface area.

An interesting problem, since the creators of OLMo have mentioned that throughout training, they spend about a third of their compute just doing evaluations.

Edit:

One nice thing about the “critic” approach is that the restaurant (or model provider) doesn’t have access to the benchmark to quasi-directly optimize against.