Wowfunhappy 4 days ago

I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.

olliepro 3 days ago | parent [-]

It feels like this should work, but the breadth of knowledge in these models is so vast. Everyone knows how to taste, but not everyone knows physics, biology, math, every language… poetry, etc. Enumerating the breadth of valuable human tasks is hard, so both approaches suffer from the scale of the models’ surface area.

It's an interesting problem, since the creators of OLMo have mentioned that throughout training, they spend 1/3 of their compute just doing evaluations.

Edit:

One nice thing about the "critic" approach is that the restaurant (or model provider) doesn't have access to the benchmark, so it can't quasi-directly optimize against it.