aspenmartin an hour ago

Yeah, but it’s not right? You, I, or any number of institutions inside and outside academia can probe these models with an evolving landscape of evaluation sets, including ones unavailable to the developers. It’s just ignorance to claim benchmarks are useless or all being gamed. Choose your tools however you want, but don’t call them better than the many more carefully constructed setups and scaled evaluations.