How can the community tell if models overfit to these benchmarks?
Mainly by the composition of evals: a model that aces one benchmark while lagging across the rest of the suite is a red flag. Secondary metrics like parameter count and token cost add another sanity check. Not perfect, but useful.
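As a rough illustration of the "composition of evals" idea, here is a minimal sketch that flags a benchmark as a possible overfit target when a model's score on it sits far above that model's scores across the rest of the suite. The model names, benchmark names, scores, and the z-score threshold are all hypothetical placeholders, not real results.

```python
from statistics import mean, stdev

# Hypothetical scores (0-100), for illustration only.
scores = {
    "model_a": {"bench_1": 91.0, "bench_2": 62.0, "bench_3": 58.0, "bench_4": 60.5},
    "model_b": {"bench_1": 71.0, "bench_2": 68.0, "bench_3": 66.5, "bench_4": 70.0},
}

def flag_possible_overfit(bench_scores: dict[str, float], z_threshold: float = 2.0) -> list[str]:
    """Return benchmarks whose score is an outlier versus the rest of the suite."""
    flagged = []
    for bench, score in bench_scores.items():
        # Compare this benchmark against the model's other scores.
        others = [s for b, s in bench_scores.items() if b != bench]
        mu, sigma = mean(others), stdev(others)
        if sigma > 0 and (score - mu) / sigma > z_threshold:
            flagged.append(bench)
    return flagged

for model, bench_scores in scores.items():
    print(model, flag_possible_overfit(bench_scores))
```

In practice you would also normalize by the secondary metrics mentioned above (parameter count, token cost), since an outsized score from a small, cheap model is more suspicious than the same score from a frontier one.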