gdiamos 4 days ago

How can the community tell if models overfit to these benchmarks?

kovezd 4 days ago | parent [-]

By the composition of evals, plus secondary metrics like parameter size and token cost.

Not perfect, but useful.