kristianp 6 hours ago
This has similar problems to SWE-bench, in that models are likely trained on the same open-source projects that the benchmark uses.
yorwba 5 hours ago | parent
If all models are trained on the benchmark data, you cannot extrapolate the benchmark scores to performance on unseen data, but the ranking of different models still tells you something. A model that solves 95/98 benchmark problems may turn out much worse than that in real life, but probably not much worse than one that only solved 11/98 despite also training on the benchmark problems. This doesn't hold if some models trained on the benchmark and some didn't, but you can fix that by deliberately fine-tuning all models on the benchmark before comparing them. For a more in-depth discussion of this, see https://mlbenchmarks.org/11-evaluating-language-models.html#...