kristianp 6 hours ago
This has similar problems to SWE-bench, in that models are likely trained on the same open-source projects that the benchmark uses.
yorwba 5 hours ago | parent
If all models are trained on the benchmark data, you cannot extrapolate the benchmark scores to performance on unseen data, but the ranking of different models still tells you something. A model that solves 95/98 benchmark problems may turn out much worse than that in real life, but probably not much worse than one that only solved 11/98 despite also training on the benchmark problems. This doesn't hold if some models trained on the benchmark and some didn't, but you can fix that by deliberately fine-tuning all models on the benchmark before comparing them. For a more in-depth discussion of this, see https://mlbenchmarks.org/11-evaluating-language-models.html#...