Remix clone Hacker News

	▲	criemen 3 days ago
		Running evals isn't the problem; the problem is acquiring or building a high-quality, non-contaminated dataset. https://arxiv.org/abs/2506.12286 makes a very compelling case that SWE-bench (and, by extension, anything based on public source code) is most likely overestimating your agent's actual capabilities.