criemen 3 days ago

Running evals isn't the problem; the problem is acquiring or building a high-quality, non-contaminated dataset.

https://arxiv.org/abs/2506.12286 makes a very compelling case that SWE-bench (and by extension, anything that's based on public source code) is most likely overestimating your agent's actual capabilities.