▲ | criemen 3 days ago | |
Running evals isn't the problem; the problem is acquiring or building a high-quality, non-contaminated dataset. https://arxiv.org/abs/2506.12286 makes a very compelling case that SWE-bench (and, by extension, anything that's based on public source code) is most likely overestimating your agent's actual capabilities. |