▲ | lieret 5 days ago | |
[On the SWE-bench team] As someone pointed out, SWE-bench Verified is a subset of tasks that were reviewed to be solvable (i.e., have enough context in the task description) and that are scored with unit tests that aren't so specific that they rule out valid solutions. We've all read and analyzed a large number of agent trajectories. This loophole seems to have popped up only with the more recent models, and we simply weren't aware of it. As discussed in the GitHub issue, there's a fix in the new version of the SWE-bench containers (currently being rolled out) that makes sure the relevant commits aren't available. Part of what makes SWE-bench a very interesting benchmark is the enormous action space available to the agents that compete on it. However, that also means that unexpected things happen as models get better. We're currently working on making all agent runs easily browsable on a website (rather than having to download them from our AWS buckets) to get even more eyes on the trajectories. Thanks to everyone who uncovered this loophole. |