yorwba 5 days ago

The "Verified" part of "SWE-Bench Verified" means that there was plain "SWE-Bench" before it, which had actually not been verified at all and included a lot of tasks that didn't really make sense for use as a benchmark: https://openai.com/index/introducing-swe-bench-verified/#ada...

Data contamination, which stems from the benchmark being based on already-solved problems in public repositories, is a different issue. It cannot be addressed by verifying the benchmark questions more rigorously, only by putting stricter limits on the model under test.

kronks 4 days ago

[dead]