▲ | yorwba 5 days ago | |
The "Verified" part of "SWE-Bench Verified" means that there was plain "SWE-Bench" before it, which had actually not been verified at all and included a lot of tasks that didn't really make sense for use as a benchmark: https://openai.com/index/introducing-swe-bench-verified/#ada... Data contamination stemming from the fact that it's based on already-solved problems in public repositories is a different issue that cannot be addressed by verifying the benchmark questions harder, but only by putting stricter limits on the model under test. | ||
▲ | kronks 4 days ago | parent [-] | |
[dead] |