cjsaltlake 5 hours ago
I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.
kator 2 hours ago
> models that aren't over-optimized for it.

But how do you know whether the model was over-optimized for it or just really good?
kmdupree an hour ago
I disagree: https://www.philosophicalhacker.com/post/anthropic-error/