cjsaltlake 5 hours ago

I suggest reading the Mythos report's discussion of SWE-bench and contamination. I think it makes a fairly convincing case that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.

kator 2 hours ago

> models that aren't over-optimized for it.

But how do you know whether the model was over-optimized for it or is just really good?

kmdupree an hour ago

I disagree: https://www.philosophicalhacker.com/post/anthropic-error/