energy123 5 hours ago

> 93.6% (congrats Anthropic)

But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions"

0.191 * 0.594 ≈ 0.113 > 0.064 = 1 - 0.936
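Spelling out the arithmetic behind that inequality (the variable names here are just labels for the figures quoted above, not anything from the article):

```python
# Fractions quoted from the article / benchmark result above.
audited_share = 0.191    # audited problems as a fraction of the whole dataset
flawed_rate = 0.594      # fraction of audited problems with flawed test cases
reported_score = 0.936   # benchmark score being questioned

# Lower bound on the fraction of ALL problems with flawed tests,
# assuming every flawed problem found lies in the audited subset.
lower_bound_flawed = audited_share * flawed_rate   # ≈ 0.113

# The model's reported error rate on the benchmark.
reported_error = 1 - reported_score                # ≈ 0.064

# The commenter's point: the flawed-test floor exceeds the error rate.
assert lower_bound_flawed > reported_error
```

So if the audited subset were representative, a model rejected only by flawed tests would still be expected to "fail" more than 6.4% of problems, which is the tension the comment is pointing at.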

Does this mean that the audited subset wasn't representative? Or that Anthropic is getting a high score through some shady means?

cjsaltlake 5 hours ago | parent [-]

I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.

kator 2 hours ago | parent | next [-]

> models that aren't over-optimized for it.

But how do you know whether the model was over-optimized for it or just really good?

kmdupree an hour ago | parent | prev [-]

I disagree: https://www.philosophicalhacker.com/post/anthropic-error/