ofirpress 5 hours ago

I'm a co-creator of SWE-bench:

1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open-source in the next month) are still unsaturated.

3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :)

energy123 5 hours ago | parent | next [-]

> 93.9% (congrats Anthropic)

But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions"

0.191 * 0.594 ≈ 0.113 > 0.061 = 1 - 0.939

Does this mean that the audited subset wasn't representative? Or that Anthropic is achieving high scores through some shady means?

cjsaltlake 5 hours ago | parent [-]

I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.

kator 2 hours ago | parent | next [-]

> models that aren't over-optimized for it.

But how do you know whether the model was over-optimized for it or just really good?

kmdupree an hour ago | parent | prev [-]

I disagree: https://www.philosophicalhacker.com/post/anthropic-error/

kator 2 hours ago | parent | prev | next [-]

Those who fail to study history (or live through it) are doomed to repeat it.

SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.

I don't have a solution; I'm just noticing the pattern.

akavel an hour ago | parent | prev | next [-]

Also, in the meantime, there's https://SWE-rebench.com as a nice riff on SWE-bench, as far as I understand.

Bombthecat 5 hours ago | parent | prev | next [-]

Both of them look pretty old?

cjsaltlake 5 hours ago | parent [-]

CodeClash, I think, would be quite hard to game or contaminate unintentionally, considering that models need to compete against one another.

gertlabs 4 hours ago | parent | next [-]

https://gertlabs.com already does this at scale.

An industry-standard benchmark shouldn't be hosted or designed by a lab producing the models, regardless.

Bombthecat 5 hours ago | parent | prev [-]

I mean the data / benchmarks

EnPissant 3 hours ago | parent | prev | next [-]

> 1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.

Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.

dominotw 2 hours ago | parent | prev | next [-]

How hard is it to create one of these for my company, one that models most of the work we do?

irthomasthomas an hour ago | parent [-]

Just point an agent at your LLM logs and ask it to generate a dataset of questions and answers from the problems you've already solved.
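
A minimal sketch of the kind of extraction such an agent would do, assuming your logs are JSONL; the "prompt"/"response"/"resolved" field names and the paths here are placeholders, not any real log format:

    import json
    import pathlib

    # Placeholder paths and field names; adjust to your actual log schema.
    LOG_DIR = pathlib.Path("llm_logs")            # one JSONL file per session
    OUT_PATH = pathlib.Path("company_bench.jsonl")

    examples = []
    for log_file in sorted(LOG_DIR.glob("*.jsonl")):
        for line in log_file.read_text().splitlines():
            record = json.loads(line)
            # Keep only exchanges you've marked as actually solved.
            if record.get("resolved"):
                examples.append({
                    "question": record["prompt"],
                    "reference_answer": record["response"],
                })

    with OUT_PATH.open("w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

    print(f"Wrote {len(examples)} benchmark items to {OUT_PATH}")

The hard part isn't collecting the pairs, it's scoring: for anything beyond exact-match answers you'd want executable checks, which is what makes SWE-bench-style benchmarks expensive to build.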
