gpm 10 hours ago
Curiously, Opus 4.7 claims an 87.6% pass rate and Mythos claims a 93.9% pass rate, leading to the conclusion that it's actually possible to "solve" the problems that OpenAI claims are incorrect.
jmalicki 9 hours ago
Part of the issue they mention is contamination: the tests are in the training data. The other issue they mention is tasks being over-constrained relative to what was asked for, such as tests requiring specific class or function names that were never part of the specification (sketch below). It may be that, even where tasks aren't contaminated, Claude is simply better at predicting what names a given repository would use. This fits my experience using it on a number of projects with very different styles; I've found it to be good at "when in Rome". That's a laudable trait, but it's also not what SWE-bench claims to be measuring.
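To make the name-matching failure mode concrete, here's a hypothetical sketch; every name in it is invented, not from any real SWE-bench task. The patch solves the issue as written, but the hidden test looks up a helper by the exact name the original repo happened to use:

    # Hypothetical illustration; all names are invented. The model's patch
    # has correct behavior but a "wrong" name, so the run is scored a fail.
    def format_duration(seconds: int) -> str:
        """The model's solution: right logic, unspecified name."""
        return f"{seconds // 60}m {seconds % 60}s"

    expected_name = "_format_duration"  # the exact name the hidden test imports
    solver = globals().get(expected_name)
    print("pass" if solver is not None else "fail: name mismatch, not a logic error")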
cjsaltlake 7 hours ago
If you read the Mythos report, in which they discuss and account for contamination at length, it still suggests that performance on SWE-bench Verified is meaningful. Benchmarks, including SWE-bench, can absolutely be gamed, but if you're not explicitly benchmaxxing, improving on SWE-bench still measures model improvement, at least up to the level of Mythos.
2ndorderthought 10 hours ago
Or that Opus and Mythos are training on the data somehow, such that their solutions are incorrectly right. Or that OpenAI is lying or wrong. Or that all of these companies are cheating so much it doesn't really matter and never did.
MattRix 9 hours ago
The problem isn't that the tasks are impossible to solve, it's that they're underspecified and/or impossible to solve consistently (e.g. because a test expects the solution function to have a specific name that wasn't specified in the task itself). So maybe Anthropic runs Mythos through the benchmark 10,000 times and takes the highest score, who knows?
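Back-of-the-envelope on the rerun point: if an underspecified task passes any single attempt with probability p, then best-of-N passes with probability 1 - (1-p)^N, so reruns turn flakiness into near-certain passes. The p value here is made up for illustration, not a measured number:

    # Sketch: how best-of-N reruns inflate a flaky benchmark score.
    # p = 0.3 is an assumed single-run pass rate for an underspecified task.
    p = 0.3
    for n in (1, 10, 100, 10000):
        print(f"best of {n}: {1 - (1 - p) ** n:.4f}")

Even at p = 0.3, best-of-10 already passes about 97% of the time.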