EnPissant 3 hours ago
> 1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison. Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.