▲ | oofbaroomf 3 days ago | |||||||||||||||||||||||||
Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)).[0] OpenAI said they got 69.1% in their blog post. [0] swebench.com/#verified | ||||||||||||||||||||||||||
▲ | georgewsinger 3 days ago | parent | next [-] | |||||||||||||||||||||||||
Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding: > For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results. Arguably this shouldn't be counted though? [1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-... | ||||||||||||||||||||||||||
| ||||||||||||||||||||||||||
▲ | awestroke 3 days ago | parent | prev | next [-] | |||||||||||||||||||||||||
OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt | ||||||||||||||||||||||||||
▲ | swyx 3 days ago | parent | prev [-] | |||||||||||||||||||||||||
they also gave more detail on their SWEBench scaffolding here https://www.latent.space/p/claude-sonnet |