oofbaroomf a day ago

Nice to see that Sonnet performs worse than o3 on AIME but better on SWE-Bench. Often, it's easy to optimize math capabilities with RL but much harder to crack software engineering. Good to see what Anthropic is focusing on.

j_maffe a day ago | parent [-]

That's a very contentious opinion you're stating there. I'd say LLMs have surpassed a larger percentage of SWEs in capability than they have mathematicians.

oofbaroomf a day ago | parent [-]

Mathematicians don't do high school math competitions - the benchmark in question is AIME.

Mathematicians generally do novel research, which is hard to optimize for. Benchmarks like LiveCodeBench (leetcode-style problems), AIME, and MATH (similar to AIME) are often chosen by companies to flex their models' capabilities, even if those models don't perform nearly as well on the things real mathematicians and real software engineers actually do.

j_maffe a day ago | parent [-]

Ok, then you should clarify that you meant math benchmarks, not math capabilities.