Tiberium 2 days ago

The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png

varenc 2 days ago | parent [-]

100% on the AIME (assuming it's not in the training data) is pretty impressive. I got like 4/15 when I was in HS...

hellojimbo 2 days ago | parent [-]

The no-tools part is what's impressive; with tools, every model gets 100%.

varenc 2 days ago | parent [-]

If I recall, the AIME answers are always 4-digit numbers. And most of the problems are of the type where, if you have a candidate number, it's reasonable to validate its correctness. So it's easy to brute-force all 4-digit ints with code.

tl;dr: humans would do much better too if they could use programming tools :)
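
To make the brute-force idea concrete, here is a minimal sketch. It assumes you can write a checker for a candidate answer; the puzzle below is a made-up toy, not a real AIME problem, and the names `brute_force` and `toy_check` are hypothetical. Official AIME answers are integers from 0 to 999, so the search space is even smaller than all 4-digit ints.

```python
from typing import Callable, List


def brute_force(is_valid: Callable[[int], bool], lo: int = 0, hi: int = 999) -> List[int]:
    """Return every integer in [lo, hi] that passes the checker."""
    return [n for n in range(lo, hi + 1) if is_valid(n)]


def toy_check(n: int) -> bool:
    """Toy puzzle (an assumption for illustration): n leaves remainder
    3 mod 7, 5 mod 11, and 9 mod 13."""
    return n % 7 == 3 and n % 11 == 5 and n % 13 == 9


# Try every possible answer in the AIME range and keep the ones that validate.
candidates = brute_force(toy_check)
print(candidates)
```

This only works when checking a candidate is much cheaper than deriving the answer, which is the assumption the comment above relies on.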

Davidzheng 2 days ago | parent [-]

Uh, no. When a model uses tools, it's not solving the problems by looping over 4-digit numbers.