HardCodedBias 5 hours ago

If you believe another thread, the benchmarks are comparing Gemini-3 (probably with thinking) to GPT-5.1 without thinking.

The person also claims that with thinking on the gap narrows considerably.

We'll probably have 3rd party benchmarks in a couple of days.

iamdelirium 5 hours ago

It's easy to show that the numbers are for GPT-5.1 with thinking set to high.

Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard