NitpickLawyer 7 hours ago

> gpt-oss that games the benchmarks just for PR.

gpt-oss is killing the ongoing AIMO3 competition on Kaggle. The organizers are using a hidden, new set of IMO-level problems, handcrafted to be "AI hardened". And gpt-oss submissions are at ~33/50 right now, two weeks into the competition. The benchmarks (at least for math) were not gamed at all. These models really are good at math.

lostmsu 6 hours ago

Are they ahead of all other recent open models? Is there a leaderboard?

NitpickLawyer 6 hours ago

There is a leaderboard [1], but we'll have to wait until April, when the competition ends, to know which models the entrants are using. The current number 3 on there (34/50) has mentioned in discussions that they're using gpt-oss-120b. There were also some scores shared for gpt-oss-20b, in the 25/50 range.

The next "public" model is qwen30b-thinking at 23/50.

Competition is limited to 1 H100 (80GB) and 5h runtime for 50 problems. So larger open models (deepseek, larger qwens) don't fit.
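To put that limit in perspective, here is a rough back-of-the-envelope sketch of the average time budget per problem. The 5h and 50 problems come from the rules above; the throughput figure is purely a hypothetical placeholder, not a measured number for any model.

```python
def per_problem_budget_s(total_hours: float, n_problems: int) -> float:
    """Average wall-clock seconds available per problem."""
    return total_hours * 3600 / n_problems


def max_generated_tokens(seconds: float, tokens_per_second: float) -> int:
    """Upper bound on tokens one sequential generation can emit in the given time."""
    return int(seconds * tokens_per_second)


# 5h for 50 problems works out to 6 minutes per problem on average.
budget_s = per_problem_budget_s(total_hours=5, n_problems=50)

# ASSUMPTION: 100 tokens/s is an illustrative throughput, not a benchmark result.
assumed_tps = 100.0
token_cap = max_generated_tokens(budget_s, assumed_tps)

print(f"{budget_s:.0f} s per problem, ~{token_cap} tokens at {assumed_tps:.0f} tok/s")
```

Under that assumed throughput, a single reasoning chain gets a few tens of thousands of tokens per problem, and any sampling of multiple attempts has to share the same window.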

[1] https://www.kaggle.com/competitions/ai-mathematical-olympiad...

data-ottawa 5 hours ago

I find the qwen3 models spend a ton of thinking tokens, which could hamstring them under the runtime limit. gpt-oss-120b is much more focused and steerable there.

The token use chart in the OP release page demonstrates the Qwen issue well.

Token churn does help smaller models on math tasks, but for general-purpose stuff it seems to hurt.