NitpickLawyer 6 hours ago

There is a leaderboard [1], but we'll have to wait until April, when the competition ends, to know what models the top teams are using. The team currently in third place (34/50) has mentioned in the discussion threads that they're using gpt-oss-120b. Some scores were also shared for gpt-oss-20b, in the 25/50 range.

The next "public" model is qwen30b-thinking at 23/50.

The competition is limited to a single H100 (80 GB) and 5 h of runtime for all 50 problems, so the larger open models (DeepSeek, the bigger Qwens) don't fit.
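Rough arithmetic on the fit question (a back-of-envelope sketch; the parameter counts and quantization choices are my assumptions, and KV cache and activations add real overhead on top of the weights):

    # Back-of-envelope weight memory: params * bits / 8, ignoring KV cache.
    def weight_gb(params_b: float, bits: int) -> float:
        return params_b * 1e9 * bits / 8 / 1e9

    for name, params_b, bits in [
        ("gpt-oss-120b (native MXFP4)", 117, 4),
        ("Qwen3-30B (FP8)", 30, 8),
        ("DeepSeek-V3/R1 (4-bit)", 671, 4),
    ]:
        gb = weight_gb(params_b, bits)
        verdict = "fits" if gb < 80 else "does not fit"
        print(f"{name}: ~{gb:.0f} GB of weights -> {verdict} in 80 GB")

gpt-oss-120b comes out around 58 GB of weights, so it squeezes onto one H100 with room for KV cache; a 671B model is hopeless at any reasonable precision.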

[1] https://www.kaggle.com/competitions/ai-mathematical-olympiad...

data-ottawa 5 hours ago

I find the Qwen3 models spend a ton of thinking tokens, which could hamstring them under the runtime limit. gpt-oss-120b is much more focused and steerable there.
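To make "steerable" concrete, here's a minimal sketch of bounding per-problem spend, assuming an OpenAI-compatible local server (the URL, model name, and sampling settings are placeholders). gpt-oss's model card says it takes its reasoning-effort level from the system prompt; max_tokens is the hard ceiling either way. With 5 h for 50 problems you get roughly 6 minutes per problem, so a runaway 30k-token thinking trace is fatal.

    # Sketch: bounding one problem's token spend under the time budget.
    # Assumes an OpenAI-compatible server (e.g. vLLM) at a placeholder URL.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            # gpt-oss reads its effort level from the system prompt.
            {"role": "system", "content": "Reasoning: medium"},
            {"role": "user", "content": "Solve the competition problem here."},
        ],
        max_tokens=8192,   # hard cap so no single problem eats the 5 h budget
        temperature=0.6,
    )
    print(resp.choices[0].message.content)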

The token-usage chart on the release page linked in the OP illustrates the Qwen issue well.

Token churn does seem to help smaller models on math tasks, but for general-purpose work it seems to hurt.