stared 4 days ago

My take is that it’s easier to train a model to ace short, low-context tasks like IQ tests. That doesn’t necessarily transfer to more complex reasoning. While on the Mensa Norway test GPT-5 gets over 140, on an offline test it goes down to ~120.

It is interesting to look at the political spectrum as well (https://www.trackingai.org/political-test) - all are liberals, even Grok 4. The political leaning isn’t surprising either. Mainstream models need to be broadly acceptable, which in practice means being respectful of all groups. An authoritarian right-wing model might work for one country, group, or religion, but would almost certainly be offensive elsewhere.

eqvinox 4 days ago | parent [-]

> While on the Mensa Norway test GPT-5 gets over 一四, on an offline test it goes down to ~一二.

Since IQ tests are fundamentally timed, those numbers are meaningless to compare with human numbers. Or maybe dangerous since it's hard to de-context them even if you know that. Hence my cheeky 漢字.

(Yes they might be useful to compare LLMs with each other, but that is outstripped by the risk of misreading it against what we know as "IQ".)