sigmoid10 4 days ago

Big caveat here:

This website's method doesn't work at all for humans the way it works for LLMs. For humans, there is a strict time limit on these IQ tests (at least in officially recognised settings like Mensa). This kind of sequence completion is mostly a question of how fast your brain can iterate on problems. Being able to solve more questions within the time limit means you get a higher score because your brain essentially switches faster. But for LLMs, they just give them all the time in the world in parallel and see how many questions they can solve at all. If you look at the examples, you'll see some high-end models struggling with some of the first questions, which most humans would normally get easily. Only the later ones get hard, where you really have to think through multiple options. So a 100 IQ LLM in here is not technically more intelligent in IQ test questions than 50% of humans.

If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.

abullinan 3 days ago | parent | next [-]

Mensa really needs to be left out of these discussions. It’s not scientific; it is just a money grab for people who need intellectual validation. You can be admitted with a top 10% SAT score and no in-person testing at all. The in-person testing is in three parts: one part is a memory test, the second part is a Mensa test, and the third part is the Wechsler test. Source: I joined in 1995 because I needed intellectual validation. :)

mdp2021 4 days ago | parent | prev | next [-]

But when an LLM fails despite having all the time in the world, you can be pretty certain you have hit a wall.

So, in a way, you have defined a good indicator of a limit in a certain area.

sigmoid10 4 days ago | parent [-]

There is not enough sampling here to reach this conclusion. Remember, you can crank things like o3 pretty high on tasks like ARC AGI if you're willing to spend thousands of dollars on inference time compute. But that's obviously not in the budget for an enthusiast site like this.

mdp2021 4 days ago | parent [-]

Sure, but you wrote:

> If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.

You interpreted "smarter" the IQ way: results under a time constraint. But we actually get an indicator of whether the LLM can reach the result at all, given time. That is the interpretation of "smarter" that many of us need.

(Of course, it remains to be seen whether the ability to achieve those contextual results exports as an ability relevant to the solutions we actually need.)

sigmoid10 4 days ago | parent [-]

No, you misunderstood. I'm saying that for reasoning models, there is a lot of untapped capability in this test. I wouldn't be sure that there are hard limits: given enough compute, I think you'll probably find that a modern high-end model will reach 100%. But you probably don't want to spend thousands (or perhaps tens of thousands) of dollars on that. There are much better tests out there if you have money to burn and want to find true hard limits compared to humans.

leopoldj 4 days ago | parent | prev [-]

The point of this is not so much to compare humans with AI, but to compare AI with traditional software development approaches to this domain (IQ tests, in this case). I believe, and I could be wrong, that it will be nearly impossible, or too expensive, to develop deterministic software that beats AI on an IQ test.

nerevarthelame 3 days ago | parent | next [-]

I agree that it's wrong to do so, but the maintainer of this site certainly thinks that the point is to compare humans with AI. He frequently compares the results to human IQ test takers without any sort of caveats: "Now o3 scores an IQ of 116, putting it in the top 15% of humans. The median Maximum Truth reader, for comparison, scored 104." [0]

0: https://www.maximumtruth.org/p/skyrocketing-ai-intelligence-...
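
For reference, the "top 15%" figure in that quote matches the usual convention of treating IQ as normally distributed with mean 100 and standard deviation 15. That convention is an assumption here, since this thread doesn't say how the site norms its scores. A minimal sketch of the conversion (the function name `iq_to_top_fraction` is made up for illustration):

```python
from statistics import NormalDist


def iq_to_top_fraction(iq: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Fraction of the reference population scoring at or above `iq`.

    Assumes scores follow a normal distribution with the given mean and
    standard deviation (100 and 15 by default; this is an assumption,
    not something the site documents in this thread).
    """
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(iq)


if __name__ == "__main__":
    # Scores mentioned in the thread: the median (100), the quoted
    # reader median (104), and the quoted o3 score (116).
    for score in (100, 104, 116):
        print(f"IQ {score}: top {iq_to_top_fraction(score):.1%}")
```

Under those assumptions, 116 comes out at roughly the top 14%, so the quoted number is fine as arithmetic. It says nothing about the timing caveat raised upthread.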

spiderxxxx 3 days ago | parent | prev [-]

That's not even the point. Also, IQ tests are normalized against individuals in the same age group. If they're comparing them to people, then which age group are they comparing with? Also, the tests are timed, so IQ is more a measure of how quickly something can be figured out, which really doesn't apply to computers. The whole idea that you can apply an IQ score to an LLM is ridiculous.