mdp2021 4 days ago

But when an LLM fails despite having all the time in the world, you can be fairly certain it has hit a wall.

So, in a way, you have defined a good indicator of a limit in a certain area.

There is not enough sampling here to reach this conclusion. Remember, you can crank models like o3 pretty high on tasks like ARC-AGI if you're willing to spend thousands of dollars on inference-time compute. But that's obviously not in the budget for an enthusiast site like this.

mdp2021 4 days ago

Sure, but you wrote:

> If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.

You interpreted "smarter" the IQ way: results under constrained time. But what we actually get is an indicator of whether the LLM can reach the result at all, given enough time. That is the interpretation of "smarter" that many of us need.

(Of course, it remains to be seen whether the ability to achieve those contextual results exports as an ability relevant to the solutions we actually need.)

sigmoid10 4 days ago

No, you misunderstood. I'm saying that for reasoning models, there is a lot of untapped capability in this test. I wouldn't be sure that there are hard limits here: given enough compute, I think you'll probably find that a modern high-end model reaches 100%. But you probably don't want to spend thousands (or perhaps tens of thousands) of dollars on that. There are much better tests out there if you have money to burn and want to find true hard limits compared to humans.