malcontented 3 days ago

I appreciate the connections with neurology, and the paper itself doesn't ring any alarm bells. I don't think I'd reject it if it fell to me to peer review.

However, I have extreme skepticism when it comes to the applicability of this finding. Based on what they have written, they seem to have created a universal (or at the very least adaptable) constraint-satisfaction solver that learns the rules of the problem from a small number of examples. If true (I have not yet had the leisure to replicate their examples and try them on something else), this is pretty cool, but I do not understand the comparison with CoT models.

CoT models can, in principle, solve _any_ complex task. This has to be trained on a specific puzzle, which it can then solve: it makes no pretense to universality. It isn't even clear that it is meant to be capable of adapting to any given puzzle; I suspect it is not, based just on what I have read in the paper and on the indicative choice of examples they tested it against.

This is kind of like claiming that Stockfish is way smarter than current state of the art LLMs because it can beat the stuffing out of them in chess.

I feel the authors have a good idea here, but that they have marketed it a bit too... generously.

jurgenaut23 3 days ago | parent | next [-]

Yes, I agree, but this is a huge deal in and of itself. I suppose the authors had to frame it in this way for obvious reasons of hype surfing, but this is an amazing achievement, especially given the small size of the model! I'd rather use a customized model for a specific problem than a supposedly "generally intelligent" model that burns orders of magnitude more energy for much less reliability.

JBits 3 days ago | parent | prev | next [-]

> CoT models can, in principle, solve _any_ complex task.

What is the justification for this? Is there a mathematical proof? To me, CoT seems like a hack to work around the severe limitations of current LLMs.

malcontented 3 days ago | parent | next [-]

That's a fair argument to make. I should perhaps have written "are supposed to be able to solve," or "have become famous for their apparent ability to solve loosely specified, arbitrary problems."

CoT _is,_ in my mind at least, a hack bolted onto LLMs to create some loose approximation of reasoning. When I read the paper I expected to see a better hack, but could not find anything on how you take this architecture, interesting though it is, and put it to use in a way similar to CoT. The whole paper seems to make a wild pivot from the fully general biomimetic grandeur of its first half to the narrow effectiveness of its second half.

liamnorm 3 days ago | parent | prev [-]

The Universal Approximation Theorem.

JBits 3 days ago | parent [-]

I don't see how that changes anything. By this logic, there's no need for CoT reasoning at all, as a single pass should be sufficient. I don't see how that proves that CoT increases capabilities.

bubblyworld 3 days ago | parent | prev [-]

> CoT models can, in principle, solve _any_ complex task.

The authors explicitly discuss the expressive power of transformers and CoT in the introduction. They can only solve problems in a fairly restrictive complexity class (lower than PTIME!) - it's one of the theoretical motivations for the new architecture.

"The fixed depth of standard Transformers places them in computational complexity classes such as AC0 [...]"

This architecture, by contrast, is recurrent, with inference time controlled by the model itself (there's a small Q-learning-based subnetwork that decides when to halt as it "thinks"), so there's no such limitation.
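
To make the shape of that concrete, here's a minimal toy sketch (my own code, not the authors'; the class and layer names are illustrative assumptions) of a recurrent loop whose halting step is chosen by a small Q-head:

    import torch
    import torch.nn as nn

    class RecurrentReasoner(nn.Module):
        def __init__(self, dim, max_steps=16):
            super().__init__()
            self.cell = nn.GRUCell(dim, dim)    # stand-in for the paper's recurrent core
            self.q_head = nn.Linear(dim, 2)     # Q-values for {continue, halt}
            self.readout = nn.Linear(dim, dim)  # maps the final state to an output
            self.max_steps = max_steps

        def forward(self, x):                   # x: (1, dim), batch of one for simplicity
            h = torch.zeros_like(x)
            for step in range(1, self.max_steps + 1):
                h = self.cell(x, h)             # one "thinking" step
                q = self.q_head(h)              # estimated value of continuing vs. halting
                if q.argmax(dim=-1).item() == 1:
                    break                       # the model decides it has thought enough
            return self.readout(h), step

    model = RecurrentReasoner(dim=64)
    y, n_steps = model(torch.randn(1, 64))      # n_steps varies per input

The point is just that compute per input is no longer a constant baked into the architecture's depth.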

The main meat of the paper is describing how to train this architecture efficiently, as that has historically been the issue with recurrent nets.
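
Their actual recipe is more involved than I can do justice to here, but the general flavour of avoiding full backpropagation-through-time looks something like this (again a rough sketch under my own assumptions, reusing the toy model above): roll the recurrence forward without building an autograd graph, then backpropagate through only the last step.

    def train_step(model, x, target, loss_fn, optimizer, n_steps=16):
        h = torch.zeros_like(x)
        with torch.no_grad():          # cheap rollout, no autograd graph kept
            for _ in range(n_steps - 1):
                h = model.cell(x, h)
        h = model.cell(x, h)           # gradients flow through this final step only
        loss = loss_fn(model.readout(h), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Memory and compute per update stay roughly constant in the number of recurrent steps, which is the kind of property you need for deep recurrence to be practical.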

malcontented 3 days ago | parent [-]

Agreed regarding the limited computational power of CoT LLMs, and this solution certainly has much more flexibility. But is there a reason to believe that this architecture (and training method) is as applicable to the development of generally-capable models as it is to the solution of individual puzzles?

Don't get me wrong, this is a cool development, and I would love to see how this architecture behaves on a constraint-based problem that isn't easily tractable via traditional algorithms.

bubblyworld 3 days ago | parent [-]

The ARC-1 problem set that they benchmark on is an example of such a problem, I believe. It's still more or less completely unsolved. They don't solve it either, mind, but they achieve very competitive results with their tiny (27M-parameter) model. Competitive with architectures that use extensive pretraining and hundreds of billions of parameters!

That's one of the things that sticks out for me about the paper. Having tried very hard to solve ARC myself, I find what they're claiming to have done here pretty insane.

(I think a lot of the sceptics in this thread are unaware of just how difficult ARC-1 is, and are focusing on the Sudoku part, which I agree is much simpler and less surprising for the model to do well on.)