bubblyworld 3 days ago

> CoT models can, in principle, solve _any_ complex task.

The authors explicitly discuss the expressive power of transformers and CoT in the introduction. They can only solve problems in a fairly restrictive complexity class (lower than PTIME!) - it's one of the theoretical motivations for the new architecture.

"The fixed depth of standard Transformers places them in computational complexity classes such as AC0 [...]"

This architecture, by contrast, is recurrent, with inference time controlled by the model itself (a small Q-learning-based subnetwork decides when to halt as it "thinks"), so there's no such limitation.
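To make "the model decides when to stop" concrete, here's a rough sketch of the control flow (mine, not the paper's code - the halting head below is just an untrained stand-in for the Q-learning part, and the GRU cell is a placeholder for the actual recurrent block):

    # Illustration only: a recurrent core refines a hidden state, and a small
    # halting head scores [continue, halt] at each step. The paper trains that
    # head with Q-learning; here it is untrained and only shows the control flow.
    import torch
    import torch.nn as nn

    class RecurrentReasoner(nn.Module):
        def __init__(self, dim=128, max_steps=16):
            super().__init__()
            self.core = nn.GRUCell(dim, dim)    # stand-in for the recurrent block
            self.halt_head = nn.Linear(dim, 2)  # Q-values for [continue, halt]
            self.max_steps = max_steps

        def forward(self, x):
            h = torch.zeros_like(x)
            for _ in range(self.max_steps):
                h = self.core(x, h)             # one "thinking" step
                q = self.halt_head(h)
                if q.argmax(dim=-1).all():      # every item in the batch votes "halt"
                    break
            return h

    model = RecurrentReasoner()
    out = model(torch.randn(4, 128))  # compute depth now varies per input

The point is just that the effective depth is chosen at inference time rather than fixed by the architecture.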

The main meat of the paper is describing how to train this architecture efficiently, as that has historically been the issue with recurrent nets.
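The pain point, roughly, is that naive backprop-through-time has to keep every intermediate state and its gradient path grows with the number of steps. A sketch of one common workaround (my own illustration, not necessarily what the paper does): run most refinement steps without gradients and only backprop through the final one.

    import torch

    def truncated_unroll(step_fn, x, h, n_steps):
        with torch.no_grad():          # forward-only: memory stays constant in n_steps
            for _ in range(n_steps - 1):
                h = step_fn(x, h)
        return step_fn(x, h)           # only this last step is on the autograd tape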

malcontented 3 days ago | parent

Agreed regarding the computational limits of CoT LLMs, and that this approach certainly has much more flexibility. But is there a reason to believe that this architecture (and training method) is as applicable to developing generally capable models as it is to solving individual puzzles?

Don't get me wrong, this is a cool development, and I would love to see how this architecture behaves on a constraint-based problem that's not easily tractable via a traditional algorithm.

bubblyworld 3 days ago | parent

The ARC-1 problem set that they benchmark on is an example of such a problem, I believe. It's still more-or-less completely unsolved. They don't solve it either, mind, but they achieve very competitive results with their tiny (27M-parameter) model. Competitive with architectures that use extensive pretraining and hundreds of billions of parameters!

That's one of the things that sticks out for me about the paper. Having tried very hard myself to solve ARC, I find what they're claiming to have done here pretty insane.

(I think a lot of the sceptics in this thread are unaware of just how difficult ARC-1 is, and are focusing on the sudoku part, which I agree is much simpler and much less surprising for them to do well on.)