Universal Reasoning Model (53.8% pass@1 on ARC-1 and 16.0% on ARC-2) (arxiv.org)
62 points by marojejian 7 hours ago | 6 comments
marojejian 7 hours ago | parent | next [-]

Sounds like a further improvement in the spirit of HRM & TRM models.

Decent comment via x: https://x.com/r0ck3t23/status/2002383378566303745

I continue to be fascinated by these architectures that:

- Build recurrence / inference scaling into transformers more natively.

- Don't use full recurrent gradient traces, and succeed not just despite, but because of that (rough sketch of that gradient trick below).
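
For what it's worth, here's a minimal PyTorch sketch of the truncated-gradient idea HRM/TRM-style models use: run the shared block many times, but only backpropagate through the final step. The names (RecurrentRefiner, n_steps) are illustrative, not from the paper.

  import torch
  import torch.nn as nn

  class RecurrentRefiner(nn.Module):
      # One shared block applied repeatedly; only the last step carries gradients.
      def __init__(self, block: nn.Module, n_steps: int = 8):
          super().__init__()
          self.block = block        # e.g. a single transformer layer
          self.n_steps = n_steps

      def forward(self, z: torch.Tensor) -> torch.Tensor:
          # Run most refinement steps without tracking gradients...
          with torch.no_grad():
              for _ in range(self.n_steps - 1):
                  z = self.block(z)
          # ...then backprop only through the final step (a 1-step
          # approximation instead of a full recurrent gradient trace).
          return self.block(z)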

Moosdijk 4 hours ago | parent | prev | next [-]

Interesting. Instead of running the model once (flash) or multiple times (thinking/pro) in its entirety, this approach seems to apply the same principle within one run, looping back internally.

Instead of big models that “brute force” the right answer by knowing a lot of possible outcomes, this model seems to come to results with less knowledge but more wisdom.

Kind of like having a database of most possible frames in a video game and blending between them instead of rendering the scene.

omneity 2 hours ago | parent [-]

Isn’t this in a sense an RNN built out of a slice of an LLM? If so, it might have the same drawbacks, namely slowness to train, but also benefits such as an endless context window (in theory).

ctoa 24 minutes ago | parent [-]

It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as for a transformer with n layers.

The notion of a context window applies to the sequence, and this doesn't really change that: each iteration sees and attends over the whole sequence (rough sketch below).
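
A toy PyTorch sketch of what I mean by "a transformer with shared layer weights": one layer reused n times, each pass attending over the full sequence. The sizes and names are made up, not the paper's architecture.

  import torch
  import torch.nn as nn

  # A weight-tied loop: one layer reused n times. Each iteration attends
  # over the full sequence, so n steps cost about the same as an n-layer
  # transformer.
  layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)

  def recurrent_forward(x: torch.Tensor, n_steps: int) -> torch.Tensor:
      h = x                                # x: (batch, seq_len, d_model)
      for _ in range(n_steps):
          h = layer(h)                     # same weights every iteration
      return h

  x = torch.randn(2, 128, 256)             # the whole sequence is visible each step
  out = recurrent_forward(x, n_steps=12)   # ~ a 12-layer transformer with tied weights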

mysterEFrank 2 hours ago | parent | prev [-]

I'm surprised more attention isn't paid to this research direction, and that nobody has tried to generalize it, for example by combining the recurrence concept with next-token prediction. That said, despite the considerable gains, this seems to be just hyperparameter tweaking rather than a foundational improvement.
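
To illustrate the kind of combination I mean, here's a purely hypothetical PyTorch sketch (not anything from the paper): a weight-shared core refined several times per forward pass, then projected to next-token logits. All names (RecurrentTokenLM, n_refine) are made up.

  import torch
  import torch.nn as nn

  class RecurrentTokenLM(nn.Module):
      # Illustrative only: a weight-shared core refined several times per
      # forward pass, then projected to next-token logits.
      def __init__(self, vocab_size: int, d_model: int = 256, n_refine: int = 8):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, d_model)
          self.core = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
          self.head = nn.Linear(d_model, vocab_size)
          self.n_refine = n_refine

      def forward(self, tokens: torch.Tensor) -> torch.Tensor:
          # Causal mask so each position only attends to earlier tokens.
          mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
          h = self.embed(tokens)
          for _ in range(self.n_refine):   # inner recurrence, shared weights
              h = self.core(h, src_mask=mask)
          return self.head(h)              # logits used for next-token prediction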

whiplash451 2 hours ago | parent [-]

Not just hyperparameter tweaking. Not foundational research either. Rather, engineering improvements that compound with each other (conswiglu layers, the Muon optimizer).
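
For reference, the standard SwiGLU feed-forward block looks roughly like this; whatever variant the paper's "conswiglu" layers refer to may differ, this is just the plain form.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SwiGLU(nn.Module):
      # Standard SwiGLU feed-forward block: gated SiLU, then down-projection.
      def __init__(self, d_model: int, d_hidden: int):
          super().__init__()
          self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
          self.w_up = nn.Linear(d_model, d_hidden, bias=False)
          self.w_down = nn.Linear(d_hidden, d_model, bias=False)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))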