lispitillo 4 days ago

I hope/fear this HRM model is going to be merged with MoE very soon. Given the huge economic pressure to develop powerful LLMs, I think this could be done in just a month.

The paper seems to study only problems like Sudoku solving, not question answering or other LLM applications. Furthermore, they omit a section on future applications or on fusion with current LLMs.

I think anyone working in this field can envision the applications, but the details of combining MoE with an HRM model could be their next paper.

I only skimmed the paper and I am not an expert; surely others can explain why they don't discuss such a new structure. Anyway, my post is just blissful ignorance of the complexity involved and of how impossible it is to predict change.

Edit: A more general idea is that Mixture of Experts relates to clusters of concepts, and now we would have to consider clusters of concepts grouped by the time they take to be grasped. In a sense, the model would then hold in latent space an estimate of the depth, number of layers, and time required for each concept, just as we adapt our reading style between a dense math book and a short newspaper story.
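
To make that concrete, here is a toy sketch of what I mean (my own illustration, nothing from the paper; the class name, sizes, and the hard routing are all made up): a gate picks an expert per input and also predicts a per-input "thinking depth", i.e. how many refinement steps that expert runs.

    import torch
    import torch.nn as nn

    class DepthAwareRouter(nn.Module):
        # Toy illustration only: route each input to one small expert and also
        # predict how many refinement steps that expert should run.
        def __init__(self, dim=64, n_experts=4, max_steps=16):
            super().__init__()
            self.gate = nn.Linear(dim, n_experts)   # which concept cluster
            self.depth = nn.Linear(dim, 1)          # how "deep" the concept is
            self.experts = nn.ModuleList([nn.GRUCell(dim, dim) for _ in range(n_experts)])
            self.max_steps = max_steps

        def forward(self, x):                       # x: (batch, dim)
            expert_idx = self.gate(x).argmax(dim=-1)    # hard routing, for simplicity
            n_steps = (torch.sigmoid(self.depth(x)) * self.max_steps).round().long()
            h = torch.zeros_like(x)
            for i in range(x.shape[0]):             # per-sample loop: clarity over speed
                expert = self.experts[int(expert_idx[i])]
                xi, hi = x[i:i + 1], h[i:i + 1]
                for _ in range(int(n_steps[i])):    # "deeper" concepts get more steps
                    hi = expert(xi, hi)
                h[i:i + 1] = hi
            return h

A real MoE would of course use a differentiable gate and batched experts; this only shows the "depth per concept" idea.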

yorwba 4 days ago | parent | next [-]

This HRM is essentially purpose-designed for solving puzzles with a small number of rules interacting in complex ways. Because the number of rules is small, a small model can learn them. Because the model is small, it can be run many times in a loop to resolve all interactions.
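 
A rough sketch of that "small model run many times" shape (my own simplification, not the paper's exact two-timescale architecture; names and sizes are invented):

    import torch
    import torch.nn as nn

    class TinyIterativeSolver(nn.Module):
        # A small module whose weights are reused every step; depth comes from
        # the loop, not from stacking layers.
        def __init__(self, dim=128, n_steps=64):
            super().__init__()
            self.step = nn.GRUCell(dim, dim)
            self.readout = nn.Linear(dim, dim)
            self.n_steps = n_steps

        def forward(self, puzzle_embedding):        # (batch, dim)
            h = torch.zeros_like(puzzle_embedding)
            for _ in range(self.n_steps):           # many cheap steps to propagate the rules
                h = self.step(puzzle_embedding, h)
            return self.readout(h)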

In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other, so I don't think you could ever get away with a similarly small model. Fortunately, a comparatively small number of steps typically seems to be enough to get decent results.

But if you tried to use an LLM-sized model in an HRM-style loop, it would be dog slow, so I don't expect anyone to try it anytime soon. Certainly not within a month.

Maybe you could have a hybrid where an LLM has a smaller HRM bolted on to solve the occasional constraint-satisfaction task.

marcosdumay 3 days ago | parent | next [-]

> In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other

A person has a vocabulary of some ~10k words, with words fitting specific slots in a really small set of rules. All combined, we probably have something on the order of a few million rules in a language.

That is, yes, larger than what the model in this paper can handle. But it's nowhere near large enough to require something the size of a modern LLM. So it's well worth trying to enlarge models with other architectures, trying hybrid models (note that this one is necessarily hybrid already), and exploring every other possibility out there.
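
Rough arithmetic behind the "few million rules" figure above (the per-word pattern count is purely my own guess):

    vocabulary = 10_000        # active vocabulary of a typical speaker
    patterns_per_word = 300    # guessed: syntactic/collocational slots a word takes part in
    print(f"{vocabulary * patterns_per_word:,}")   # 3,000,000 -- a few million word-pattern pairs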

energy123 4 days ago | parent | prev [-]

What about many small HRM models that solve conceptually distinct subtasks, with a master model deciding the routing and then analyzing and aggregating the outputs, all of it learned during training?
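
Something with roughly this shape, as a hand-wavy sketch (every name here is hypothetical, nothing from the paper):

    def solve(task, master, subsolvers):
        # master.split: break the task into conceptually distinct subtasks
        # master.route: pick which small HRM-style solver handles each subtask
        # master.aggregate: combine the partial solutions into one answer
        subtasks = master.split(task)
        partials = [subsolvers[master.route(s)](s) for s in subtasks]
        return master.aggregate(task, partials)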

buster 4 days ago | parent | prev [-]

I must say I am suspicious in this regard, as they don't show applications other than a Sudoku solver and don't discuss downsides.

Oras 4 days ago | parent [-]

And the training was only on Sudoku, which means they would need to train a small model for every problem that currently exists.

Back to ML models?

JBits 3 days ago | parent | next [-]

I would assume that training an LLM would be infeasible for a small research lab, so isn't tackling small problems like this unavoidable? Given that current LLMs have clear limitations, I can't think of anything better than developing better architectures on small test cases; then a company can try scaling them later.

lispitillo 4 days ago | parent | prev [-]

Not only on Sudoku: there is also maze solving and ARC-AGI.