cs702 4 days ago

Based on a quick first skim of the abstract and the introduction, the results from the hierarchical reasoning model (HRM) look incredible:

> Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs. For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 examples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%) and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and context lengths, as shown in Figure 1.

I'm going to read this carefully, in its entirety.

Thank you for sharing it on HN!

lumost 3 days ago | parent | next [-]

I am extremely skeptical of a 27M-parameter model being trained “from scratch” on 1,000 data points. I am likewise incredulous at the lack of comparison with any other model trained “from scratch” using their data preparation. Instead they compare strictly with third-party LLMs, which are massively more general purpose and may not have any of those 1,000 examples in their training set.

This smells like some kind of overfit to me.

cs702 3 days ago | parent [-]

Yeah, the results look incredible indeed. That's why I and many others here have decided to download, review, and test the code published by the authors.[a] If their code doesn't live up to their claims, we will all ignore their work and move on. If their code lives up to their claims, no one can argue with it. In my experience, when authors publish working code, it's usually a good sign.

---

[a] https://github.com/sapientinc/HRM

lumost 3 days ago | parent [-]

Did it work? :)

The architecture is very similar to offset LSTMs, which have been studied extensively. The main difference is the handover of the hidden state, which my naive mind would assume makes optimization substantially more difficult.

cs702 3 days ago | parent [-]

I haven't had a chance to read the preprint carefully or play with the code yet. The best place to follow what's happening is the GitHub repo, specifically the open and closed issues and pull requests.

lumost 2 days ago | parent [-]

I'll wait until some more benchmarks are run in this case. Unlike with traditional software, vetting that a model architecture works better than alternatives is a time- and compute-intensive process. You really can't just download it and "try it out", except with general-purpose models (which this is not).

diwank 4 days ago | parent | prev [-]

Exactly!

> It uses two interdependent recurrent modules: a *high-level module* for abstract, slow planning and a *low-level module* for rapid, detailed computations. This structure enables HRM to achieve significant computational depth while maintaining training stability and efficiency, even with minimal parameters (27 million) and small datasets (~1,000 examples).

> HRM outperforms state-of-the-art CoT models on challenging benchmarks like Sudoku-Extreme, Maze-Hard, and the Abstraction and Reasoning Corpus (ARC-AGI), where CoT methods fail entirely. For instance, it solves 96% of Sudoku puzzles and achieves 40.3% accuracy on ARC-AGI-2, surpassing larger models like Claude 3.7 and DeepSeek R1.

Erm what? How? Needs a computer and sitting down.
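In the meantime, here's a rough sketch of how I read the two-module recurrence they describe. To be clear, the cell type (GRU), the dimensions, and the step counts below are my guesses for illustration, not the paper's actual architecture:

```python
# Toy sketch of a slow "planner" module driving a fast "worker" module,
# loosely in the spirit of the HRM description. Not the authors' code.
import torch
import torch.nn as nn

class TwoTimescaleRecurrence(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=256, low_steps_per_high_step=4):
        super().__init__()
        # Low-level module: fast, detailed updates, conditioned on the high-level state.
        self.low = nn.GRUCell(input_dim + hidden_dim, hidden_dim)
        # High-level module: slow, abstract updates, consuming the low-level state.
        self.high = nn.GRUCell(hidden_dim, hidden_dim)
        self.low_steps = low_steps_per_high_step

    def forward(self, x, n_high_steps=8):
        batch = x.size(0)
        h_low = x.new_zeros(batch, self.low.hidden_size)
        h_high = x.new_zeros(batch, self.high.hidden_size)
        for _ in range(n_high_steps):
            # Several fast low-level steps per slow high-level step.
            for _ in range(self.low_steps):
                h_low = self.low(torch.cat([x, h_high], dim=-1), h_low)
            # The "handover": the high-level module absorbs the final low-level state.
            h_high = self.high(h_low, h_high)
        return h_high

# e.g. h = TwoTimescaleRecurrence()(torch.randn(2, 128))
```

If that reading is roughly right, the nesting is presumably where the "computational depth with training stability" claim comes from: the high-level state only updates once per several low-level steps.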

cs702 4 days ago | parent | next [-]

Yeah, that was pretty much my reaction. I will need time on a computer too.

The repo is at https://github.com/sapientinc/HRM .

I love it when authors publish working code. It's usually a good sign. If the code does what the authors claim, no one can argue with it!

diwank 4 days ago | parent [-]

Same! Guan’s work on sample packing during finetuning has become a staple. His openchat code is also super simple and easy to understand.
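For anyone unfamiliar, sample packing just means concatenating variable-length examples into fixed-length sequences so finetuning batches waste less compute on padding. A toy sketch of the idea (not Guan's actual implementation) looks something like this:

```python
# Hypothetical illustration of greedy first-fit sample packing.
from typing import List

def pack_samples(samples: List[List[int]], max_len: int = 2048) -> List[List[int]]:
    """Pack token-id lists into concatenated sequences of length <= max_len."""
    packed: List[List[int]] = []
    for sample in sorted(samples, key=len, reverse=True):
        if len(sample) > max_len:
            sample = sample[:max_len]  # truncate oversized examples
        for seq in packed:
            if len(seq) + len(sample) <= max_len:
                seq.extend(sample)     # fits in an existing packed sequence
                break
        else:
            packed.append(list(sample))  # start a new packed sequence
    return packed

# Three short samples pack into one sequence instead of three heavily padded ones.
print(len(pack_samples([[1] * 500, [2] * 700, [3] * 300])))  # -> 1
```

The part a real implementation also has to get right is the attention masking and position-id resets so packed samples don't attend to each other, which is exactly where a clean reference implementation helps.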

mkagenius 4 days ago | parent | prev [-]

Is it talking about fine-tuning existing models with 1,000 examples to beat them on those tasks?