cs702 | 4 days ago
Based on a quick first skim of the abstract and the introduction, the results from the Hierarchical Reasoning Model (HRM) look incredible:

> Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs. For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge - a benchmark of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 examples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%) and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and context lengths, as shown in Figure 1.

I'm going to read this carefully, in its entirety. Thank you for sharing it on HN!
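For anyone wondering what a "30x30 grid context (900 tokens)" means in practice, my read is that each grid cell becomes one token and the board is flattened row by row. A toy sketch of that encoding (the padding scheme and color shift are my guesses, not the paper's):

    import numpy as np

    PAD, MAX = 0, 30

    def grid_to_tokens(grid):
        """Flatten a small ARC-style grid (cell values 0-9) into a
        fixed 900-token sequence, padding out to 30x30."""
        g = np.full((MAX, MAX), PAD, dtype=np.int64)
        h, w = len(grid), len(grid[0])
        g[:h, :w] = np.asarray(grid) + 1  # shift colors so 0 can be PAD
        return g.flatten()                # shape (900,)

    tokens = grid_to_tokens([[1, 2], [3, 4]])
    print(tokens.shape)  # (900,)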
lumost | 3 days ago
I am extremely skeptical of a 27M-parameter model being trained “from scratch” on 1,000 datapoints. I am likewise incredulous at the lack of comparison with any other model trained “from scratch” using their data preparation. Instead, they compare strictly against third-party LLMs, which are massively more general-purpose and may not have any of those 1,000 examples in their training sets. This smells like some kind of overfitting to me.
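The control experiment I'd want to see is simple: train a plain transformer of comparable size from scratch on the same ~1,000 examples with the same data preparation, and report accuracy on a held-out split. Roughly like this (all shapes, sizes, and hyperparameters here are made up for illustration; the point is the train/test separation):

    import torch
    import torch.nn as nn

    VOCAB, SEQ_LEN = 11, 900                 # e.g. 10 cell values + pad
    train_x = torch.randint(0, VOCAB, (800, SEQ_LEN))  # stand-in data
    train_y = torch.randint(0, VOCAB, (800, SEQ_LEN))
    test_x  = torch.randint(0, VOCAB, (200, SEQ_LEN))
    test_y  = torch.randint(0, VOCAB, (200, SEQ_LEN))

    class TinyTransformer(nn.Module):
        def __init__(self, dim=128, layers=2, heads=4):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, dim)
            self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, dim))
            enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.body = nn.TransformerEncoder(enc, layers)
            self.head = nn.Linear(dim, VOCAB)

        def forward(self, x):
            return self.head(self.body(self.embed(x) + self.pos))

    model = TinyTransformer()
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(20):                   # toy training loop
        logits = model(train_x[:32])
        loss = loss_fn(logits.reshape(-1, VOCAB), train_y[:32].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                    # the number that matters:
        pred = model(test_x).argmax(-1)      # accuracy on *unseen* puzzles
        print((pred == test_y).float().mean().item())

If a vanilla baseline trained this way also gets surprisingly far, the headline comparison against third-party LLMs tells us much less than the paper implies.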
diwank | 4 days ago
Exactly!

> It uses two interdependent recurrent modules: a *high-level module* for abstract, slow planning and a *low-level module* for rapid, detailed computations. This structure enables HRM to achieve significant computational depth while maintaining training stability and efficiency, even with minimal parameters (27 million) and small datasets (~1,000 examples).

> HRM outperforms state-of-the-art CoT models on challenging benchmarks like Sudoku-Extreme, Maze-Hard, and the Abstraction and Reasoning Corpus (ARC-AGI), where CoT methods fail entirely. For instance, it solves 96% of Sudoku puzzles and achieves 40.3% accuracy on ARC-AGI, surpassing larger models like Claude 3.7 and DeepSeek R1.

Erm, what? How? This needs a computer and a sit-down.
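As far as I can tell from the quoted description, the core loop is just two recurrent states updating at different rates: the fast module iterates several steps conditioned on the slow module's state, then the slow module updates once from the result. A minimal toy version of that idea (module choice, naming, and step counts are mine, definitely not the paper's actual implementation):

    import torch
    import torch.nn as nn

    class TwoTimescaleReasoner(nn.Module):
        """Toy sketch: a slow high-level recurrent state that updates
        once per outer step, and a fast low-level state that iterates
        several inner steps in between."""

        def __init__(self, dim=128, inner_steps=4, outer_steps=8):
            super().__init__()
            self.inner_steps = inner_steps
            self.outer_steps = outer_steps
            self.low = nn.GRUCell(dim, dim)   # fast, detailed computation
            self.high = nn.GRUCell(dim, dim)  # slow, abstract planning
            self.readout = nn.Linear(dim, dim)

        def forward(self, x):
            # x: (batch, dim) embedding of the puzzle input
            z_h = torch.zeros_like(x)         # high-level (slow) state
            z_l = torch.zeros_like(x)         # low-level (fast) state
            for _ in range(self.outer_steps):
                # low-level module runs several fast steps,
                # conditioned on the current high-level plan
                for _ in range(self.inner_steps):
                    z_l = self.low(x + z_h, z_l)
                # high-level module updates once from the low-level result
                z_h = self.high(z_l, z_h)
            return self.readout(z_h)

    model = TwoTimescaleReasoner()
    print(model(torch.randn(2, 128)).shape)  # torch.Size([2, 128])

The interesting part is that unrolling outer_steps * inner_steps recurrent updates buys "computational depth" without adding parameters, which is presumably how 27M parameters can go so far on these puzzles.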