This work is part of the Recursive Architectures series.

Hierarchical Reasoning Models

On 26 June 2025, Wang et al. published Hierarchical Reasoning Models on arXiv. The work quickly gained interest from the AI research community, with some expressing skepticism about the results. I am not an expert on the benchmarks (ARC-AGI Challenge, Sudoku-Extreme, and Maze-hard), so I am not sure how impressive their results are, though I believe the results have been reproduced by multiple individuals and groups. I don’t have much to add on that front. However, I have been working on recursive architectures, and to me the HRM architecture and training setup seem reasonable. Unfortunately, I haven’t been able to find adequate discussions or explanations online that align with my intuitions about their work, so I thought I would write this blog post.

For clarity, I will be using Hierarchical Reasoning Models to refer to the work by Wang et al. and HRM to refer to the architecture.

Currently (14 August 2025), the official GitHub repository has 1.2k forks and 9k stars.

The work proposes multiple components, and without ablations I think it is difficult for AI researchers unfamiliar with recursive architectures to follow. To that end, the goal of this blog is to separate the different components, provide intuitive discussions and explanations, and provide code to show how easy they are to implement. To me, there are two interesting contributions: 1) The HRM Architecture, and 2) Deep Supervision. I will focus on these two contributions. The authors also propose Adaptive Computational Time (ACT), an adaptive halting strategy; to me, this is less interesting, and I am not convinced their strategy is the most intuitive one. I will briefly discuss ACT below and provide alternative strategies that are more intuitive.

The HRM Architecture

Before we discuss the HRM architecture, we will discuss recursive (through depth) architectures.

Note that we make a distinction between recursive-through-depth architectures and recursive-through-time architectures; we clarify this below.

The simplest recursive architecture would be a single layer applied a fixed number of times. Given an input \(\mathbf{x}\), a target \(\mathbf{y}\), a layer \(f\), and a number of recursions \(K\), we have: \[ \begin{align*} \mathbf{x}_{0} &= \mathbf{x} \\ \mathbf{x}_{k} &= f(\mathbf{x}_{k-1}) \qquad k \in [1, \dots, K] \\ \hat{\mathbf{y}} &= \mathbf{x}_{K} \end{align*} \] We can train this layer in the regular fashion by computing the loss \(\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})\) and backpropagating.

The layer \(f\) need not be a single layer. It can be a sequence of layers or an entire architecture, but to ease understanding, we will continue treating \(f\) as a single layer.

import torch.nn as nn

class RecursiveModel(nn.Module):
    def __init__(self, layer: nn.Module, K: int):
        super().__init__()
        self.f = layer  # the layer f applied recursively
        self.K = K      # number of recursions

    def forward(self, x):
        # Apply the same layer f to its own output K times.
        for _ in range(self.K):
            x = self.f(x)
        return x

This simple recursive architecture is straightforward to implement and experiment with. We can easily ablate over the number of recursions \(K\) and find the optimal value for a given dataset or layer. Training a RecursiveModel with \(K=1\) is exactly equivalent to training a standard model with a single layer. As we start increasing the number of recursions \((K=2, 3, \dots)\), we notice that we get better predictions. If we continue increasing the number of recursions \((K=10, 100, \dots)\), we get worse predictions, or the model stops learning altogether. This is because the solution space is more constrained: the layer \(f\) must both transform the input into intermediate states that remain compatible with the next application of the same layer and transform the final intermediate state into a reasonable final prediction. There is also the infamous problem of exploding and vanishing gradients.
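As a concrete sketch of such an ablation over \(K\) (the toy regression data, the MLP block, and the training hyperparameters below are illustrative choices, not the setup used for the plots in this post):

import torch
import torch.nn as nn

torch.manual_seed(0)
x_train = torch.randn(256, 32)  # toy inputs
y_train = torch.randn(256, 32)  # toy targets

for K in [1, 2, 4, 10]:
    # A small MLP block standing in for the layer f.
    block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
    model = RecursiveModel(block, K)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(500):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x_train), y_train)
        loss.backward()
        optimizer.step()
    print(f"K={K}: final training loss {loss.item():.4f}")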

Exploding gradients occur when many large per-step gradient factors are multiplied together, making the overall gradient exponentially larger and training unstable. Vanishing gradients occur when many small factors are multiplied together, making the gradient exponentially smaller, so the earliest applications of the layer barely contribute to learning.
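As a rough back-of-the-envelope illustration, suppose each application of \(f\) scales the backpropagated gradient by a roughly constant factor (an assumption made purely for illustration). A factor only slightly away from 1 compounds quickly over many recursions: \[ 0.9^{100} \approx 2.7 \times 10^{-5}, \qquad 1.1^{100} \approx 1.4 \times 10^{4}. \]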

The loss surfaces of RecursiveModel with different values of \(K\). RecursiveModel improves with \(K=2\), then worsens with \(K=4\), and fails to learn at \(K=10\).

To expand the solution space, we are going to pass the input \(\mathbf{x}\) to the layer \(f\) at every iteration. That is: \[ \begin{align*} \mathbf{x}_{0} &= \mathbf{x} \\ \mathbf{x}_{k} &= f(\mathbf{x}_{k-1}, \mathbf{x}) \qquad k \in [1, \dots, K] \\ \hat{\mathbf{y}} &= \mathbf{x}_{K} \end{align*} \] Here, the layer \(f\) no longer needs to balance producing intermediate states against producing final predictions; at every iteration it has access to both its previous prediction \(\mathbf{x}_{k-1}\) and the original input \(\mathbf{x}\), so it can make a direct prediction from \(\mathbf{x}\) to \(\hat{\mathbf{y}}\). Again, we can train this layer in the regular fashion.

In practice, we can choose to combine \(\mathbf{x}_{k-1}\) and \(\mathbf{x}\) before passing the resulting vector into the layer \(f\). In the experiments below, we simply add \(\mathbf{x}_{k-1}\) and \(\mathbf{x}\) to keep a fixed number of trainable parameters.

class RecursiveModelWithSkip(nn.Module):
    def __init__(self, layer: nn.Module, K: int):
        super().__init__()
        self.f = layer  # here f takes two inputs: the state z and the original input x
        self.K = K      # number of recursions

    def forward(self, x):
        z = x
        for _ in range(self.K):
            # We pass the intermediate state z AND the original input x to layer f.
            z = self.f(z, x)
        return z
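For concreteness, here is a minimal sketch of a layer that combines \(\mathbf{x}_{k-1}\) and \(\mathbf{x}\) by addition, as described above. The block itself (a small two-layer MLP I call AdditiveBlock, with an illustrative width of 32) is a stand-in, not the exact layer used in the experiments.

class AdditiveBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z, x):
        # Combine the previous state and the original input by addition,
        # then apply the transformation. Adding keeps the parameter count fixed.
        return self.net(z + x)

model = RecursiveModelWithSkip(AdditiveBlock(32), K=8)
y_hat = model(torch.randn(4, 32))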

The loss surfaces of RecursiveModelWithSkip with different values of \(K\). RecursiveModelWithSkip improves as the number of recursions \(K\) increases.

Deep Supervision

Adaptive Computational Time