import torch
import torch.nn as nn

class RecursiveModel(nn.Module):
    def __init__(self, layer: nn.Module, K: int):
        super().__init__()  # required nn.Module initialization
        self.f = layer      # the single shared layer
        self.K = K          # number of recursions through depth
    def forward(self, x):
        # Apply the same layer K times to the input.
        for _ in range(self.K):
            x = self.f(x)
        return x

This work is part of the Recursive Architectures series.
On Hierarchical Reasoning Recursive Models
On 26 June 2025, Wang et al. published Hierarchical Reasoning Models on arXiv. The work quickly gained interest from the AI research community, with some casting skepticism on the results. I am not an expert on the benchmarks (the ARC-AGI Challenge, Sudoku-Extreme, and Maze-Hard), so I am not sure how impressive their results are, though I believe they have been reproduced by multiple individuals and groups. I don't have much to add on that front. However, I have been working on recursive architectures, and to me the HRM architecture and training setup seem reasonable. Unfortunately, I haven't been able to find discussions or explanations online that align with my intuitions about their work, so I thought I would write this post.
For clarity, I will be using Hierarchical Reasoning Models to refer to the work by Wang et al. and HRM to refer to the architecture.
Currently (14 August 2025), the official GitHub repository has 1.2k forks and 9k stars.
The work proposes multiple components, and without ablations I think it's difficult for AI researchers unfamiliar with recursive architectures to follow. To that end, the goal of this post is to separate the different components, provide intuitive discussions and explanations, and provide code to show how easy they are to implement. To me, there are two interesting contributions: 1) the HRM architecture, and 2) Deep Supervision; I will focus on these two. The authors also propose Adaptive Computation Time (ACT), an adaptive halting strategy. To me, this is less interesting, and I am not sure their strategy is the most intuitive one. I will briefly discuss ACT below and suggest alternative strategies that I find more intuitive.
The HRM Architecture
Before we discuss the HRM architecture, we will first discuss recursive (through depth) architectures.
Note that we make a distinction between recursive-through-depth architectures and recursive-through-time architectures; we clarify this distinction below.
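As a rough code-level illustration of the distinction (the layer sizes, data, and the RNN cell here are my own toy assumptions, not from the paper): a recursive-through-depth model applies the same layer repeatedly to a single input, while a recursive-through-time model, such as an RNN, applies the same cell once per timestep of a sequence.

import torch
import torch.nn as nn

f = nn.Linear(16, 16)            # a shared layer (assumed toy size)

# Recursive through depth: the same layer is applied K times to one input.
x = torch.randn(8, 16)           # (batch, features)
h = x
for _ in range(4):               # K = 4 recursions through depth
    h = f(h)

# Recursive through time: the same cell is applied once per timestep.
cell = nn.RNNCell(16, 16)
seq = torch.randn(8, 10, 16)     # (batch, time, features)
h_t = torch.zeros(8, 16)         # initial hidden state
for t in range(seq.shape[1]):
    h_t = cell(seq[:, t], h_t)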
The simplest recursive architecture would be a single layer applied a fixed number of times. Given an input \(\mathbf{x}\), a target \(\mathbf{y}\), a layer \(f\), and a number of recursions \(K\), we have: \[ \begin{align*} \mathbf{x}_{0} &= \mathbf{x} \\ \mathbf{x}_{k} &= f(\mathbf{x}_{k-1}) \qquad k = 1, \dots, K \\ \hat{\mathbf{y}} &= \mathbf{x}_{K} \end{align*} \] We can train this layer in the usual fashion by computing the loss \(\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})\) and backpropagating.
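To make training concrete, here is a minimal sketch using the RecursiveModel class from the top of this post; the layer, data sizes, and mean-squared-error loss are toy assumptions of mine, not the paper's setup.

import torch
import torch.nn as nn

# Assumed toy setup: a linear layer recursed K = 4 times on random data.
model = RecursiveModel(layer=nn.Linear(16, 16), K=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, 16)    # input batch
y = torch.randn(32, 16)    # target batch

optimizer.zero_grad()
y_hat = model(x)           # x_K after K recursions
loss = loss_fn(y_hat, y)   # L(y, y_hat)
loss.backward()            # backprop through all K applications of f
optimizer.step()

Note that because the same parameters are used at every recursion, the gradient accumulates contributions from all \(K\) applications of \(f\), much like backpropagation through time.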