An Interactive Reading of

Nested Learning:
The Illusion of
Deep Learning Architecture

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong & Vahab Mirrokni
Google Research · Columbia University · NeurIPS 2025

The paper, in plain English

Every deep learning model you have used — from GPT-4 to Midjourney — is frozen the moment training ends. It can juggle tokens inside its context window, but it cannot write new knowledge into its own weights. The authors compare this to anterograde amnesia: the patient remembers everything before the accident, experiences the present vividly, but cannot form new long-term memories. Current LLMs have the same condition.

The paper's central insight is that the distinction between "architecture" and "optimizer" is an illusion. Both are associative memories compressing their own context flow, just at different frequencies. Gradient descent memorizes input-to-surprise mappings. Momentum is a second-level memory compressing past gradients. Adam is the optimal memory for a specific loss function. Once you see the pattern, you can stack more levels, design better optimizers, and build models that modify their own learning rules, much as the human brain is thought to use fast gamma waves for perception and slow delta waves for memory consolidation.

Out of this theory comes Hope, a model that combines self-modifying sequence processing with a multi-frequency memory system. It matches or beats Transformers on language modeling, maintains performance at 10 million tokens of context, and nearly eliminates catastrophic forgetting when learning new languages sequentially. The paper does not just propose a new architecture — it offers a new lens through which every existing architecture can be understood.

I

Optimizers Are Memories

Adam, SGD with momentum, AdaGrad — all are associative memories compressing gradients. This viewpoint unlocks new, more expressive optimizers.

II

Continuum Memory System

Replace "short-term vs long-term" with a spectrum of update frequencies. Knowledge circulates between levels, making catastrophic forgetting far less likely.

III

Self-Modifying Learning

Hope generates its own learning signals and rewrites its own update rule in-context. The model learns how to learn how to learn.

Chapter 1

The Static Model Problem

Your language model was state-of-the-art when training ended. Then the world kept turning.

In plain English

Imagine a person who can remember everything from before a car accident, can hold a conversation in the moment, but cannot convert today's experiences into lasting memories. Every morning the world is new again. That is anterograde amnesia, and it is exactly how current LLMs operate after pre-training ends.

The Transformer's attention block updates with every token (frequency = infinity). Its MLP weights stay frozen (frequency = zero). There is no mechanism to transfer knowledge from the fast-updating context into the slow, persistent weights.

The human brain solves this with multiple timescales: fast gamma waves for sensory processing, slow delta waves for memory consolidation. The Nested Learning paradigm asks: what if we gave neural networks the same multi-timescale architecture?

Standard gradient descent — the "frozen after training" update rule

$$W_{t+1} = W_t - \eta_t \nabla_{W_t} \mathcal{L}(W_t; \mathbf{x}_t)$$

The weight $W_t$ is updated during training on data $\mathbf{x}_t$, but after "end of pre-training" this gradient flow stops entirely. The model can only adapt through its context window — never through its persistent parameters.

Backpropagation as surprise-based associative memory

$$W_\ell^{t+1} = \arg\min_W \left[ \langle W\,\hat{\mathbf{x}}_{\ell-1}, \boldsymbol{\delta}_\ell \rangle + \frac{1}{2\eta_\ell}\|W - W_\ell^t\|_F^2 \right]$$

This proximal-gradient reformulation shows that training a layer with backpropagation is equivalent to memorizing the mapping from each layer's input $\hat{\mathbf{x}}_{\ell-1}$ to its local surprise signal $\boldsymbol{\delta}_\ell$ — how "surprising" the output was. The network is a surprise compressor.
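To see why the proximal form reduces to the familiar update, here is a short worked step. It treats $\hat{\mathbf{x}}_{\ell-1}$ and $\boldsymbol{\delta}_\ell$ as fixed while solving for $W$, which is an assumption of this sketch. The objective is convex in $W$, so the minimizer is found by setting its gradient to zero:

$$\boldsymbol{\delta}_\ell\, \hat{\mathbf{x}}_{\ell-1}^\top + \frac{1}{\eta_\ell}\left(W - W_\ell^t\right) = 0 \quad\Longrightarrow\quad W = W_\ell^t - \eta_\ell\, \boldsymbol{\delta}_\ell\, \hat{\mathbf{x}}_{\ell-1}^\top$$

The outer product $\boldsymbol{\delta}_\ell\, \hat{\mathbf{x}}_{\ell-1}^\top$ is the gradient of the loss with respect to the weights of a linear layer, so the minimizer is an ordinary gradient step with learning rate $\eta_\ell$.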

Interactive demo (controls: learning rate η, shown at 0.020; training steps, shown at 80): the sliders show how learning rate and training duration affect the model's ability to compress its training data into weights. The "surprise" curve plots the gradient magnitude, the memory trace of how unexpected each input was.

The key insight: backpropagation is not just an optimization algorithm — it is an associative memory that maps inputs to prediction errors. Once training stops, the memory stops updating. Nested Learning asks what happens if we never stop.


Chapter 2

The Nested Optimization Paradigm

What if "architecture" and "optimizer" are the same thing, just running at different speeds?

Definition: Update Frequency

$$f_A = \text{number of updates per unit time for component } A$$

Components are sorted by frequency into levels: higher frequency (faster) components update more often. Attention runs every token. MLP weights update once per pre-training step. Momentum updates every backward pass. Each level has its own context flow — the data it learns from.

Nested System of Associative Memories (NSAM)

$$\boldsymbol{\theta}^{(k)}_{t+1} = \arg\min_{\boldsymbol{\Phi}^{(k)}} \left[ \langle \boldsymbol{\Phi}^{(k)} \mathbf{k}_{t+1}^{(i)}, -\nabla L^{(k)}_i(\boldsymbol{\theta}^{(k)}_t; \mathbf{k}_{t+1}^{(i)}, \mathbf{v}_{t+1}^{(i)}) \rangle + \frac{1}{2\eta^{(k)}_{t+1}} \|\boldsymbol{\Phi}^{(k)} - \boldsymbol{\theta}^{(k)}_t\|^2_2 \right]$$

Each level $k$ contains optimization problems with their own keys $\mathbf{k}$, values $\mathbf{v}$, and objectives $L$. All are optimized with gradient descent — but on different timescales.

Interactive demo (controls: outer update frequency, shown at 16 steps between outer updates; inner learning rate η, shown at 0.080): the blue curve (outer parameter W) only updates at the specified frequency. The red curve (inner memory M) updates every step and is re-initialized to W at each outer update. More frequent outer updates lead to faster overall convergence.
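The two-level behaviour described here can be sketched in a few lines of Python. This is a toy under stated assumptions, not the demo's actual code: the loss is a 1-D quadratic, and the outer rule (move `W` a fraction `outer_lr` of the way toward `M`) is a hypothetical choice, since the text does not say how the outer parameter is updated. The names `run_nested`, `outer_every`, and `outer_lr` are illustrative.

```python
def run_nested(outer_every, inner_lr=0.08, outer_lr=0.5, steps=64, target=3.0):
    """Two-level toy: inner memory M updates every step, outer W every `outer_every` steps."""
    W = 0.0   # slow outer parameter
    M = W     # fast inner memory, initialised from W
    for step in range(1, steps + 1):
        # Inner level: one gradient step on the toy loss 0.5 * (M - target)^2.
        M -= inner_lr * (M - target)
        if step % outer_every == 0:
            # Outer level (assumed rule): move W part of the way toward what M learned.
            W += outer_lr * (M - W)
            # Re-initialise the inner memory from the outer parameter.
            M = W
    return abs(W - target)


errors = {c: run_nested(c) for c in (1, 4, 16)}
for c, err in errors.items():
    print(f"outer update every {c:>2} steps -> |W - target| = {err:.3f}")
```

With these settings the remaining error grows as outer updates become rarer, which matches the behaviour the demo describes. Setting `outer_lr=1.0` would make all three schedules identical, because `W` would then simply copy `M`.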

The deeper insight: both architecture and optimizer are instances of the same pattern — associative memories compressing their context flow at different frequencies. This means you can design new architectures by adding more levels, and new optimizers by using more expressive memory structures.


Chapter 3

Optimizers Are Memories

Your optimizer is not finding a solution. It is memorizing the loss landscape.

In plain English

Think of momentum as a notebook. Every time the optimizer computes a gradient, it writes the direction in the notebook. Before taking a step, it reads back all previous entries (weighted toward recent ones) to decide where to go. The notebook is an associative memory — it compresses past gradients into a single state.

Here is the catch: with $\beta = 0.9$, the last 43 gradients contribute about 99% of the momentum. Everything older is effectively erased. That is a very short notebook. The paper argues this is one reason standard momentum struggles at continual learning: it cannot retain information about the loss landscape beyond roughly 43 steps.

Adam, it turns out, is the theoretically optimal memory for compressing gradient variance. Muon maps gradients to an orthogonal space. Every optimizer you have ever used is a memory module with specific compression properties.

Momentum as 2-level nested optimization

$$\mathbf{m}_{t+1} = \arg\min_{\mathbf{m}} \left[ -\langle \mathbf{m}, \nabla_{W_t} \mathcal{L}(W_t; \mathbf{x}_{t+1}) \rangle + \frac{1}{2\eta_{t+1}} \|\mathbf{m} - \mathbf{m}_t\|^2_2 \right]$$
$$W_{t+1} = W_t - \mathbf{m}_{t+1}$$

The momentum term $\mathbf{m}$ is itself optimized by gradient descent — a 2-level nested system where the inner level learns to compress gradients and the outer level uses the compressed knowledge to update weights.
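The inner problem can be solved in closed form, which makes the "memory of gradients" reading concrete. Writing $\mathbf{g}_{t+1} = \nabla_{W_t} \mathcal{L}(W_t; \mathbf{x}_{t+1})$ as shorthand (a symbol introduced here for brevity) and setting the gradient of the inner objective to zero:

$$-\mathbf{g}_{t+1} + \frac{1}{\eta_{t+1}}\left(\mathbf{m} - \mathbf{m}_t\right) = 0 \quad\Longrightarrow\quad \mathbf{m}_{t+1} = \mathbf{m}_t + \eta_{t+1}\, \mathbf{g}_{t+1}, \qquad W_{t+1} = W_t - \mathbf{m}_t - \eta_{t+1}\, \mathbf{g}_{t+1}$$

Unrolled from $\mathbf{m}_0 = 0$, the inner state is the running sum $\mathbf{m}_{t+1} = \sum_{s \le t+1} \eta_s\, \mathbf{g}_s$: every past gradient is stored with no decay. The familiar decay factor $\beta$ would correspond to shrinking $\mathbf{m}_t$ before the step. That is an assumption about how the decayed variant arises, since the equation above does not include it.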

Delta Gradient Descent — incorporating previous state

$$W_{t+1} = W_t \left(\mathbf{I} - \eta'_t\, \mathbf{x}_t \mathbf{x}_t^\top\right) - \eta'_t\, \nabla_{\mathbf{y}_t} \mathcal{L}(W_t; \mathbf{x}_t) \otimes \mathbf{x}_t$$

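In a plain gradient step, the previous weights $W_t$ are carried over unchanged and the new gradient term is simply added on top. DGD instead multiplies $W_t$ by a data-dependent decay $\mathbf{I} - \eta'_t \mathbf{x}_t \mathbf{x}_t^\top$, which shrinks whatever the memory currently stores along the direction of $\mathbf{x}_t$ before the new association is written. The update therefore depends on the model's previous state as well as the current input, which is what makes it self-referential.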
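A small numerical sketch can make the decay term tangible. It implements the DGD formula above next to a plain gradient step, using pure-Python lists. The values of `W`, `x`, and `grad_y` are made up for illustration, and the choice of a unit-norm input with `eta = 1.0` is an assumption that makes the decay erase the old content along `x` completely.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def plain_step(W, x, grad_y, eta):
    """Plain gradient step: W - eta * outer(grad_y, x)."""
    return [[W[i][j] - eta * grad_y[i] * x[j] for j in range(len(x))]
            for i in range(len(W))]


def dgd_step(W, x, grad_y, eta):
    """DGD step: W (I - eta x x^T) - eta * outer(grad_y, x)."""
    Wx = matvec(W, x)  # W (x x^T) equals outer(W x, x)
    return [[W[i][j] - eta * (Wx[i] + grad_y[i]) * x[j] for j in range(len(x))]
            for i in range(len(W))]


W = [[2.0, 3.0], [4.0, 5.0]]
x = [1.0, 0.0]        # unit-norm input (assumption)
grad_y = [0.5, -1.0]  # hypothetical gradient of the loss w.r.t. the output
eta = 1.0

W_plain = plain_step(W, x, grad_y, eta)
W_dgd = dgd_step(W, x, grad_y, eta)
print(matvec(W_plain, x))  # [1.5, 5.0]: old response to x, shifted by the gradient
print(matvec(W_dgd, x))    # [-0.5, 1.0]: old response erased, only -grad_y remains
```

In this setting the plain step keeps the old response `W x` and adds the correction to it, while the DGD step first removes what was stored along `x` and then writes the new signal. With a smaller `eta` the erasure is partial rather than complete.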

Interactive demo (control: momentum decay β, shown at 0.90). Left panel: SGD, Momentum, and Adam paths on a sinusoidal loss landscape. Right panel: how many past gradients contribute to the current momentum (green line = 99% threshold). Higher β keeps more history, but even β = 0.99 forgets after roughly 460 steps.

When $\beta = 0.9$, the last 6 gradients account for roughly half of the momentum, and everything older than about 43 steps contributes only about 1% in total. In a continual learning scenario with diverse tasks, this means the optimizer has almost no memory of the loss landscape from earlier tasks, which sets the stage for catastrophic forgetting.
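The window sizes quoted here can be checked with a few lines of arithmetic. The sketch assumes momentum is an exponential moving average over an unbounded history, so the gradient from $i$ steps back carries weight $(1-\beta)\beta^i$ and the $n$ most recent gradients together carry $1 - \beta^n$ of the total. The function names are illustrative.

```python
import math


def gradients_needed(beta, share):
    """Smallest n such that the n most recent gradients hold at least `share` of the EMA weight."""
    return math.ceil(math.log(1.0 - share) / math.log(beta))


def share_of_last(beta, n):
    """Fraction of total EMA weight carried by the n most recent gradients."""
    return 1.0 - beta ** n


print(gradients_needed(0.9, 0.5))         # 7
print(gradients_needed(0.9, 0.99))        # 44
print(gradients_needed(0.99, 0.99))       # 459
print(round(share_of_last(0.9, 6), 3))    # 0.469
print(round(share_of_last(0.9, 43), 3))   # 0.989
```

Under this assumption the figures of 6 and 43 are close round numbers: 6 gradients hold about 47% of the weight and 43 hold about 98.9%. Raising β to 0.99 stretches the 99% window to roughly 460 steps, which is still short next to a long sequence of tasks.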


Chapter 4

Continuum Memory

Forget "short-term vs long-term." Memory is a spectrum.

Continuum Memory System (CMS)

$$\mathbf{y}_t = \text{MLP}^{(f_k)}(\text{MLP}^{(f_{k-1})}(\cdots \text{MLP}^{(f_1)}(\mathbf{x}_t)))$$

A chain of MLP blocks, each updated at a different frequency $f_\ell$. Higher-frequency blocks adapt fast but forget fast. Lower-frequency blocks persist but respond slowly. Together they form a memory spectrum.

Per-level update schedule

$$\boldsymbol{\theta}^{(f_\ell)}_{i+1} = \begin{cases} \boldsymbol{\theta}^{(f_\ell)}_i - \eta^{(\ell)} \sum_{t=i-C^{(\ell)}}^{i} f(\boldsymbol{\theta}^{(f_\ell)}_t; \mathbf{x}_t) & \text{if } i \equiv 0 \pmod{C^{(\ell)}} \\ \boldsymbol{\theta}^{(f_\ell)}_i & \text{otherwise} \end{cases}$$

Each level only updates when its chunk boundary $C^{(\ell)}$ is reached. A standard Transformer is the special case $k=1$ — a single MLP updated only during pre-training.
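The chunked schedule can be sketched with scalar "levels" in place of MLP blocks. This is a toy under stated assumptions: each level is a single number rather than a network, the levels are independent rather than chained, the error signal is `theta - x` (the gradient of a squared-error loss), and each update sums the signals from the chunk that just ended. The function name and chunk sizes are illustrative.

```python
def run_cms_schedule(stream, chunk_sizes, lr=0.05):
    """Update each level only at its own chunk boundary, using the signal accumulated since."""
    thetas = [0.0] * len(chunk_sizes)       # one scalar parameter per level
    buffers = [0.0] * len(chunk_sizes)      # accumulated error signal per level
    update_counts = [0] * len(chunk_sizes)
    for i, x in enumerate(stream, start=1):
        for level, chunk in enumerate(chunk_sizes):
            # Error signal f(theta; x) for the toy loss 0.5 * (theta - x)^2.
            buffers[level] += thetas[level] - x
            if i % chunk == 0:
                # Chunk boundary reached: apply the accumulated update, then reset.
                thetas[level] -= lr * buffers[level]
                buffers[level] = 0.0
                update_counts[level] += 1
    return thetas, update_counts


thetas, counts = run_cms_schedule([1.0] * 16, chunk_sizes=[1, 4, 16])
print(counts)   # [16, 4, 1]
print([round(t, 3) for t in thetas])
```

Over 16 steps the fastest level updates every step, the middle level four times, and the slowest only once. A level with a very large chunk size would never update within the stream, which corresponds to the frozen-MLP case described above.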

Interactive demo (controls: number of memory levels, shown at 3; forgetting rate, shown at 0.003): each vertical dotted line is a new task. With 1 level (single MLP), the model forgets previous tasks as soon as it adapts to the new one. With more levels, CMS preserves knowledge from earlier tasks. A higher forgetting rate stress-tests the system.

When a fast-level block forgets, the knowledge is still in slower blocks — and backpropagation through the initialization can circle it back. This recovery loop is why Hope maintains performance at 10M-token context while other models collapse. It is not magic; it is memory management at multiple timescales.


Chapter 5

Self-Referential Learning

What if the model could rewrite its own source code?

Self-referential update rule (DGD with weight decay)

$$\mathbf{k}_t = M_{\mathbf{k},t-1}(\mathbf{x}_t), \quad \mathbf{v}_t = M_{\mathbf{v},t-1}(\mathbf{x}_t), \quad \hat{\mathbf{v}}_{\square,t} = M_{\square,t-1}(\mathbf{v}_t)$$
$$M_{\square,t} = M_{\square,t-1}\left(\alpha_t \mathbf{I} - \eta_t \mathbf{k}_t \mathbf{k}_t^\top\right) - \eta_t\, \nabla L\left(M_{\square,t-1}; \mathbf{k}_t, \hat{\mathbf{v}}_{\square,t}\right)$$

Each memory $M_\square$ (for keys, values, learning rates, gates, and the main memory) generates its own values $\hat{\mathbf{v}}_{\square,t}$ from its current state. The update uses the Delta Gradient Descent rule, which incorporates both the current input and the model's previous state — making it self-referential.
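A stripped-down sketch of one such update may help. It collapses the several memories into a single linear memory stored as a small matrix and feeds in the key and value directly. It also assumes the inner loss is the dot-product objective $L = -\langle M\mathbf{k}, \hat{\mathbf{v}} \rangle$, so that $\nabla L = -\hat{\mathbf{v}}\mathbf{k}^\top$. None of these simplifications come from the paper; the text above does not fix the loss or the form of the memories.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def self_referential_step(M, k, v, alpha=1.0, eta=1.0):
    """One update M (alpha I - eta k k^T) + eta * outer(v_hat, k), with v_hat = M v."""
    v_hat = matvec(M, v)  # the memory generates its own target from its current state
    Mk = matvec(M, k)     # current response to the key; M (k k^T) equals outer(M k, k)
    M_new = [[alpha * M[i][j] - eta * Mk[i] * k[j] + eta * v_hat[i] * k[j]
              for j in range(len(k))]
             for i in range(len(M))]
    return M_new, v_hat


M = [[1.0, 2.0], [0.0, 1.0]]
k = [0.0, 1.0]   # unit-norm key (assumption)
v = [1.0, 1.0]   # hypothetical value vector

M_new, v_hat = self_referential_step(M, k, v)
print(v_hat)             # [3.0, 1.0]
print(matvec(M_new, k))  # [3.0, 1.0]: the key now retrieves the self-generated value
```

With `alpha = 1.0`, `eta = 1.0`, and a unit-norm key, the old response to `k` is replaced exactly by `v_hat`. Values of `alpha` below 1 would additionally fade everything else the memory holds, which plays the role of the gate in the update rule.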

Interactive demo (controls: memory capacity as a dimension proxy, shown at 4; learning rate, shown at 0.150): three memory update rules compared on the same sequence. Hebbian (fixed) quickly plateaus at a high error. The delta rule adapts better. The self-referential DGD generates its own learning signals and converges fastest. Larger memory capacity improves all methods but benefits self-referential learning most.

The model does not just learn what to remember — it learns how to remember. The self-generated values $\hat{\mathbf{v}}_{\square,t}$ mean the model controls its own learning dynamics, deciding for itself what is worth storing and what to forget. This is the mechanism that lets Hope adapt to completely new languages without catastrophic forgetting.


Chapter 6

The Hope Architecture

Self-modifying memory meets continuum storage.

In plain English

Hope combines two complementary ideas. The self-referential Titans module is a small but powerful memory that rewrites its own learning rule — perfect for fast adaptation but limited in capacity. The Continuum Memory System is a large-capacity storage with multiple update frequencies — perfect for persistence but with a simpler learning rule.

Put them together and you get a model that adapts fast, remembers long, and recovers what it forgets. The Titans module handles the "learn from the current context" part. The CMS handles the "store important knowledge for later" part. Knowledge flows between them through initialization and backpropagation.

Hope can also be built on top of existing Transformers by replacing MLP blocks with CMS layers and fine-tuning — no need to train from scratch.

Hope forward pass (simplified)

$$\mathbf{o}_t = M_{\text{memory},t-1}(\mathbf{q}_t), \quad \mathbf{k}_t = M_{\mathbf{k},t-1}(\mathbf{x}_t), \quad \mathbf{v}_t = M_{\mathbf{v},t-1}(\mathbf{x}_t), \quad \eta_t = M_{\eta,t-1}(\mathbf{x}_t), \quad \alpha_t = M_{\alpha,t-1}(\mathbf{x}_t)$$
$$M_{\square,t} = M_{\square,t-1}\left(\alpha_t \mathbf{I} - \eta_t \mathbf{k}_t \mathbf{k}_t^\top\right) - \eta_t\, \nabla L(M_{\square,t-1}; \mathbf{k}_t, \hat{\mathbf{v}}_{\square,t})$$
$$\mathbf{y}_t = \text{MLP}^{(f_k)}(\text{MLP}^{(f_{k-1})}(\cdots \text{MLP}^{(f_1)}(\mathbf{o}_t)))$$

Lines 1-2: Self-referential Titans module generates keys, values, learning rates, and gates — then updates all memories using DGD. Line 3: The output passes through the CMS chain for persistent storage at multiple frequencies.

Interactive demo (controls: memory levels in CMS, shown at 3; lowest frequency for persistent memory, shown at 512): this simulation mirrors the paper's RULER benchmark. More CMS levels and lower update frequencies correspond to better long-range retrieval, and the gap to the 1-level baseline (equivalent to standard ICL) is directly visible. The paper finds a lowest-frequency setting of 2K to be a sweet spot between efficiency and performance.

Hope outperforms Transformers on language modeling at 760M and 1.3B parameters, reaches 88.4% on single-needle NIAH at 16K context, and maintains performance at 10M tokens — a regime where every other model in the paper has collapsed. The secret is not a bigger model; it is more levels of memory.


Chapter 7

Results & Implications

The numbers do not lie. But the perspective shift matters more.

In plain English

On standard benchmarks, Hope beats every attention-free model and also edges out the Transformer baseline. At 760M parameters trained on 30B tokens, Hope achieves 52.3% average accuracy vs 50.1% for Transformers. At 1.3B / 100B tokens, the gap widens to 58.0% vs 53.4%.

But the paper's lasting contribution is not any single number. It is the lens. Through Nested Learning, you can see that pre-training is just in-context learning with a very long context. That Transformers are MLP blocks whose inner loop runs at infinite frequency. That hybrid architectures are not new designs — they are old Transformers with extra levels added to the MLP blocks.

Once you see it, you cannot unsee it. And the design space it opens — more levels, architecture-specific optimizers, multi-frequency memory — is vast and largely unexplored.

Key Experimental Results

Language Modeling (1.3B params, 100B tokens): Hope achieves 14.39 perplexity on Wikitext (vs 17.92 for Transformer++) and 58.0% average accuracy across common-sense reasoning tasks — 4.7 points ahead of the best competing model.

Long Context (BABILong): Hope is the only model maintaining performance at 10 million tokens. GPT-4 fails at ~256K. Llama-8B with RAG degrades steadily. Titans and ARMT collapse after 1M. Hope persists.

Continual Learning (CTNL): When learning two new languages sequentially, standard ICL suffers catastrophic forgetting. Hope-3 (3 CMS levels) nearly recovers the single-language performance, evidence that the multi-level design mitigates forgetting.

Class Incremental Learning: On CLINC, Banking, and DBpedia datasets, Hope-enhanced Llama-3B and Llama-8B outperform ICL, EWC, and InCA baselines — including methods with external learners.

Formal Language Recognition: Hope achieves 100% accuracy across all tasks including Parity, Dyck languages, and Shuffle-2 — matching LSTM and beating Transformers (46.4% on Parity), demonstrating the computational depth advantage of nested levels.

The paper's most provocative claim: there is no meaningful distinction between training and test time in a neural learning module. Every phase is just optimization at a different frequency. Pre-training is in-context learning with an ultra-large context. Test-time training is in-context learning with a small context. The only difference is how often you update, and how much you remember.

Conceptual Takeaways

Architectures generate the context for optimizers. The gradient landscape that an optimizer sees is produced by the architecture. Different architectures produce different gradient patterns, which means the optimal optimizer depends on the architecture. One-size-fits-all optimization may be leaving performance on the table.

Models have more parameters than we knew. Momentum terms, attention states, recurrent hidden states — all are parameters that store knowledge. They just update at different frequencies. When we discard momentum at the end of training, we are throwing away compressed knowledge about the loss landscape.

In-context learning is not emergent — it is structural. Having multiple nested levels is in-context learning. Transformers do it at infinite frequency (non-parametric). RNNs do it at finite frequency (parametric). The quality of ICL depends on how well the lower-frequency levels (pre-training) prepare the higher-frequency levels for fast adaptation.

The illusion of deep learning architecture. What looks like a heterogeneous mix of attention, MLP, convolutions, and recurrent layers is, from the NL viewpoint, a set of uniform feedforward networks optimized at different frequencies with different objectives. The heterogeneity is an artifact of viewing the solution rather than the optimization process.
