An Interactive Reading of

Nested Learning: The Illusion of Deep Learning Architecture

The paper, in plain English

Every deep learning model you have used — from GPT-4 to Midjourney — is frozen the moment training ends. It can juggle tokens inside its context window, but it cannot write new knowledge into its own weights. The authors compare this to anterograde amnesia: the patient remembers everything before the accident, experiences the present vividly, but cannot form new long-term memories. Current LLMs have the same condition.

The paper's central insight is that the distinction between "architecture" and "optimizer" is an illusion. Both are associative memories compressing their own context flow, just at different frequencies. Gradient descent memorizes input-to-surprise mappings. Momentum is a second-level memory compressing past gradients. Adam can be read as the optimal associative memory for one particular objective. Once you see the pattern, you can stack more levels, design better optimizers, and build models that modify their own learning rules, loosely analogous to how the brain is thought to use fast gamma waves for perception and slow delta waves for memory consolidation.

Out of this theory comes Hope, a model that combines self-modifying sequence processing with a multi-frequency memory system. It matches or beats Transformers on language modeling, maintains performance at 10 million tokens of context, and nearly eliminates catastrophic forgetting when learning new languages sequentially. The paper does not just propose a new architecture — it offers a new lens through which every existing architecture can be understood.

I
Optimizers Are Memories
Adam, SGD with momentum, AdaGrad: all are associative memories compressing gradients. This viewpoint unlocks new, more expressive optimizers.

II
Continuum Memory System
Replace "short-term vs long-term" with a spectrum of update frequencies. Knowledge circulates between levels, making catastrophic forgetting far less likely.

III
Self-Modifying Learning
Hope generates its own learning signals and rewrites its own update rule in-context. The model learns how to learn how to learn.

Chapter 1

The Static Model Problem

Your language model was state-of-the-art when training ended. Then the world kept turning.

Standard gradient descent — the "frozen after training" update rule
$$W_{t+1} = W_t - \eta_t \nabla_{W_t} \mathcal{L}(W_t; \mathbf{x}_t)$$

The weight $W_t$ is updated during training on data $\mathbf{x}_t$, but after "end of pre-training" this gradient flow stops entirely. The model can only adapt through its context window — never through its persistent parameters.

Backpropagation as surprise-based associative memory
$$W_\ell^{t+1} = \arg\min_W \left[ \langle W\,\hat{\mathbf{x}}_{\ell-1}, \boldsymbol{\delta}_\ell \rangle + \frac{1}{2\eta_\ell}\|W - W_\ell^t\|_F^2 \right]$$

This proximal-gradient reformulation shows that training a layer with backpropagation is equivalent to memorizing the mapping from each layer's input $\hat{\mathbf{x}}_{\ell-1}$ to its local surprise signal $\boldsymbol{\delta}_\ell$ — how "surprising" the output was. The network is a surprise compressor.
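
To see why the proximal form reproduces the usual gradient step, set the gradient of the bracketed objective to zero. The short derivation below is a sketch that assumes $\boldsymbol{\delta}_\ell$ is held fixed during the minimization (an assumption, not something stated above):

$$\begin{aligned} 0 &= \nabla_W \left[ \langle W\,\hat{\mathbf{x}}_{\ell-1}, \boldsymbol{\delta}_\ell \rangle + \frac{1}{2\eta_\ell}\|W - W_\ell^t\|_F^2 \right] = \boldsymbol{\delta}_\ell\,\hat{\mathbf{x}}_{\ell-1}^\top + \frac{1}{\eta_\ell}\left(W - W_\ell^t\right) \\ \Rightarrow\quad W &= W_\ell^t - \eta_\ell\,\boldsymbol{\delta}_\ell\,\hat{\mathbf{x}}_{\ell-1}^\top \end{aligned}$$

The minimizer is the old weight minus a rank-one term built from the layer's input and its surprise signal. When $\boldsymbol{\delta}_\ell$ is the backpropagated error at the layer's output, that rank-one term is the weight gradient, so this is the gradient descent step from the first equation.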

Drag either slider to see how learning rate and training duration affect the model's ability to compress its training data into weights. The "surprise" curve shows the gradient magnitude — the memory trace of how unexpected each input was.

The key insight: backpropagation is not just an optimization algorithm — it is an associative memory that maps inputs to prediction errors. Once training stops, the memory stops updating. Nested Learning asks what happens if we never stop.

Chapter 2

The Nested Optimization Paradigm

What if "architecture" and "optimizer" are the same thing, just running at different speeds?

Definition: Update Frequency
$$f_A = \text{number of updates per unit time for component } A$$

Components are sorted by frequency into levels: higher frequency (faster) components update more often. Attention runs every token. MLP weights update once per pre-training step. Momentum updates every backward pass. Each level has its own context flow — the data it learns from.

Nested System of Associative Memories (NSAM)
$$\boldsymbol{\theta}^{(k)}_{t+1} = \arg\min_{\boldsymbol{\Phi}^{(k)}} \left[ \langle \boldsymbol{\Phi}^{(k)} \mathbf{k}_{t+1}^{(i)}, -\nabla L^{(k)}_i(\boldsymbol{\theta}^{(k)}_t; \mathbf{k}_{t+1}^{(i)}, \mathbf{v}_{t+1}^{(i)}) \rangle + \frac{1}{2\eta^{(k)}_{t+1}} \|\boldsymbol{\Phi}^{(k)} - \boldsymbol{\theta}^{(k)}_t\|^2_2 \right]$$

Each level $k$ contains optimization problems with their own keys $\mathbf{k}$, values $\mathbf{v}$, and objectives $L$. All are optimized with gradient descent — but on different timescales.

The blue curve (outer parameter W) only updates at the specified frequency. The red curve (inner memory M) updates every step and is re-initialized to W at each outer update. Adjust the frequency to see how more frequent outer updates lead to faster overall convergence.

The deeper insight: both architecture and optimizer are instances of the same pattern — associative memories compressing their context flow at different frequencies. This means you can design new architectures by adding more levels, and new optimizers by using more expressive memory structures.

Chapter 3

Optimizers Are Memories

Your optimizer is not finding a solution. It is memorizing the loss landscape.

Momentum as 2-level nested optimization
$$\mathbf{m}_{t+1} = \arg\min_{\mathbf{m}} \left[ -\langle \mathbf{m}, \nabla_{W_t} \mathcal{L}(W_t; \mathbf{x}_{t+1}) \rangle + \frac{1}{2\eta_{t+1}} \|\mathbf{m} - \mathbf{m}_t\|^2_2 \right]$$
$$W_{t+1} = W_t - \mathbf{m}_{t+1}$$

The momentum term $\mathbf{m}$ is itself optimized by gradient descent — a 2-level nested system where the inner level learns to compress gradients and the outer level uses the compressed knowledge to update weights.
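
A quick numerical check of the two-level view, in plain Python. Setting the gradient of the inner objective to zero gives the closed form $\mathbf{m}_{t+1} = \mathbf{m}_t + \eta_{t+1} \nabla_{W_t} \mathcal{L}$, and the sketch below evaluates that candidate against the objective. The function names and the example numbers are hypothetical, chosen only for illustration.

```python
def momentum_objective(m, grad, m_prev, eta):
    """Inner objective: -<m, grad> + ||m - m_prev||^2 / (2 * eta)."""
    inner = sum(mi * gi for mi, gi in zip(m, grad))
    dist2 = sum((mi - pi) ** 2 for mi, pi in zip(m, m_prev))
    return -inner + dist2 / (2.0 * eta)


def momentum_step(grad, m_prev, eta):
    """Closed-form minimizer of the inner objective: m_prev + eta * grad."""
    return [pi + eta * gi for pi, gi in zip(m_prev, grad)]


def weight_step(w, m_next):
    """Outer level: W_{t+1} = W_t - m_{t+1}."""
    return [wi - mi for wi, mi in zip(w, m_next)]


# Hypothetical values, not taken from the paper.
grad = [0.5, -1.0]
m_prev = [0.1, 0.2]
eta = 0.3

m_next = momentum_step(grad, m_prev, eta)
w_next = weight_step([1.0, 1.0], m_next)
```

Any small perturbation of `m_next` increases `momentum_objective`, which is what makes it the inner level's solution. Note that, as written, the inner objective carries no decay factor on $\mathbf{m}_t$, so a $\beta$-weighted average does not follow from this objective alone.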

Delta Gradient Descent — incorporating previous state
$$W_{t+1} = W_t \left(\mathbf{I} - \eta'_t\, \mathbf{x}_t \mathbf{x}_t^\top\right) - \eta'_t\, \nabla_{\mathbf{y}_t} \mathcal{L}(W_t; \mathbf{x}_t) \otimes \mathbf{x}_t$$

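Unlike the standard GD step, which applies no decay to what $W_t$ already stores, DGD multiplies $W_t$ by a data-dependent factor $\mathbf{I} - \eta'_t \mathbf{x}_t \mathbf{x}_t^\top$. Since $W_t \mathbf{x}_t \mathbf{x}_t^\top$ is built from the layer's current response to $\mathbf{x}_t$, the update depends on the previous state of the weights as well as on the current input. This state dependence is what makes the update self-referential.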
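
A minimal sketch of one DGD step for a linear layer $\mathbf{y} = W\mathbf{x}$, in plain Python. The helper names (`matvec`, `dgd_step`) and the example loss are assumptions for illustration, not taken from the paper. The example uses the dot-product loss $\mathcal{L} = -\langle \mathbf{y}, \mathbf{v} \rangle$, for which $\nabla_{\mathbf{y}} \mathcal{L} = -\mathbf{v}$; with that choice the step reduces to the classic delta rule.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def dgd_step(W, x, grad_y, eta):
    """Return W (I - eta x x^T) - eta * outer(grad_y, x)."""
    y = matvec(W, x)  # current response W x, used by the decay term
    return [
        [wij - eta * (yi + gi) * xj for wij, xj in zip(row, x)]
        for row, yi, gi in zip(W, y, grad_y)
    ]


# Hypothetical example: unit-norm input, target v, assumed loss -<y, v>.
W = [[0.5, 0.0], [0.0, 0.5]]
x = [0.6, 0.8]
v = [1.0, -2.0]
grad_y = [-vi for vi in v]  # gradient of -<y, v> with respect to y

W_new = dgd_step(W, x, grad_y, eta=1.0)
y_new = matvec(W_new, x)
```

With a unit-norm input and `eta=1.0`, the updated layer maps `x` to `v` (up to floating-point error): the decay term erases the old response to `x` before the new one is written.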

Left: SGD, Momentum, and Adam paths on a sinusoidal loss landscape. Right: How many past gradients contribute to the current momentum (green line = 99% threshold). Drag β to see how higher momentum keeps more history, though even at β=0.99, gradients older than roughly 460 steps account for less than 1% of the momentum.

When $\beta = 0.9$, roughly the last 6 to 7 gradients account for 50% of the momentum, and gradients more than about 43 steps old together contribute only about 1%. In a continual learning scenario with diverse tasks, this means the optimizer has almost no memory of the loss landscape from earlier tasks, which sets the stage for catastrophic forgetting.
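
The figures above can be reproduced with a few lines of Python. The sketch assumes the usual exponential form, in which a gradient that is $i$ steps old carries weight proportional to $\beta^i$ and the history is long enough to treat the sum as infinite. The function names are hypothetical.

```python
import math


def share_of_last(beta, n):
    """Fraction of total momentum weight carried by the n most recent gradients."""
    return 1.0 - beta ** n


def steps_to_reach(beta, share):
    """Smallest n such that the n most recent gradients carry at least `share`."""
    return math.ceil(math.log(1.0 - share) / math.log(beta))


half_09 = steps_to_reach(0.9, 0.5)        # steps needed for 50% at beta = 0.9
horizon_09 = steps_to_reach(0.9, 0.99)    # steps needed for 99% at beta = 0.9
horizon_099 = steps_to_reach(0.99, 0.99)  # steps needed for 99% at beta = 0.99
```

Under these assumptions the exact cut-offs are 7, 44, and 459 steps. The last 6 gradients carry about 47% of the weight at $\beta = 0.9$, so the quoted figures are rounded versions of the same calculation.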

Chapter 4

Continuum Memory

Forget "short-term vs long-term." Memory is a spectrum.

Continuum Memory System (CMS)
$$\mathbf{y}_t = \text{MLP}^{(f_k)}(\text{MLP}^{(f_{k-1})}(\cdots \text{MLP}^{(f_1)}(\mathbf{x}_t)))$$

A chain of MLP blocks, each updated at a different frequency $f_\ell$. Higher-frequency blocks adapt fast but forget fast. Lower-frequency blocks persist but respond slowly. Together they form a memory spectrum.

Per-level update schedule
$$\boldsymbol{\theta}^{(f_\ell)}_{i+1} = \begin{cases} \boldsymbol{\theta}^{(f_\ell)}_i - \eta^{(\ell)} \sum_{t=i-C^{(\ell)}}^{i} f(\boldsymbol{\theta}^{(f_\ell)}_t; \mathbf{x}_t) & \text{if } i \equiv 0 \pmod{C^{(\ell)}} \\ \boldsymbol{\theta}^{(f_\ell)}_i & \text{otherwise} \end{cases}$$

Each level only updates when its chunk boundary $C^{(\ell)}$ is reached. A standard Transformer is the special case $k=1$ — a single MLP updated only during pre-training.
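
The schedule can be sketched in plain Python. This is a toy under stated assumptions: each block is reduced to a single scalar parameter, the `signal` argument stands in for the update term $f(\boldsymbol{\theta}; \mathbf{x}_t)$, and each level accumulates the signals seen since its last update (a slight simplification of the summation range in the formula). The class and function names are hypothetical.

```python
class Level:
    """One CMS block, reduced to a scalar parameter for illustration."""

    def __init__(self, chunk, lr):
        self.chunk = chunk    # C^(l): number of steps between updates
        self.lr = lr          # eta^(l): learning rate of this level
        self.theta = 0.0      # the level's parameter
        self.buffer = 0.0     # update signal accumulated since the last update
        self.updates = 0      # how many times this level has been written

    def step(self, i, signal):
        self.buffer += signal
        if i % self.chunk == 0:  # chunk boundary reached
            self.theta -= self.lr * self.buffer
            self.buffer = 0.0
            self.updates += 1


def run(levels, signals):
    """Feed the same stream of signals to every level, one step at a time."""
    for i, s in enumerate(signals, start=1):
        for level in levels:
            level.step(i, s)
    return levels


# Hypothetical chunk sizes 1, 4, and 16 over a 16-step stream.
levels = run([Level(1, 0.1), Level(4, 0.1), Level(16, 0.1)], [1.0] * 16)
update_counts = [level.updates for level in levels]
```

Over 16 steps the three levels are written 16, 4, and 1 times. The slow level changes rarely, which is why it can hold on to what the fast level has already overwritten.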

Each vertical dotted line is a new task. With 1 level (single MLP), the model forgets previous tasks as soon as it adapts to the new one. Add more levels and watch how CMS preserves knowledge from earlier tasks. Increase the forgetting rate to stress-test the system.

When a fast-level block forgets, the knowledge is still held in slower blocks, and backpropagation through the initialization can bring it back. This recovery loop is the proposed reason Hope maintains performance at 10M-token context while other models collapse. It is not magic; it is memory management at multiple timescales.

Chapter 5

Self-Referential Learning

What if the model could rewrite its own source code?

Self-referential update rule (DGD with weight decay)
$$\mathbf{k}_t = M_{\mathbf{k},t-1}(\mathbf{x}_t), \quad \mathbf{v}_t = M_{\mathbf{v},t-1}(\mathbf{x}_t), \quad \hat{\mathbf{v}}_{\square,t} = M_{\square,t-1}(\mathbf{v}_t)$$
$$M_{\square,t} = M_{\square,t-1}\left(\alpha_t \mathbf{I} - \eta_t \mathbf{k}_t \mathbf{k}_t^\top\right) - \eta_t\, \nabla L\left(M_{\square,t-1}; \mathbf{k}_t, \hat{\mathbf{v}}_{\square,t}\right)$$

Each memory $M_\square$ (for keys, values, learning rates, gates, and the main memory) generates its own values $\hat{\mathbf{v}}_{\square,t}$ from its current state. The update uses the Delta Gradient Descent rule, which incorporates both the current input and the model's previous state — making it self-referential.
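
A minimal sketch of one self-referential step for a single linear memory, in plain Python. Several things here are assumptions for illustration rather than the paper's implementation: the memory is a square matrix, the key and value are passed in directly instead of being produced by $M_{\mathbf{k}}$ and $M_{\mathbf{v}}$, and the loss is taken to be $L = -\langle M\mathbf{k}, \hat{\mathbf{v}} \rangle$, whose gradient with respect to $M$ is $-\hat{\mathbf{v}}\mathbf{k}^\top$. The function names are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def self_referential_step(M, k, v, eta=1.0, alpha=1.0):
    """One update of a linear memory M toward a value it generates itself."""
    v_hat = matvec(M, v)  # self-generated value: v_hat = M_{t-1}(v)
    Mk = matvec(M, k)     # what the memory currently returns for key k
    # M (alpha I - eta k k^T) gives alpha * M - eta * outer(Mk, k).
    # Under the assumed loss, -eta * grad L gives + eta * outer(v_hat, k).
    M_new = [
        [alpha * mij - eta * mki * kj + eta * vhi * kj for mij, kj in zip(row, k)]
        for row, mki, vhi in zip(M, Mk, v_hat)
    ]
    return M_new, v_hat


# Hypothetical example with a unit-norm key.
M = [[1.0, 0.5], [0.0, 2.0]]
k = [1.0, 0.0]
v = [0.0, 1.0]

M_new, v_hat = self_referential_step(M, k, v)
readout = matvec(M_new, k)
```

With `alpha=1.0`, `eta=1.0`, and a unit-norm key, the updated memory returns `v_hat` when queried with `k`. The target it was trained toward came from the memory's own previous state, which is the sense in which the update is self-referential.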

Compare three memory update rules on the same sequence. Hebbian (fixed) quickly plateaus at a high error. Delta rule adapts better. The self-referential DGD generates its own learning signals and converges fastest. Increase "Memory Capacity" to see how larger state improves all methods but benefits self-referential learning most.

The model does not just learn what to remember; it learns how to remember. The self-generated values $\hat{\mathbf{v}}_{\square,t}$ mean the model controls its own learning dynamics, deciding for itself what is worth storing and what to forget. This is the mechanism credited with letting Hope adapt to completely new languages while nearly eliminating catastrophic forgetting.

Chapter 6

The Hope Architecture

Self-modifying memory meets continuum storage.

Hope forward pass (simplified)
$$\mathbf{o}_t = M_{\text{memory},t-1}(\mathbf{q}_t), \quad \mathbf{k}_t = M_{\mathbf{k},t-1}(\mathbf{x}_t), \quad \mathbf{v}_t = M_{\mathbf{v},t-1}(\mathbf{x}_t), \quad \eta_t = M_{\eta,t-1}(\mathbf{x}_t), \quad \alpha_t = M_{\alpha,t-1}(\mathbf{x}_t)$$
$$M_{\square,t} = M_{\square,t-1}\left(\alpha_t \mathbf{I} - \eta_t \mathbf{k}_t \mathbf{k}_t^\top\right) - \eta_t\, \nabla L(M_{\square,t-1}; \mathbf{k}_t, \hat{\mathbf{v}}_{\square,t})$$
$$\mathbf{y}_t = \text{MLP}^{(f_k)}(\text{MLP}^{(f_{k-1})}(\cdots \text{MLP}^{(f_1)}(\mathbf{o}_t)))$$

Lines 1-2: Self-referential Titans module generates keys, values, learning rates, and gates — then updates all memories using DGD. Line 3: The output passes through the CMS chain for persistent storage at multiple frequencies.

This simulation mirrors the paper's RULER benchmark. More CMS levels and lower update frequencies correspond to better long-range retrieval. Compare with the 1-level baseline (equivalent to standard ICL) to see the gap. The paper finds that a lowest frequency setting of 2K is a sweet spot between efficiency and performance.

Hope outperforms Transformers on language modeling at 760M and 1.3B parameters, reaches 88.4% on single-needle NIAH at 16K context, and maintains performance at 10M tokens — a regime where every other model in the paper has collapsed. The secret is not a bigger model; it is more levels of memory.

Chapter 7

Results & Implications

The numbers do not lie. But the perspective shift matters more.

Key Experimental Results

Language Modeling (1.3B params, 100B tokens): Hope achieves 14.39 perplexity on Wikitext (vs 17.92 for Transformer++) and 58.0% average accuracy across common-sense reasoning tasks — 4.7 points ahead of the best competing model.

Long Context (BABILong): Hope is the only model maintaining performance at 10 million tokens. GPT-4 fails at ~256K. Llama-8B with RAG degrades steadily. Titans and ARMT collapse after 1M. Hope persists.

Continual Learning (CTNL): When learning two new languages sequentially, standard ICL suffers catastrophic forgetting. Hope-3 (3 CMS levels) nearly recovers the single-language performance, which suggests that the multi-level design substantially reduces forgetting.

Class Incremental Learning: On CLINC, Banking, and DBpedia datasets, Hope-enhanced Llama-3B and Llama-8B outperform ICL, EWC, and InCA baselines — including methods with external learners.

Formal Language Recognition: Hope achieves 100% accuracy across all tasks including Parity, Dyck languages, and Shuffle-2 — matching LSTM and beating Transformers (46.4% on Parity), demonstrating the computational depth advantage of nested levels.

The paper's most provocative claim: there is no meaningful distinction between training and test time in a neural learning module. Every phase is just optimization at a different frequency. Pre-training is in-context learning with an ultra-large context. Test-time training is in-context learning with a small context. The only difference is how often you update, and how much you remember.

Conceptual Takeaways

Architectures generate the context for optimizers. The gradient landscape that an optimizer sees is produced by the architecture. Different architectures produce different gradient patterns, which means the optimal optimizer depends on the architecture. One-size-fits-all optimization may be leaving performance on the table.

Models have more parameters than we knew. Momentum terms, attention states, recurrent hidden states — all are parameters that store knowledge. They just update at different frequencies. When we discard momentum at the end of training, we are throwing away compressed knowledge about the loss landscape.

In-context learning is not emergent — it is structural. Having multiple nested levels is in-context learning. Transformers do it at infinite frequency (non-parametric). RNNs do it at finite frequency (parametric). The quality of ICL depends on how well the lower-frequency levels (pre-training) prepare the higher-frequency levels for fast adaptation.

The illusion of deep learning architecture. What looks like a heterogeneous mix of attention, MLP, convolutions, and recurrent layers is, from the NL viewpoint, a set of uniform feedforward networks optimized at different frequencies with different objectives. The heterogeneity is an artifact of viewing the solution rather than the optimization process.