Every deep learning model you have used — from GPT-4 to Midjourney — is frozen the moment training ends. It can juggle tokens inside its context window, but it cannot write new knowledge into its own weights. The authors compare this to anterograde amnesia: the patient remembers everything before the accident, experiences the present vividly, but cannot form new long-term memories. Current LLMs have the same condition.
The paper's central insight is that the distinction between "architecture" and "optimizer" is an illusion. Both are associative memories compressing their own context flow, just at different frequencies. Gradient descent memorizes input-to-surprise mappings. Momentum is a second-level memory compressing past gradients. Adam is the optimal memory for a specific loss function. Once you see the pattern, you can stack more levels, design better optimizers, and build models that modify their own learning rules, much as the human brain is thought to pair fast gamma waves for perception with slow delta waves for memory consolidation.
Out of this theory comes Hope, a model that combines self-modifying sequence processing with a multi-frequency memory system. It matches or beats Transformers on language modeling, maintains performance at 10 million tokens of context, and nearly eliminates catastrophic forgetting when learning new languages sequentially. The paper does not just propose a new architecture — it offers a new lens through which every existing architecture can be understood.
Your language model was state-of-the-art when training ended. Then the world kept turning.
The weight $W_t$ is updated during training on data $\mathbf{x}_t$, but after "end of pre-training" this gradient flow stops entirely. The model can only adapt through its context window — never through its persistent parameters.
This proximal-gradient reformulation shows that training a layer with backpropagation is equivalent to memorizing the mapping from each layer's input $\hat{\mathbf{x}}_{\ell-1}$ to its local surprise signal $\boldsymbol{\delta}_\ell$ — how "surprising" the output was. The network is a surprise compressor.
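As a worked sketch of that equivalence (reconstructed here from the symbols in the paragraph above, not quoted from the paper), consider one layer with weights $W_\ell^{t}$ and an assumed step size $\eta$. A proximal step that aligns the layer's output with the surprise signal while staying close to the current weights has a closed-form solution, and that solution is the familiar outer-product gradient update:

$$
W_\ell^{t+1}
= \arg\min_{W} \; \big\langle W \hat{\mathbf{x}}_{\ell-1},\, \boldsymbol{\delta}_\ell \big\rangle + \frac{1}{2\eta} \big\| W - W_\ell^{t} \big\|_F^2
= W_\ell^{t} - \eta\, \boldsymbol{\delta}_\ell\, \hat{\mathbf{x}}_{\ell-1}^\top
$$

Setting the gradient of the objective to zero gives $\boldsymbol{\delta}_\ell \hat{\mathbf{x}}_{\ell-1}^\top + \tfrac{1}{\eta}(W - W_\ell^{t}) = 0$, which rearranges to the right-hand side. Each step therefore writes the pair (input, surprise) into the weights as an outer product, which is the sense in which the layer acts as an associative memory.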
The key insight: backpropagation is not just an optimization algorithm — it is an associative memory that maps inputs to prediction errors. Once training stops, the memory stops updating. Nested Learning asks what happens if we never stop.
What if "architecture" and "optimizer" are the same thing, just running at different speeds?
Components are sorted into levels by update frequency, meaning how often each one changes its state. Attention runs every token. MLP weights update once per pre-training step. Momentum updates every backward pass. Each level has its own context flow: the data it learns from.
Each level $k$ contains optimization problems with their own keys $\mathbf{k}$, values $\mathbf{v}$, and objectives $L$ (the bold $\mathbf{k}$ is a key vector, distinct from the level index $k$). All are optimized with gradient descent, but on different timescales.
The deeper insight: both architecture and optimizer are instances of the same pattern — associative memories compressing their context flow at different frequencies. This means you can design new architectures by adding more levels, and new optimizers by using more expressive memory structures.
Your optimizer is not finding a solution. It is memorizing the loss landscape.
The momentum term $\mathbf{m}$ is itself optimized by gradient descent — a 2-level nested system where the inner level learns to compress gradients and the outer level uses the compressed knowledge to update weights.
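A minimal sketch of the "momentum is itself trained by gradient descent" reading, in plain Python. The inner objective used here, $\langle \mathbf{m}, \mathbf{g} \rangle + \frac{1-\alpha}{2\eta}\|\mathbf{m}\|^2$, is an assumption chosen so that the algebra works out for heavy-ball momentum; the paper's exact objective may differ. The function names, `alpha`, and `eta` are illustrative.

```python
def momentum_step(m, g, alpha, eta):
    """Heavy-ball momentum update: m <- alpha * m - eta * g."""
    return [alpha * mi - eta * gi for mi, gi in zip(m, g)]


def inner_gd_step(m, g, alpha, eta):
    """One gradient-descent step of size eta on the ASSUMED inner objective
        l(m) = <m, g> + (1 - alpha) / (2 * eta) * ||m||^2
    whose gradient with respect to m is g + ((1 - alpha) / eta) * m."""
    grad = [gi + (1.0 - alpha) / eta * mi for mi, gi in zip(m, g)]
    return [mi - eta * gr for mi, gr in zip(m, grad)]


def run(gradients, alpha=0.9, eta=0.1):
    """Feed the same gradient stream to both update rules."""
    m_a = [0.0] * len(gradients[0])
    m_b = list(m_a)
    for g in gradients:
        m_a = momentum_step(m_a, g, alpha, eta)
        m_b = inner_gd_step(m_b, g, alpha, eta)
    return m_a, m_b


grads = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
m_a, m_b = run(grads)
# The two trajectories agree up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(m_a, m_b)))
```

Under this assumption the inner level is an ordinary learner whose "data" is the stream of gradients, and the outer level would then apply the result to the weights (for example `W <- W + m`).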
In the standard GD view, the update term is simply added to $W_t$. DGD additionally applies an adaptive, data-dependent decay $\mathbf{I} - \eta'_t \mathbf{x}_t \mathbf{x}_t^\top$ to the current weights. The update therefore depends on the model's current state as well as the input, which makes it self-referential: the model generates its own learning signal.
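To see where a decay factor of this shape can come from, here is a small sketch using the classic delta rule. It assumes a squared-error objective $\tfrac{1}{2}\|W\mathbf{x} - \mathbf{v}\|^2$ and a single step size `eta`; the paper's DGD may use a more general loss and separate step sizes, so treat this as an illustration rather than the authors' rule. Helper names are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def gd_form(W, x, v, eta):
    """Gradient step on 0.5 * ||W x - v||^2:  W <- W - eta * (W x - v) x^T."""
    err = [yi - vi for yi, vi in zip(matvec(W, x), v)]
    return [[w - eta * e * xj for w, xj in zip(row, x)]
            for row, e in zip(W, err)]


def decay_form(W, x, v, eta):
    """The same step rewritten as  W (I - eta * x x^T) + eta * v x^T."""
    n = len(x)
    decay = [[(1.0 if i == j else 0.0) - eta * x[i] * x[j] for j in range(n)]
             for i in range(n)]
    decayed = [[sum(row[k] * decay[k][j] for k in range(n)) for j in range(n)]
               for row in W]
    return [[d + eta * vi * xj for d, xj in zip(drow, x)]
            for drow, vi in zip(decayed, v)]


W = [[0.5, -1.0], [2.0, 0.25]]
x = [0.6, 0.8]          # unit-norm key
v = [1.0, -1.0]         # target value
W_gd = gd_form(W, x, v, eta=1.0)
W_decay = decay_form(W, x, v, eta=1.0)
print(W_gd)
print(W_decay)
print(matvec(W_gd, x))  # with a unit-norm x and eta = 1 this recovers v
```

The decay term erases whatever the memory previously stored along the direction of $\mathbf{x}$ before the new value is written, which is why the step depends on the current weights and not only on the incoming data.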
When $\beta = 0.9$, roughly the last 6 to 7 gradients account for half of the momentum, and gradients more than about 43 steps old contribute only around 1% in total. In a continual learning scenario with diverse tasks, this means the optimizer has almost no memory of the loss landscape from earlier tasks, which sets the stage for catastrophic forgetting.
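The arithmetic behind these figures can be checked in a few lines. The sketch assumes the momentum is a normalized exponential moving average, so a gradient that is $i$ steps old carries weight $(1-\beta)\beta^i$ and the $n$ most recent gradients together carry $1-\beta^n$; the function names are illustrative.

```python
def recent_mass(beta, n):
    """Share of total EMA weight held by the n most recent gradients,
    assuming weights (1 - beta) * beta**i for a gradient i steps old."""
    return 1.0 - beta ** n


def steps_to_cover(beta, fraction):
    """Smallest n whose n most recent gradients hold at least `fraction`."""
    n = 0
    while recent_mass(beta, n) < fraction:
        n += 1
    return n


beta = 0.9
print(recent_mass(beta, 6))        # share held by the last 6 gradients
print(steps_to_cover(beta, 0.5))   # steps needed to reach one half
print(beta ** 43)                  # share held by gradients older than 43 steps
print(steps_to_cover(beta, 0.99))  # steps needed to reach 99%
```

Under this weighting the last 6 gradients hold about 47% of the mass, the halfway mark is crossed at the 7th, and everything older than 43 steps holds about 1.1%, so the quoted figures are close roundings.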
Forget "short-term vs long-term." Memory is a spectrum.
A chain of MLP blocks, each updated at a different frequency $f_\ell$. Higher-frequency blocks adapt fast but forget fast. Lower-frequency blocks persist but respond slowly. Together they form a memory spectrum.
Each level $\ell$ updates only when its chunk boundary $C^{(\ell)}$ is reached. A standard Transformer is the special case of a single level ($k = 1$): one MLP block, updated only during pre-training.
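The chunk-boundary schedule can be sketched with a toy in which each level's "weights" are a single number. This is a deliberate simplification: the real blocks are MLPs trained by gradient steps and chained so that one block feeds the next, whereas here every level sees the same input and simply moves to the mean of its last completed chunk. The class name, chunk sizes, and update rule are all assumptions made for illustration.

```python
class ToyLevel:
    """Stand-in for one memory block that updates once per chunk."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.state = 0.0       # stand-in for the block's weights
        self.buffer = []       # inputs seen since the last update
        self.num_updates = 0

    def observe(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.chunk_size:   # chunk boundary reached
            mean = sum(self.buffer) / len(self.buffer)
            self.state += mean - self.state       # toy update: jump to chunk mean
            self.buffer.clear()
            self.num_updates += 1


def run_stream(stream, chunk_sizes):
    levels = [ToyLevel(c) for c in chunk_sizes]
    for x in stream:
        for level in levels:
            level.observe(x)
    return levels


# 16 steps of an "old" signal (1.0) followed by 4 steps of a "new" one (0.0).
stream = [1.0] * 16 + [0.0] * 4
levels = run_stream(stream, chunk_sizes=[1, 4, 16])
print([lv.num_updates for lv in levels])  # [20, 5, 1]
print([lv.state for lv in levels])        # [0.0, 0.0, 1.0]
```

The two fast levels have already overwritten the old signal, while the slowest level still holds it, which is the fast-adapting versus slow-persisting trade-off the memory spectrum is meant to balance.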
When a fast-level block forgets something, the knowledge can still live in the slower blocks, and backpropagation through the initialization can circulate it back into the fast block. This recovery loop is the paper's explanation for why Hope maintains performance at 10M-token context while other models collapse. It is not magic; it is memory management at multiple timescales.
What if the model could rewrite its own source code?
Each memory $M_\square$ (for keys, values, learning rates, gates, and the main memory) generates its own values $\hat{\mathbf{v}}_{\square,t}$ from its current state. The update uses the Delta Gradient Descent rule, which incorporates both the current input and the model's previous state — making it self-referential.
The model does not just learn what to remember — it learns how to remember. The self-generated values $\hat{\mathbf{v}}_{\square,t}$ mean the model controls its own learning dynamics, deciding for itself what is worth storing and what to forget. This is the mechanism that lets Hope adapt to completely new languages without catastrophic forgetting.
Self-modifying memory meets continuum storage.
Lines 1-2: Self-referential Titans module generates keys, values, learning rates, and gates — then updates all memories using DGD. Line 3: The output passes through the CMS chain for persistent storage at multiple frequencies.
Hope outperforms Transformers on language modeling at 760M and 1.3B parameters, reaches 88.4% on single-needle NIAH at 16K context, and maintains performance at 10M tokens — a regime where every other model in the paper has collapsed. The secret is not a bigger model; it is more levels of memory.
The numbers do not lie. But the perspective shift matters more.
Language Modeling (1.3B params, 100B tokens): Hope achieves 14.39 perplexity on Wikitext (vs 17.92 for Transformer++) and 58.0% average accuracy across common-sense reasoning tasks — 4.7 points ahead of the best competing model.
Long Context (BABILong): Hope is the only model maintaining performance at 10 million tokens. GPT-4 fails at ~256K. Llama-8B with RAG degrades steadily. Titans and ARMT collapse after 1M. Hope persists.
Continual Learning (CTNL): When learning two new languages sequentially, standard ICL suffers catastrophic forgetting. Hope-3 (3 CMS levels) nearly recovers the single-language performance, which suggests that the multi-level design mitigates forgetting.
Class Incremental Learning: On CLINC, Banking, and DBpedia datasets, Hope-enhanced Llama-3B and Llama-8B outperform ICL, EWC, and InCA baselines — including methods with external learners.
Formal Language Recognition: Hope achieves 100% accuracy across all tasks including Parity, Dyck languages, and Shuffle-2 — matching LSTM and beating Transformers (46.4% on Parity), demonstrating the computational depth advantage of nested levels.
The paper's most provocative claim: there is no meaningful distinction between training and test time in a neural learning module. Every phase is just optimization at a different frequency. Pre-training is in-context learning with an ultra-large context. Test-time training is in-context learning with a small context. The only difference is how often you update, and how much you remember.
Architectures generate the context for optimizers. The gradient landscape that an optimizer sees is produced by the architecture. Different architectures produce different gradient patterns, which means the optimal optimizer depends on the architecture. One-size-fits-all optimization may be leaving performance on the table.
Models have more parameters than we knew. Momentum terms, attention states, recurrent hidden states — all are parameters that store knowledge. They just update at different frequencies. When we discard momentum at the end of training, we are throwing away compressed knowledge about the loss landscape.
In-context learning is not emergent — it is structural. Having multiple nested levels is in-context learning. Transformers do it at infinite frequency (non-parametric). RNNs do it at finite frequency (parametric). The quality of ICL depends on how well the lower-frequency levels (pre-training) prepare the higher-frequency levels for fast adaptation.
The illusion of deep learning architecture. What looks like a heterogeneous mix of attention, MLP, convolutions, and recurrent layers is, from the NL viewpoint, a set of uniform feedforward networks optimized at different frequencies with different objectives. The heterogeneity is an artifact of viewing the solution rather than the optimization process.