Yang Liu, Toan Nguyen, Flora D. Salim · UNSW · May 2026 · arXiv:2605.20247
The paper, in plain English
When a large language model learns a new task, it tends to destroy what it already knew about the old ones. This is catastrophic forgetting, and it has been the central roadblock to building AI systems that genuinely learn over time. The authors ask a pointed question: what if, before committing any new knowledge, you first probe how that knowledge would interfere with what you already have?
The answer is CP-MoE. Think of it like a hospital that runs allergy tests before administering a new drug. A disposable "transient expert" briefly practices on a warm-up sample of the new task, creating a map of which parameters matter most. That map then selectively shields the existing experts during the real update — strong protection where it counts, freedom to adapt where it does not. At the same time, a routing mechanism measures how compatible each new input is with each expert, steering traffic toward the right specialist instead of spreading every task across all experts uniformly.
The result: on the SuperNI language benchmark, CP-MoE achieves 50.84% average performance with only 0.62% forgetting, roughly an eightfold reduction from the 5.24% baseline. On the multimodal VQA v2 benchmark, forgetting drops to essentially zero (-0.35%), while average performance reaches 62.30%, beating every prior method. The framework adds only a single temporary module per task and discards it afterward.
I
Transient Expert Probing
A disposable LoRA adapter briefly practices on warm-up tokens, producing a prospective importance mask that reveals which parameters the new task wants to change — before any permanent update happens.
II
Consistency-Preserving Routing
A CKA-based similarity score between the transient expert and each stable expert injects a routing bias that steers inputs toward semantically compatible specialists, preventing load-balancing from overriding expert specialisation.
III
Representation-Guided Protection
The importance mask is accumulated with CKA-weighted gating, so experts aligned with the current task receive targeted parameter protection while unrelated experts stay unconstrained.
Chapter 1
The Forgetting Problem
A model that learns task 8 should not forget task 1. And yet, without protection, that is exactly what happens.
Formally, after training on task $t-1$, the model is adapted to dataset $\mathcal{D}_t$ without access to past datasets $\{\mathcal{D}_1, \ldots, \mathcal{D}_{t-1}\}$. The task loss for autoregressive generation is:
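A minimal sketch of that loss, assuming the standard token-level negative log-likelihood over input-target pairs $(x, y)$. The pair notation and the absence of length normalisation are assumptions here, not details taken from the paper:

$$\mathcal{L}_t^{\text{task}}(\Phi) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_t} \left[ \sum_{j=1}^{|y|} \log p\left(y_j \mid x, y_{<j};\, \Theta_{\text{frozen}}, \Phi\right) \right]$$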
Here $\Theta_{\text{frozen}}$ are the frozen backbone parameters and $\Phi$ are the trainable adaptation parameters. The goal is to minimise $\mathcal{L}_t^{\text{task}}$ while preserving performance on all previous tasks — a constraint the loss itself does not enforce.
Catastrophic Forgetting vs. CP-MoE (interactive chart; slider ranges from None to Severe)
Baseline Avg Performance: 47.05%
CP-MoE Avg Performance: 50.84%
Baseline Forgetting: 5.24%
CP-MoE Forgetting: 0.62%
Why this matters
Without protection, sequential training produces a model that is excellent at the last task and terrible at everything else. The gap between the red and navy bars in the simulation above is precisely the gap that CP-MoE closes — not by freezing everything, but by selectively protecting what matters.
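To make the two headline metrics concrete, here is a small sketch of how average performance and average forgetting are commonly computed from a matrix of per-task scores. The definitions below (final-row mean, and just-learned score minus final score) are the usual continual-learning conventions and are an assumption; the paper may define them slightly differently. The toy score matrix is made up for illustration.

```python
import numpy as np

def average_performance(scores: np.ndarray) -> float:
    """Mean score over all tasks after the final task has been learned."""
    return float(scores[-1].mean())

def average_forgetting(scores: np.ndarray) -> float:
    """Mean drop from the just-learned score to the final score.

    scores[t, i] is the score on task i after training through task t.
    The last task is excluded because it cannot have been forgotten yet.
    A negative value means earlier tasks improved on average.
    """
    just_learned = np.diag(scores)[:-1]
    final = scores[-1, :-1]
    return float((just_learned - final).mean())

# Toy 3-task example (made-up numbers, not results from the paper).
scores = np.array([
    [60.0,  0.0,  0.0],
    [55.0, 70.0,  0.0],
    [50.0, 66.0, 80.0],
])

print(average_performance(scores))   # mean of the last row
print(average_forgetting(scores))    # mean of (60-50) and (70-66)

# Ratio between the two forgetting figures quoted above.
print(round(5.24 / 0.62, 2))         # 8.45
```

The last line is simply the arithmetic behind the baseline-versus-CP-MoE comparison: 5.24% is about 8.5 times 0.62%.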
Instead of updating every parameter for every task, MoE routes each input through a small subset of lightweight experts.
The $i$-th LoRA expert computes a low-rank adaptation:
$$E_i(x) = B_i A_i x$$
where $A_i \in \mathbb{R}^{r \times d}$ and $B_i \in \mathbb{R}^{k \times r}$ with rank $r \ll \min(d, k)$. The gating function determines each expert's contribution:
$$G(x) = \text{Softmax}(xW_{\text{gate}})$$
Let $K(x)$ denote the indices of the top-$m$ experts selected by $G(x)$. The adapted FFN output combines the frozen backbone with the sparse expert contributions:
This sparse activation enables input-dependent adaptation while decoupling total expert capacity from per-token computation.
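The pieces above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the combination rule is taken to be the frozen FFN output plus the gate-weighted sum of the selected experts, the gate values are used without renormalising over the selected set, and the names `lora_moe_forward`, `ffn`, and `W_frozen` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def lora_moe_forward(x, ffn, A, B, W_gate, m):
    """Sparse LoRA-MoE output for one token.

    x: (d,) input; ffn: frozen map from (d,) to (k,)
    A: (N, r, d) and B: (N, k, r) stacked LoRA factors; W_gate: (d, N)
    """
    gate = softmax(x @ W_gate)        # G(x) over all N experts
    top = np.argsort(gate)[-m:]       # K(x): indices of the m largest gate values
    out = ffn(x)                      # frozen backbone path
    for i in top:
        out = out + gate[i] * (B[i] @ (A[i] @ x))   # E_i(x) = B_i A_i x
    return out, top

rng = np.random.default_rng(0)
d, k, r, N = 8, 6, 2, 4
W_frozen = rng.normal(size=(k, d))
ffn = lambda v: W_frozen @ v          # stand-in for the frozen FFN
A = rng.normal(size=(N, r, d))
B = np.zeros((N, k, r))               # zero-initialised B: experts start as no-ops
W_gate = rng.normal(size=(d, N))
x = rng.normal(size=d)

out, top = lora_moe_forward(x, ffn, A, B, W_gate, m=2)
```

Only `m` of the `N` experts are evaluated per token, so adding experts grows capacity without growing the per-token cost of the loop.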
LoRA-MoE Architecture (interactive diagram)
Why this matters
Standard MoE works well for a single task. The problem is continual learning: when tasks arrive sequentially, the router can send semantically mismatched inputs to historical experts, causing interference. The architecture alone does not solve forgetting — it just creates the partition that makes a solution possible.
Before touching the real experts, CP-MoE sends in a disposable scout to map the terrain.
At the beginning of each task $t$, a transient expert $\phi_t^{\text{TE}} = \{A_t^{\text{TE}}, B_t^{\text{TE}}\}$ is instantiated and trained on a warm-up subset $\hat{\mathcal{D}}_t \subseteq \mathcal{D}_t$ while keeping both $\Theta_{\text{frozen}}$ and $\Phi$ fixed:
The transient expert is initialised at zero ($E_t^{\text{TE}}(x) = 0$ at $s=0$) so it starts with no effect. After warm-up, the path-integral rule accumulates prospective importance:
where $g_{t,s,k} = \nabla_\phi \hat{\mathcal{L}}_t^{\text{task}}(\phi_{t,s}^{\text{TE}})_k$ and $\Delta\phi_{t,s,k}^{\text{TE}} = \phi_{t,s+1,k}^{\text{TE}} - \phi_{t,s,k}^{\text{TE}}$. This is then normalised with damping $\xi > 0$:
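A minimal sketch of the accumulation and normalisation steps, assuming plain SGD for the warm-up and a Synaptic-Intelligence-style normalisation (accumulated importance divided by squared total displacement plus $\xi$). The exact normalisation in the paper may differ; `prospective_importance` and the toy quadratic loss are hypothetical and exist only for illustration.

```python
import numpy as np

def prospective_importance(grad_fn, phi0, lr=0.1, steps=20, xi=1e-3):
    """Path-integral importance along a short SGD trajectory.

    grad_fn(phi) returns the warm-up loss gradient at phi.
    Returns one non-negative importance value per parameter.
    """
    phi = phi0.astype(float).copy()
    omega = np.zeros_like(phi)
    for _ in range(steps):
        g = grad_fn(phi)              # g_{t,s,k}
        delta = -lr * g               # Delta phi_{t,s,k} for a plain SGD step
        omega += -g * delta           # equals lr * g**2 here, so never negative
        phi += delta
    total_shift = phi - phi0
    # Assumed SI-style normalisation with damping xi > 0.
    return omega / (total_shift ** 2 + xi)

# Toy warm-up loss: 0.5 * sum((phi - target)**2); the third parameter is already optimal.
target = np.array([2.0, 0.5, 0.0])
grad_fn = lambda phi: phi - target

Omega = prospective_importance(grad_fn, np.zeros(3))
print(Omega)   # third entry is 0: the new task never asks that parameter to move
```

Parameters the warm-up never pushes on receive zero importance, which is what lets the later protection leave them free to adapt.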
Under a local quadratic approximation of the loss, the transient expert's $S$-step trajectory induces a spectrally filtered multi-step adaptation direction: flatter directions of the Hessian accumulate more strongly, while high-curvature directions are attenuated. This means the transient expert captures more than a single gradient — it estimates the direction that remains useful under repeated descent.
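The filtering claim can be checked numerically on a toy quadratic. The sketch below assumes a diagonal Hessian, gradient descent from zero, and equal initial gradient in every direction; the eigenvalues, step size, and step count are illustrative choices, not values from the paper. It compares the displacement after $S$ steps with a single gradient step, per eigen-direction.

```python
import numpy as np

eigvals = np.array([0.1, 1.0, 5.0])   # Hessian eigenvalues, from flat to sharp
b = np.ones(3)                        # same initial pull in every direction
lr, S = 0.1, 10

# Gradient descent on L(phi) = 0.5 * sum(eigvals * phi**2) - b @ phi, from phi = 0.
phi = np.zeros(3)
for _ in range(S):
    grad = eigvals * phi - b
    phi = phi - lr * grad

single_step = lr * b                  # what one gradient step would have moved
gain = phi / single_step              # multi-step amplification per direction

# Closed form of the same filter: (1 - (1 - lr*lam)**S) / (lr*lam)
closed_form = (1 - (1 - lr * eigvals) ** S) / (lr * eigvals)

print(np.round(gain, 2))              # roughly 9.6, 6.5, 2.0
```

The flattest direction is amplified almost $S$-fold, while the sharpest saturates near $1/(\eta\lambda)$ in units of a single step. That is the sense in which the trajectory carries more information than one gradient.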
where $\alpha \geq 0$ controls the strength of the CP bias. The final routing weights over the selected top-$m$ experts, indexed by $K_t$, are:
$$g_{i,t} = \frac{\exp(\tilde{s}_{i,t})}{\sum_{j \in K_t} \exp(\tilde{s}_{j,t})}, \quad i \in K_t$$
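As a sketch of how such a bias changes routing, the snippet below assumes the biased logit is additive, $\tilde{s}_{i,t} = s_{i,t} + \alpha c_i$ with $c_i$ the CKA score of expert $i$, and that the top-$m$ selection is made on the biased logits. Both are assumptions for illustration, as are the function name `cp_route` and the example numbers.

```python
import numpy as np

def cp_route(native_logits, cka_scores, alpha=0.2, m=2):
    """Select top-m experts on biased logits and renormalise over that set."""
    biased = native_logits + alpha * cka_scores     # assumed additive CP bias
    top = np.argsort(biased)[::-1][:m]              # K_t, best expert first
    z = biased[top] - biased[top].max()             # stable softmax over K_t only
    weights = np.exp(z) / np.exp(z).sum()
    return top, weights

native = np.array([1.0, 0.9, 0.2, 0.1])    # router's own preference
cka = np.array([0.1, 0.9, 0.95, 0.2])      # similarity to the transient expert

top_plain, w_plain = cp_route(native, cka, alpha=0.0)
top_cp, w_cp = cp_route(native, cka, alpha=0.2)
print(top_plain, top_cp)   # expert 0 leads without the bias, expert 1 with it
```

With a modest $\alpha$, the bias only reorders experts whose native logits were already close, which matches its role as a nudge rather than an override.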
The load-balancing auxiliary loss is retained, but computed only from native logits $s_{i,t}$ without the CP bias, keeping the two mechanisms decoupled:
The CP bias resolves a fundamental tension in MoE continual learning: load balancing wants all experts equally busy, but expert specialisation wants inputs routed to the right expert. By grounding the routing bonus in representation similarity rather than usage statistics, CP-MoE achieves both — balanced utilisation and semantically coherent routing.
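For readers unfamiliar with CKA, here is a compact sketch of the linear variant, which compares two sets of activations computed on the same inputs. Whether CP-MoE uses linear or kernel CKA, and on which activations, is not stated in this text, so treat the choice below as an assumption.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n, p) and Y (n, q) for the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # centre each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

same = linear_cka(X, X)                  # identical representations
rotated = linear_cka(X, 3.0 * X @ Q)     # rotated and rescaled copy
unrelated = linear_cka(X, rng.normal(size=(64, 8)))
```

The score lies between 0 and 1 and is unchanged by rotating or uniformly rescaling either representation, which is why it can compare experts whose internal coordinates differ.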
Not every parameter deserves the same shield. Protection should scale with alignment.
The importance mask $\Omega_t = \{\Omega_{t,A}, \Omega_{t,B}\}$ from the transient expert is accumulated into each stable expert's running importance, weighted by the CKA consistency score:
where $\langle U, V \rangle$ denotes the Frobenius inner product and $\odot$ the Hadamard product. Experts with higher CKA alignment accumulate stronger importance weights, so their parameters are more heavily penalised for drifting.
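A minimal sketch of the accumulation and the resulting drift penalty, assuming the running importance is updated additively with the CKA score as a scalar weight and the penalty is the Frobenius inner product of importance with squared drift. For brevity each expert is a single array rather than separate $A$ and $B$ factors; the function names are hypothetical.

```python
import numpy as np

def accumulate_importance(running, omega_t, cka_scores):
    """Add the task's importance mask to each expert, scaled by its CKA score."""
    return [R + c * omega_t for R, c in zip(running, cka_scores)]

def protection_penalty(running, params, anchors):
    """Per-expert <Omega_i, (phi_i - phi_i_old) * (phi_i - phi_i_old)>."""
    return [float(np.sum(R * (p - a) ** 2))
            for R, p, a in zip(running, params, anchors)]

omega_t = np.ones((2, 2))                    # toy importance mask from the transient expert
running = [np.zeros((2, 2)), np.zeros((2, 2))]
cka_scores = [0.9, 0.1]                      # expert 0 is aligned with the task, expert 1 is not

running = accumulate_importance(running, omega_t, cka_scores)

anchors = [np.zeros((2, 2)), np.zeros((2, 2))]            # parameters before the task
params = [np.full((2, 2), 0.1), np.full((2, 2), 0.1)]     # identical drift in both experts

per_expert = protection_penalty(running, params, anchors)
print(per_expert)   # the aligned expert pays about nine times more for the same drift
```

The same amount of drift costs the aligned expert far more than the unrelated one, so gradient descent moves the unrelated expert first.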
Parameter Drift Under CKA-Weighted Protection (interactive chart; slider ranges from 0 (none) to 10000; readouts: average parameter drift, most protected expert, least protected expert)
Why this matters
Without CKA-weighted gating, the ablation shows forgetting rises from 0.62% to 1.50% and accuracy drops by ~1%. A uniform regularisation scheme cannot distinguish between experts that genuinely need protection and those that should be free to adapt. The alignment signal is what makes the protection surgical rather than blanket.
where $\lambda$ and $\gamma$ are balancing coefficients. In the paper's primary configuration: $\alpha = 0.2$, $\lambda = 5 \times 10^3$, $\gamma = 0.1$.
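Read as a weighted sum, the overall objective would take roughly the following form. Which term each coefficient multiplies is inferred here from the magnitudes above (a large $\lambda$ on the protection penalty, a small $\gamma$ on the load-balancing loss), and the labels $\mathcal{L}_t^{\text{reg}}$ and $\mathcal{L}_t^{\text{aux}}$ are placeholders rather than the paper's notation:

$$\mathcal{L}_t = \mathcal{L}_t^{\text{task}} + \lambda\, \mathcal{L}_t^{\text{reg}} + \gamma\, \mathcal{L}_t^{\text{aux}}$$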
Loss Dynamics and Hyperparameter Trade-offs (interactive charts; three sliders ranging from 0 to 10000, 0 to 1.0, and 0 to 1.0)
Why this matters
The hyperparameter sensitivity analysis (Table 6 in the paper) shows that CP-MoE's performance is robust across a wide range of $\lambda$ and $\alpha$. The model experiences severe forgetting only when $\alpha \to 0$, confirming that the CKA-guided semantic consistency is the primary driver — not careful tuning of any single coefficient.
Two benchmarks, two modalities, one consistent story: CP-MoE forgets less and generalises more.
SuperNI Average Performance: 50.84% (best of all methods)
Average Forgetting on SuperNI: 0.62% (down from 5.24% baseline)
VQA v2 Average Forgetting: -0.35% (essentially zero forgetting)
Benchmark Explorer
SuperNI
LLaMA-2-7B · 8 sequential tasks · ROUGE-L
Unimodal language generation benchmark covering summarisation, sentiment, extraction, and dialogue.
VQA v2
LLaVA-1.5-7B · 10 sequential tasks · Accuracy
Multimodal visual reasoning benchmark with tasks from recognition to causal reasoning.
Ablation Study
SuperNI · Component contribution
Isolate the effect of each CP-MoE component: TE-Reg, CP-Bias, and CKA Mask.
Average Performance: 50.84%
Average Forgetting: 0.62%
Zero-Shot Transfer: 35.80%
What the ablation teaches
Removing just the TE regularisation sends forgetting from 0.62% to 5.24% — an eightfold increase. Removing just the CP bias raises forgetting to 1.50%. Both components are necessary, and they complement each other: TE-Reg protects parameters, CP-Bias protects routing. Together they achieve the lowest forgetting and the highest accuracy.