An Interactive Reading of

CP-MoE:
Consistency-Preserving
Mixture-of-Experts

for Continual Learning

Yang Liu, Toan Nguyen, Flora D. Salim
UNSW · May 2026 · arXiv:2605.20247

The paper, in plain English

When a large language model learns a new task, it tends to destroy what it already knew about the old ones. This is catastrophic forgetting, and it has been the central roadblock to building AI systems that genuinely learn over time. The authors ask a pointed question: what if, before committing any new knowledge, you first probe how that knowledge would interfere with what you already have?

The answer is CP-MoE. Think of it like a hospital that runs allergy tests before administering a new drug. A disposable "transient expert" briefly practices on a warm-up sample of the new task, creating a map of which parameters matter most. That map then selectively shields the existing experts during the real update — strong protection where it counts, freedom to adapt where it does not. At the same time, a routing mechanism measures how compatible each new input is with each expert, steering traffic toward the right specialist instead of spreading every task across all experts uniformly.

The result: on the SuperNI language benchmark, CP-MoE achieves 50.84% average performance with only 0.62% forgetting, roughly an eightfold reduction from the 5.24% baseline. On the multimodal VQA v2 benchmark, forgetting drops to nearly zero (-0.35%), while average performance hits 62.30%, beating every prior method. The framework adds only a single temporary module per task and discards it afterward.

I

Transient Expert Probing

A disposable LoRA adapter briefly practices on warm-up tokens, producing a prospective importance mask that reveals which parameters the new task wants to change — before any permanent update happens.

II

Consistency-Preserving Routing

A CKA-based similarity score between the transient expert and each stable expert injects a routing bias that steers inputs toward semantically compatible specialists, preventing load-balancing from overriding expert specialisation.

III

Representation-Guided Protection

The importance mask is accumulated with CKA-weighted gating, so experts aligned with the current task receive targeted parameter protection while unrelated experts stay unconstrained.

Chapter 1

The Forgetting Problem

A model that learns task 8 should not forget task 1. And yet, without protection, that is exactly what happens.

In plain English

Imagine you are studying for eight final exams, one after another. You cram for biology on Monday, then move to chemistry on Tuesday. By the time you reach history on Friday, the Krebs cycle has evaporated from your memory. That is catastrophic forgetting: new learning overwrites old learning because the same neural weights are being recycled.

Large language models face the same problem. They share a single set of parameters across all tasks. When those parameters are tuned for a new task, the fine-tuning physically erases the traces left by previous tasks. The bigger the model and the more tasks you chain together, the worse it gets.

Drag the "Forgetting Severity" slider in the simulation below and watch the red bars collapse on the earlier tasks — that is the baseline. The navy bars show CP-MoE, which holds steady even when the forgetting pressure is cranked up.

Formally, after training on task $t-1$, the model is adapted to dataset $\mathcal{D}_t$ without access to past datasets $\{\mathcal{D}_1, \ldots, \mathcal{D}_{t-1}\}$. The task loss for autoregressive generation is:

$$\mathcal{L}_t^{\text{task}}(\Phi) = -\mathbb{E}_{(X,Y)\sim\mathcal{D}_t}\left[\sum_{j=1}^{|Y|} \log p_{\Theta_{\text{frozen}},\Phi}(Y_j \mid X, Y_{<j})\right]$$

Here $\Theta_{\text{frozen}}$ are the frozen backbone parameters and $\Phi$ are the trainable adaptation parameters. The goal is to minimise $\mathcal{L}_t^{\text{task}}$ while preserving performance on all previous tasks — a constraint the loss itself does not enforce.
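As a rough illustration of the task loss for a single sequence, the sketch below sums the negative log-probabilities of the gold tokens. It assumes the model's next-token logits are already available as a NumPy array; the function name `sequence_nll` and the toy inputs are hypothetical and not taken from the paper.

```python
import numpy as np

def sequence_nll(logits, targets):
    """Sum of -log p(Y_j | X, Y_<j) over the target positions of one sequence.

    logits:  (T, V) array; row j holds the scores for target token j,
             conditioned on the prompt and the previous target tokens.
    targets: (T,) integer array of gold token ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of each gold token and sum over positions.
    return -log_probs[np.arange(len(targets)), targets].sum()

# Toy check: uniform logits over a vocabulary of 4 cost log(4) per token.
uniform_logits = np.zeros((3, 4))
toy_targets = np.array([0, 2, 1])
toy_loss = sequence_nll(uniform_logits, toy_targets)
```

Averaging this quantity over samples from $\mathcal{D}_t$ gives an estimate of the expectation in the formula. Nothing in it refers to earlier tasks, which is why the loss alone cannot prevent forgetting.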

Catastrophic Forgetting vs. CP-MoE (interactive chart; Forgetting Severity slider from None to Severe, shown at 5)

Baseline: 47.05% average performance, 5.24% forgetting
CP-MoE: 50.84% average performance, 0.62% forgetting

Why this matters

Without protection, sequential training produces a model that is excellent at the last task and terrible at everything else. The gap between the red and navy bars in the simulation above is precisely the gap that CP-MoE closes — not by freezing everything, but by selectively protecting what matters.

Chapter 2

The MoE Architecture

Instead of updating every parameter for every task, MoE routes each input through a small subset of lightweight experts.

In plain English

Think of a MoE layer as a consulting firm with eight analysts. When a client brief arrives, a receptionist (the router) reads the first sentence and decides which two analysts are best suited. Only those two do the work; the rest stay idle. This keeps the total computation manageable even as you add more analysts.

Each analyst is a LoRA adapter: a pair of tiny matrices ($A_i$ and $B_i$) that produce a low-rank correction to the frozen backbone. The router is a learned gate that scores every expert with a softmax and keeps the top $m$. The key insight is that sparse routing creates natural partitions in parameter space, so different experts can specialise in different tasks without treading on each other.

Click on each expert in the diagram below to see its role.

The $i$-th LoRA expert computes a low-rank adaptation:

$$E_i(x) = B_i A_i x$$

where $A_i \in \mathbb{R}^{r \times d}$ and $B_i \in \mathbb{R}^{k \times r}$ with rank $r \ll \min(d, k)$. The gating function determines each expert's contribution:

$$G(x) = \text{Softmax}(xW_{\text{gate}})$$

Let $K(x)$ denote the indices of the top-$m$ experts selected by $G(x)$. The adapted FFN output combines the frozen backbone with the sparse expert contributions:

$$\tilde{F}(x) = F(x; \Theta_{\text{FFN}}) + \frac{\alpha}{r} \sum_{i \in K(x)} G(x)_i \, E_i(x)$$

This sparse activation enables input-dependent adaptation while decoupling total expert capacity from per-token computation.
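The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lora_moe_delta`, the stacked-array layout, the parameter name `lora_alpha`, and the toy sizes are all hypothetical. It follows the formula literally, so the gate weights $G(x)_i$ are not renormalised over the selected experts.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lora_moe_delta(x, A, B, W_gate, m, lora_alpha, r):
    """Sparse correction (lora_alpha / r) * sum_{i in K(x)} G(x)_i * B_i A_i x.

    x:      (d,) input token representation
    A:      (n, r, d) stacked down-projections A_i
    B:      (n, k, r) stacked up-projections B_i
    W_gate: (d, n) gating weights
    """
    gate = softmax(x @ W_gate)                 # G(x): one weight per expert
    top = np.argsort(gate)[-m:]                # K(x): indices of the top-m experts
    delta = np.zeros(B.shape[1])
    for i in top:
        delta += gate[i] * (B[i] @ (A[i] @ x))  # G(x)_i * E_i(x)
    return (lora_alpha / r) * delta, top

# Toy sizes (hypothetical): 8 experts, 2 active per token, rank 4.
rng = np.random.default_rng(0)
n, d, k, r, m = 8, 16, 16, 4, 2
x = rng.normal(size=d)
A = rng.normal(size=(n, r, d))
B = rng.normal(size=(n, k, r))
W_gate = rng.normal(size=(d, n))
delta, chosen = lora_moe_delta(x, A, B, W_gate, m=m, lora_alpha=8.0, r=r)
```

The returned `delta` would be added to the frozen FFN output $F(x; \Theta_{\text{FFN}})$. Only `m` of the `n` experts do any matrix work per token, which is the decoupling of capacity from per-token computation mentioned above.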

LoRA-MoE Architecture (interactive diagram)

Why this matters

Standard MoE works well for a single task. The problem is continual learning: when tasks arrive sequentially, the router can send semantically mismatched inputs to historical experts, causing interference. The architecture alone does not solve forgetting — it just creates the partition that makes a solution possible.

Chapter 3

The Transient Expert

Before touching the real experts, CP-MoE sends in a disposable scout to map the terrain.

In plain English

Before a surgeon operates, they order a CT scan. The scan does not fix anything — it just reveals what is inside so the surgeon can plan the incision. The transient expert is that CT scan. It is a tiny LoRA adapter that trains on a small warm-up sample of the new task for a handful of steps, then gets thrown away.

What survives is a map of which parameters moved, and by how much. That map — the "prospective importance mask" — tells the real experts exactly which of their weights to protect during the upcoming update. Parameters the transient expert barely touched? Fair game. Parameters it moved a lot? Those are the ones the new task cares about, and they are the ones most likely to collide with old knowledge.

Drag the learning rate slider below and watch how the warm-up trajectory and importance weights respond.

At the beginning of each task $t$, a transient expert $\phi_t^{\text{TE}} = \{A_t^{\text{TE}}, B_t^{\text{TE}}\}$ is instantiated and trained on a warm-up subset $\hat{\mathcal{D}}_t \subseteq \mathcal{D}_t$ while keeping both $\Theta_{\text{frozen}}$ and $\Phi$ fixed:

$$\phi_{t,s+1}^{\text{TE}} = \phi_{t,s}^{\text{TE}} - \eta \, \nabla_{\phi} \hat{\mathcal{L}}_t^{\text{task}}(\phi_{t,s}^{\text{TE}}), \quad s = 0, \ldots, S-1$$

The transient expert is initialised at zero ($E_t^{\text{TE}}(x) = 0$ at $s=0$) so it starts with no effect. After warm-up, the path-integral rule accumulates prospective importance:

$$\omega_{t,k} = \sum_{s=0}^{S-1} -g_{t,s,k} \, \Delta\phi_{t,s,k}^{\text{TE}}$$

where $g_{t,s,k} = \nabla_\phi \hat{\mathcal{L}}_t^{\text{task}}(\phi_{t,s}^{\text{TE}})_k$ and $\Delta\phi_{t,s,k}^{\text{TE}} = \phi_{t,s+1,k}^{\text{TE}} - \phi_{t,s,k}^{\text{TE}}$. This is then normalised with damping $\xi > 0$:

$$\Omega_{t,k} = \frac{\omega_{t,k}}{(\phi_{t,S,k}^{\text{TE}} - \phi_{t,0,k}^{\text{TE}})^2 + \xi}$$
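The warm-up loop and the two importance formulas can be sketched on a toy problem. The code below is an illustration under simplifying assumptions: the transient expert is treated as one flat parameter vector (the real adapter is a pair of LoRA matrices), the warm-up loss is a hand-made quadratic, and the names `probe_importance`, `grad_fn`, `curvature`, and `target` are hypothetical.

```python
import numpy as np

def probe_importance(grad_fn, phi0, eta, S, xi):
    """Run S plain gradient steps and return (Omega, final parameters).

    grad_fn: function mapping parameters to the warm-up loss gradient
    phi0:    initial transient-expert parameters, flattened
    """
    phi = phi0.copy()
    omega = np.zeros_like(phi)
    for _ in range(S):
        g = grad_fn(phi)
        step = -eta * g              # Delta phi for plain gradient descent
        omega += -g * step           # path-integral term: -g * Delta phi
        phi = phi + step
    total_move = phi - phi0
    Omega = omega / (total_move ** 2 + xi)   # damped normalisation
    return Omega, phi

# Toy quadratic loss 0.5 * sum(curvature * (phi - target)^2).
# The third coordinate has zero curvature, so the loss never pushes on it.
curvature = np.array([1.0, 0.1, 0.0])
target = np.ones(3)
grad_fn = lambda phi: curvature * (phi - target)
Omega, phi_S = probe_importance(grad_fn, np.zeros(3), eta=0.1, S=10, xi=1e-3)
```

With plain gradient descent each term $-g \, \Delta\phi$ equals $\eta g^2$, so $\omega$ is never negative. In this toy run the coordinate that the loss does not touch ends with zero importance.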

Transient Expert Warm-up Trajectory (interactive chart)

Controls: learning rate η (0.01 to 0.50, shown at 0.10); warm-up steps S (3 to 20, shown at 10)
Readouts: final warm-up loss, total importance weight, spectral filter strength

Theorem 1 — Transient expert as spectral filter

Under a local quadratic approximation of the loss, the transient expert's $S$-step trajectory induces a spectrally filtered multi-step adaptation direction: flatter directions of the Hessian accumulate more strongly, while high-curvature directions are attenuated. This means the transient expert captures more than a single gradient — it estimates the direction that remains useful under repeated descent.
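One way to see the filtering effect is a short derivation under assumptions that go beyond the statement above: the warm-up loss is exactly quadratic around the starting point, with gradient $g$ and symmetric positive semi-definite Hessian $H$, and the updates are plain gradient descent with a step size small enough that $0 \le \eta \mu_j \le 1$ for every eigenvalue $\mu_j$ of $H$. The paper's own statement and proof may differ in detail. Writing $\delta_s$ for the displacement after $s$ steps, with $\delta_0 = 0$:

$$\delta_{s+1} = \delta_s - \eta\,(g + H\delta_s) = (I - \eta H)\,\delta_s - \eta\, g \quad\Longrightarrow\quad \delta_S = -\eta \sum_{s=0}^{S-1} (I - \eta H)^s\, g$$

In the eigenbasis $H = \sum_j \mu_j u_j u_j^\top$, each direction is scaled by its own gain:

$$\delta_S = -\sum_j c_S(\mu_j)\,(u_j^\top g)\, u_j, \qquad c_S(\mu) = \eta \sum_{s=0}^{S-1} (1 - \eta\mu)^s = \begin{cases} \eta S, & \mu = 0 \\ \dfrac{1 - (1 - \eta\mu)^S}{\mu}, & \mu > 0 \end{cases}$$

Every term $(1 - \eta\mu)^s$ shrinks as $\mu$ grows, so $c_S(\mu)$ is largest in flat directions and smallest in sharp ones. For example, with $\eta = 0.1$ and $S = 10$, a flat direction ($\mu = 0$) has gain $1.0$, while a direction with $\mu = 10$ has gain $0.1$, no more than a single gradient step would give.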

Chapter 4

Consistency-Preserving Routing

Load balancing keeps experts busy. But busy is not the same as well-matched.

In plain English

Standard MoE routing tries to spread work evenly across experts — like a restaurant manager who assigns tables to waiters in strict rotation, regardless of whether a waiter knows the menu. In continual learning, this backfires: the router shoves a chemistry input into the expert that specialised in poetry, simply because that expert has not been busy lately.

CP-MoE fixes this with a simple idea: measure how similar each expert's internal representations are to the transient expert's, then give a bonus to experts that "speak the same language." The similarity metric is CKA (Centered Kernel Alignment), a standard tool from representation analysis. The bonus is added directly to the routing logits, steering traffic toward experts that are semantically compatible with the incoming task.

Drag the α slider below and watch the routing weights shift — experts with higher CKA similarity (the ones that "get" the new task) receive proportionally more traffic.

The representation-consistency score between the transient expert and stable expert $i$ is measured via CKA:

$$h_i^{\text{CP}} = \text{CKA}(Z^{\text{TE}}, Z_i^{\text{SE}}) = \frac{\|(Z_i^{\text{SE}})^\top Z^{\text{TE}}\|_F^2}{\|(Z^{\text{TE}})^\top Z^{\text{TE}}\|_F \, \|(Z_i^{\text{SE}})^\top Z_i^{\text{SE}}\|_F}$$

This gives a task-dependent prior $h_i^{\text{CP}} \in [0, 1]$ for each stable expert. The biased logit injects the CKA similarity:

$$\tilde{s}_{i,t} = s_{i,t} + \alpha \, h_i^{\text{CP}}$$

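where $\alpha \geq 0$ controls the strength of the CP bias (a different quantity from the LoRA scaling factor $\alpha$ in the adapted FFN output of Chapter 2). The final routing weights, renormalised over the selected expert set $K_t$, are: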

$$g_{i,t} = \frac{\exp(\tilde{s}_{i,t})}{\sum_{j \in K_t} \exp(\tilde{s}_{j,t})}, \quad i \in K_t$$
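The CKA score and the biased routing step can be sketched as follows. Two assumptions are worth flagging. First, the activations are mean-centred per feature before the norms are taken, as in the standard definition of linear CKA; the formula above does not show this step. Second, the top-$m$ set is chosen from the biased logits, which the text does not state explicitly. The names `linear_cka`, `cp_routing`, and `cp_alpha` are hypothetical.

```python
import numpy as np

def linear_cka(Z_te, Z_se):
    """Linear CKA between activation matrices of shape (N, p) and (N, q)."""
    # Centre each feature column so the score ignores mean offsets.
    X = Z_te - Z_te.mean(axis=0, keepdims=True)
    Y = Z_se - Z_se.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def cp_routing(native_logits, h_cp, cp_alpha, m):
    """Add the CP bias to the native logits, keep the top-m, renormalise."""
    biased = native_logits + cp_alpha * h_cp
    top = np.argsort(biased)[-m:]                  # assumed: selection uses biased logits
    w = np.exp(biased[top] - biased[top].max())    # stable softmax over the kept set
    return top, w / w.sum()

rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))       # random orthogonal matrix
rotated_score = linear_cka(Z, Z @ Q)               # rotation leaves linear CKA at 1 up to rounding
unrelated_score = linear_cka(Z, rng.normal(size=(32, 8)))

# With identical native logits, the bias alone decides who gets the traffic.
h_cp = np.array([0.1, 0.9, 0.5, 0.2])
top, weights = cp_routing(np.zeros(4), h_cp, cp_alpha=0.2, m=2)
```

In the toy routing call, experts 1 and 2 have the highest CKA scores, so they are the two selected, and expert 1 receives the larger share.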

The load-balancing auxiliary loss is retained, but computed only from native logits $s_{i,t}$ without the CP bias, keeping the two mechanisms decoupled:

$$\mathcal{L}^{\text{aux}} = \sum_{i=1}^n f_i \, P_i$$
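The text does not define $f_i$ and $P_i$. The sketch below assumes the convention common in sparse MoE work: $f_i$ is the fraction of routing slots assigned to expert $i$, and $P_i$ is the mean router probability for expert $i$ over a batch of tokens. Both are computed from the native logits only, as described above. Treat the definitions and the name `load_balance_aux` as assumptions, not the paper's specification.

```python
import numpy as np

def load_balance_aux(native_logits, m):
    """sum_i f_i * P_i computed from native router logits of shape (T, n)."""
    z = native_logits - native_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # per-token softmax
    top = np.argsort(native_logits, axis=1)[:, -m:]            # (T, m) chosen experts
    n = native_logits.shape[1]
    counts = np.bincount(top.ravel(), minlength=n)
    f = counts / top.size            # assumed: share of routing slots per expert
    P = probs.mean(axis=0)           # assumed: mean router probability per expert
    return float(np.sum(f * P))

# Toy check: with uniform logits every P_i equals 1/n, so the sum is 1/n.
uniform_aux = load_balance_aux(np.zeros((6, 4)), m=2)
```

Because the CP bias never enters this function, increasing the bias strength changes where tokens go without changing what the balancing term penalises.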

Routing With and Without CP Bias (interactive chart)

Controls: CP bias strength α (0 = off, up to 1.0, shown at 0.20)
Readouts: routing entropy, max expert weight, current α

Why this matters

The CP bias resolves a fundamental tension in MoE continual learning: load balancing wants all experts equally busy, but expert specialisation wants inputs routed to the right expert. By grounding the routing bonus in representation similarity rather than usage statistics, CP-MoE achieves both — balanced utilisation and semantically coherent routing.

Chapter 5

Parameter Protection

Not every parameter deserves the same shield. Protection should scale with alignment.

In plain English

Suppose you manage a team of four mechanics. A new type of engine arrives for servicing. You would not lock all four mechanics' toolboxes — some of them need their wrenches to do the new job. Instead, you lock only the toolbox of the mechanic whose existing work overlaps most with the new engine type, because that mechanic's tools are most at risk of being rearranged.

CP-MoE does the same thing. The importance mask from the transient expert identifies which parameters matter. The CKA similarity identifies which expert matters. The regularisation strength for each expert is the product of the two: high similarity and high importance means strong protection. Low similarity? Let that expert adapt freely.

Drag the λ slider below and watch how parameter drift (the vertical distance from the old weight values) shrinks for aligned experts while staying flexible for unrelated ones.

The importance mask $\Omega_t = \{\Omega_{t,A}, \Omega_{t,B}\}$ from the transient expert is accumulated into each stable expert's running importance, weighted by the CKA consistency score:

$$\Omega_{A,\text{total}}^{(i)} \leftarrow \Omega_{A,\text{total}}^{(i)} + h_i^{\text{CP}} \, \Omega_{t,A}, \quad \Omega_{B,\text{total}}^{(i)} \leftarrow \Omega_{B,\text{total}}^{(i)} + h_i^{\text{CP}} \, \Omega_{t,B}$$

The regularisation term penalises drift from the previous task's parameter snapshot:

$$\mathcal{L}^{\text{reg}} = \sum_{i=1}^n \left[\langle \Omega_{A,\text{total}}^{(i)},\, (A_i - A_i^{\text{old}})^{\odot 2} \rangle + \langle \Omega_{B,\text{total}}^{(i)},\, (B_i - B_i^{\text{old}})^{\odot 2} \rangle\right]$$

where $\langle U, V \rangle$ denotes the Frobenius inner product and $(\cdot)^{\odot 2}$ the element-wise (Hadamard) square. Experts with higher CKA alignment accumulate larger importance weights, so their parameters are penalised more heavily for drifting.
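The accumulation rule and the penalty can be sketched together. The code assumes the transient expert has the same matrix shapes as each stable expert, so one mask can be added to every expert's running importance. The function names, the stacked-array layout, and the toy values are hypothetical.

```python
import numpy as np

def accumulate_importance(Omega_total, Omega_t, h_cp):
    """Omega_total[i] += h_cp[i] * Omega_t for every stable expert i.

    Omega_total: (n, ...) running importance, one slice per expert
    Omega_t:     (...) mask from the transient expert
    h_cp:        (n,) CKA consistency scores
    """
    scale = h_cp.reshape((-1,) + (1,) * Omega_t.ndim)   # broadcast over matrix entries
    return Omega_total + scale * Omega_t[None]

def reg_loss(Omega_A, A, A_old, Omega_B, B, B_old):
    """Importance-weighted squared drift, summed over experts and entries."""
    return np.sum(Omega_A * (A - A_old) ** 2) + np.sum(Omega_B * (B - B_old) ** 2)

# Two experts with identical drift; only expert 0 is aligned with the new task.
h_cp = np.array([1.0, 0.0])
Omega_A = accumulate_importance(np.zeros((2, 1, 2)), np.ones((1, 2)), h_cp)
Omega_B = np.zeros((2, 2, 1))
A_old = np.zeros((2, 1, 2))
A_new = A_old + 0.5
B_old = np.zeros((2, 2, 1))
B_new = B_old + 0.5
penalty = reg_loss(Omega_A, A_new, A_old, Omega_B, B_new, B_old)
```

Both experts drift by the same amount here, yet the whole penalty comes from expert 0: its two $A$ entries each contribute $1 \times 0.5^2$, while expert 1, with a CKA score of zero, moves for free.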

Parameter Drift Under CKA-Weighted Protection (interactive chart)

Controls: regularisation strength λ (0 = none, up to 10000, shown at 5000)
Readouts: average parameter drift, most protected expert, least protected expert

Why this matters

Without CKA-weighted gating, the ablation shows forgetting rises from 0.62% to 1.50% and accuracy drops by ~1%. A uniform regularisation scheme cannot distinguish between experts that genuinely need protection and those that should be free to adapt. The alignment signal is what makes the protection surgical rather than blanket.

Chapter 6

The Full Objective

Three loss terms, two knobs, one optimisation. Here is how the pieces snap together.

In plain English

CP-MoE's training objective is a weighted sum of three terms. The first is the standard task loss — "learn the new task well." The second is the regularisation loss — "do not overwrite parameters that previous tasks depend on." The third is the load-balancing loss — "keep all experts reasonably busy so none sits idle."

The balancing act is controlled by two coefficients: $\lambda$ (how aggressively to protect old parameters) and $\gamma$ (how hard to push toward equal expert utilisation). The paper finds that performance is remarkably stable across a wide range of $\lambda$ and $\gamma$ values — as long as the CP bias $\alpha$ is nonzero. The structural inductive bias is what drives performance, not meticulous tuning.

Adjust the sliders below and watch how the loss curves and the performance-forgetting trade-off respond.

For task $t$, the final training objective combines all three components:

$$\mathcal{L}_t^{\text{total}} = \mathcal{L}_t^{\text{task}} + \lambda \, \mathcal{L}_t^{\text{reg}} + \gamma \, \mathcal{L}_t^{\text{aux}}$$

where $\lambda$ and $\gamma$ are balancing coefficients. In the paper's primary configuration: $\alpha = 0.2$, $\lambda = 5 \times 10^3$, $\gamma = 0.1$.

Loss Dynamics and Hyperparameter Trade-offs (interactive chart)

Controls: regularisation λ (0 to 10000, shown at 5000); load balance γ (0 to 1.0, shown at 0.10); CP bias α (0 to 1.0, shown at 0.20)

Why this matters

The hyperparameter sensitivity analysis (Table 6 in the paper) shows that CP-MoE's performance is robust across a wide range of $\lambda$ and of nonzero $\alpha$. The model experiences severe forgetting only when $\alpha \to 0$, confirming that the CKA-guided semantic consistency is the primary driver, not careful tuning of any single coefficient.

Chapter 7

Experimental Results

Two benchmarks, two modalities, one consistent story: CP-MoE forgets less and generalises more.

In plain English

The authors test CP-MoE on two very different challenges. The first is SuperNI, a gauntlet of eight sequential language tasks — summarisation, sentiment analysis, information extraction, dialogue — fed one after another to LLaMA-2-7B. The second is VQA v2, a multimodal benchmark where LLaVA-1.5-7B answers visual questions about images, with ten task types from object recognition to causal reasoning.

On SuperNI, CP-MoE achieves 50.84% average performance, the highest of any method tested, while keeping forgetting to just 0.62%. More strikingly, on seven held-out tasks it was never trained on, it scores 35.80% in zero-shot transfer, beating the best competitor by two full points. On VQA v2, forgetting is essentially eliminated at -0.35%, while performance reaches 62.30%.

Switch between the two benchmarks below to explore per-task scores, and check the ablation to see exactly how much each component contributes.

SuperNI average performance: 50.84% (best of all methods)
SuperNI average forgetting: 0.62% (down from 5.24% baseline)
VQA v2 average forgetting: -0.35% (essentially zero forgetting)

Benchmark Explorer

SuperNI: LLaMA-2-7B · 8 sequential tasks · ROUGE-L. Unimodal language generation benchmark covering summarisation, sentiment, extraction, and dialogue.

VQA v2: LLaVA-1.5-7B · 10 sequential tasks · Accuracy. Multimodal visual reasoning benchmark with tasks from recognition to causal reasoning.

Ablation Study

SuperNI · Component contribution. Isolates the effect of each CP-MoE component: TE-Reg, CP-Bias, and CKA Mask. The full model reaches 50.84% average performance, 0.62% average forgetting, and 35.80% zero-shot transfer.

What the ablation teaches

Removing just the TE regularisation sends forgetting from 0.62% to 5.24% — an eightfold increase. Removing just the CP bias raises forgetting to 1.50%. Both components are necessary, and they complement each other: TE-Reg protects parameters, CP-Bias protects routing. Together they achieve the lowest forgetting and the highest accuracy.
