An Interactive Reading of

Test-Time Training Undermines Safety Guardrails

Simone Antonelli, Sadegh Akhondzadeh & Aleksandar Bojchevski
CISPA Helmholtz · University of Cologne · May 2026 · arXiv:2605.22984

The paper, in plain English

AI companies spend millions aligning their models to refuse harmful requests — building safety guardrails through techniques like RLHF and Constitutional AI. But a new paradigm called test-time training lets anyone modify a model's weights on the fly, during inference, with just a few gradient steps. This paper asks: do those carefully constructed safety guardrails survive when the model's own parameters are being rewritten in real time?

The answer is no. The authors identify three attack scenarios — from simply letting the model adapt on a clean prompt, to feeding it a handful of malicious examples — and show that each one strips safety alignment. Think of it like a bank vault with a time-lock designed to withstand external attack, but the vault itself is being reprogrammed from the inside each time someone opens the door.

The headline numbers are stark: under LoRA adaptation, the few-shot attack achieves 95% jailbreak success across models, and the generation-phase attack reaches 93%. Even a 120-billion-parameter model behind a production API was fully jailbroken for under $2. The paper also shows that naive safety judges are fooled by degenerate outputs, and proposes a perplexity-based detector that catches 100% of attacks with a false positive rate of at most 2%.

I

Self-Supervised Erosion

Even adapting on a clean prompt — no adversarial input at all — degrades safety. The act of fine-tuning at test time is itself enough to weaken alignment.

II

Few-Shot Jailbreak

Supplying just 5 harmful input-output pairs for TTT pushes attack success to 95%. A single example is often enough. These attacks compose with existing jailbreaks.

III

Perplexity Defense

A lightweight provider-side detector compares perplexity on a private harmful holdout before and after TTT, flagging malicious requests with 100% TPR and ≤2% FPR.

Chapter 1

The Setup: What Is TTT?

Large language models are traditionally frozen after training. Test-time training breaks that contract — and with it, the assumptions behind every safety guardrail.

In plain English

Think of a factory that produces locks. The locks are designed and tested once, then shipped. That's how AI models work today: safety training happens once, then the model is frozen. Test-time training (TTT) is like giving each customer a tool to reshape the lock right before they use it — slightly adjusting the pins so the key fits better for their specific door.

For legitimate users, this is great: the model adapts to your specific task and works better. But the paper shows that this reshaping process also weakens the safety mechanisms built into the lock. Even if you're just adjusting for your door, the tamper-proofing gets compromised.

The implications are serious because TTT is becoming mainstream. OpenAI, Google, and others are scaling test-time compute. Drag the sliders in the simulation below to see how gradient steps change a model's parameters.

Let an LLM with vocabulary $\mathcal{V}$ and parameters $\theta$ define a conditional next-token distribution $p_\theta(x_{t+1} \mid x_{1:t})$ for any timestep $t$. A continuation of length $H$ is denoted $y = x_{n+1:n+H}$ with distribution:

Continuation Distribution

$$p_\theta(y \mid x_{1:n}) = \prod_{i=1}^{H} p_\theta(y_i \mid x_{1:n},\, y_{1:i-1})$$
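The factorization above is easiest to see with numbers. The sketch below is illustrative only: the per-step probabilities are made-up values, not model outputs, and `continuation_log_prob` is a hypothetical helper name. It sums per-token log-probabilities, which is the same product computed in log space.

```python
import math

def continuation_log_prob(step_probs):
    """log p(y | x) as the sum of log p(y_i | x, y_<i) over the continuation."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities assigned to each of H = 3 continuation tokens.
step_probs = [0.5, 0.25, 0.8]

log_p = continuation_log_prob(step_probs)
prob = math.exp(log_p)   # product of the three conditionals
nll = -log_p             # negative log-likelihood of the continuation

print(f"p(y|x) = {prob:.3f}, negative log-likelihood = {nll:.3f}")
```

Working in log space avoids numerical underflow when $H$ is large, which is why losses are normally stated as negative log-likelihoods rather than raw probabilities.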

The next-token prediction (NTP) loss on a sequence $z$, optionally conditioned on context $c$, is $L_\text{NTP}(z \mid c;\, \theta) = -\log p_\theta(z \mid c)$. TTT is formulated as an adaptation operator:

TTT Adaptation Operator

$$\theta' = \mathcal{T}(\theta,\, \mathcal{D};\, \lambda)$$

Here $\mathcal{D}$ denotes adaptation data and $\lambda$ represents hyperparameters (learning rate, number of gradient steps, etc.). After adaptation, the model uses $p_{\theta'}$ for generation, then resets to $\theta$.
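A minimal sketch of the adapt, generate, reset cycle may help. It assumes plain gradient descent on a toy two-parameter quadratic loss that stands in for the NTP loss on $\mathcal{D}$; `ttt_adapt`, `grad`, and `target` are hypothetical names, and a real system would update LLM weights (or LoRA adapters) instead.

```python
def ttt_adapt(theta, grad_fn, lr=0.01, steps=3):
    """Return theta' after `steps` gradient steps; the original theta is not modified."""
    theta_prime = list(theta)  # work on a copy so the base model can be restored
    for _ in range(steps):
        g = grad_fn(theta_prime)
        theta_prime = [w - lr * gi for w, gi in zip(theta_prime, g)]
    return theta_prime

# Toy stand-in for the adaptation loss: L(theta) = sum((theta - target)^2).
target = [1.0, -2.0]

def loss(theta):
    return sum((w - t) ** 2 for w, t in zip(theta, target))

def grad(theta):
    return [2.0 * (w - t) for w, t in zip(theta, target)]

theta = [0.0, 0.0]                                   # frozen base parameters
theta_prime = ttt_adapt(theta, grad, lr=0.1, steps=3)

# Generation would use theta_prime; afterwards the provider keeps serving theta.
print(loss(theta), loss(theta_prime))
```

Here the hyperparameters $\lambda$ are just `lr` and `steps`. Nothing in the update rule refers to safety: it only follows the gradient of the adaptation loss.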

TTT Parameter Shift Simulator

Drag the sliders to see how learning rate and gradient steps change model parameters. The chart shows a toy 2D parameter space where each step moves $\theta$ toward lower loss.


Why this matters

Safety alignment is baked into $\theta$ during training via RLHF or DPO. TTT modifies $\theta$ at inference time — and those modifications are not constrained to preserve alignment. Every gradient step is a chance for the safety constraints to slip.

Chapter 2

Three Threat Models

The paper formalizes three ways an attacker can exploit TTT, each corresponding to a real TTT method proposed in the literature.

In plain English

Imagine a hotel with three types of room service. In the first, the hotel adjusts the room based only on what you tell the concierge — you say "I'm cold," and they turn up the heat. That's self-supervised TTT: the model adapts to your prompt alone. In the second, you bring your own instruction manual and hand it to the staff — "set up the room like this." That's few-shot TTT: the attacker supplies examples. In the third, you don't just hand over a manual — you stay in the room and keep adjusting things during your stay. That's generation-phase TTT: adaptation happens while the model is generating.

All three are legitimate techniques that improve model performance. The paper shows they also happen to be powerful attack vectors. The more control the user has over the adaptation process, the more damage they can do.

Click each threat model card below to see how it works.

The key insight is that an attacker controls two axes: a (potentially adversarial) query prompt $\tilde{x} \in \mathcal{A}(x_{1:n})$ and auxiliary adaptation data $\psi \in \Psi$, which directly alter the model's weights via the adaptation operator:

TTT Attack Surface

$$\theta' = \mathcal{T}\!\bigl(\theta,\, (\tilde{x},\, \psi);\, \lambda\bigr)$$

Threat Model Explorer

Click a card to see the loss function and attack mechanism. The diagram updates to show the data flow for each threat model.

Self-Supervised

Adapt on the prompt alone. No external data. Even clean prompts degrade safety.

Few-Shot

Supply harmful input-output pairs. The model learns to comply from just 5 examples.

Generation-Phase

Steer generation via a target prefix. The model learns to begin with "Sure, here is..."

Why this matters

Existing jailbreak research assumes a static model where only the input tokens can be manipulated. TTT hands the attacker a fundamentally new lever: direct influence over the model's parameters during inference. This expands the attack surface from "what you say" to "how the model processes what you say."

Chapter 3

Self-Supervised Erosion

Even without any adversarial input, simply adapting the model on the user's own prompt is enough to weaken safety alignment.

In plain English

Imagine you have a multi-tool — a Swiss Army knife with a blade, scissors, and a bottle opener, all safety-lockable. The manufacturer locks the blade before shipping. Now someone invents a gadget that sharpens every tool on the knife at once every time you use it. The blade gets sharper (great!), but the safety lock also gets looser (bad!).

That's self-supervised TTT. The model minimizes perplexity on the user's prompt — a perfectly legitimate operation that improves fluency. But those same gradient steps also unlearn the safety constraints encoded during alignment. No attacker required.

Use the slider below to dial up the number of gradient steps and watch how even a clean, harmless prompt erodes the safety boundary across different models.

In the self-supervised threat model, the attacker selects a query $\tilde{x}$ (possibly adversarial) and the adaptation minimizes perplexity on it:

Self-Supervised TTT (Equation 1)

$$\theta' \approx \arg\min_\theta \; L_\text{NTP}(\tilde{x};\, \theta)$$

Results are reported as ASR@10, the attack success rate when a prompt counts as jailbroken if at least one of 10 generations succeeds. Self-supervised TTT on the clean prompt $\tilde{x} = x_{1:n}$ increases ASR@10 across smaller models: Gemma 7B (10% → 18%), Llama3 8B (2% → 24%), and Qwen2.5 7B (2% → 38%). However, larger models like Llama3 70B remain near baseline (8%).
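To make Equation 1 concrete, here is a sketch under strong simplifying assumptions: the "model" is a tiny bigram table of logits rather than an LLM, the update is plain gradient descent, and the token ids, learning rate, and step count are arbitrary illustrative choices. It shows the mechanics of minimizing the NTP loss on the prompt alone, not the authors' implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_loss(theta, tokens):
    """Summed NLL of each next token; the logits for x_{t+1} are the row theta[x_t]."""
    return -sum(math.log(softmax(theta[a])[b]) for a, b in zip(tokens, tokens[1:]))

def self_supervised_ttt(theta, prompt, lr=0.5, steps=5):
    """A few gradient steps on the prompt's own NTP loss; returns a new table theta'."""
    theta_prime = [row[:] for row in theta]
    for _ in range(steps):
        grads = [[0.0] * len(row) for row in theta_prime]
        for a, b in zip(prompt, prompt[1:]):
            probs = softmax(theta_prime[a])
            for j, p in enumerate(probs):
                # Gradient of cross-entropy w.r.t. logits: softmax minus one-hot target.
                grads[a][j] += p - (1.0 if j == b else 0.0)
        theta_prime = [[w - lr * g for w, g in zip(row, grow)]
                       for row, grow in zip(theta_prime, grads)]
    return theta_prime

V = 4                                    # hypothetical vocabulary size
theta = [[0.0] * V for _ in range(V)]    # uniform bigram model
prompt = [0, 1, 2, 1, 3]                 # hypothetical token ids for the user's prompt

theta_prime = self_supervised_ttt(theta, prompt)
print(ntp_loss(theta, prompt), ntp_loss(theta_prime, prompt))
```

The only training signal is the prompt itself, which is why this setting needs no attacker-supplied data.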

Self-Supervised ASR@10 by Model

Drag the slider to change the number of TTT gradient steps. The chart shows ASR@10 (%) for each model — live updates as you drag.


Why this matters

This is the most subtle finding in the paper: the act of test-time fine-tuning is itself sufficient to erode safety alignment. No adversarial data, no clever prompt engineering — just the standard TTT update on a clean user prompt. Under LoRA, the average ASR@10 across models jumps from 4% to 17%.

Chapter 4

The Few-Shot Jailbreak

Supplying just five harmful examples for TTT pushes the average attack success rate to 95% under LoRA. A single example is often enough.

In plain English

Think of a driving instructor who was trained to never let students drive on the sidewalk. Now imagine you slip the instructor a manual titled "How to Teach People to Drive on Sidewalks" — just five pages, each showing a student successfully driving on a sidewalk with the instructor's approval. After reading those five examples, the instructor's "never on the sidewalk" rule is compromised.

That's the few-shot threat. The attacker supplies a small support set of harmful input-output pairs. Each pair shows a harmful request followed by the beginning of a compliant response ("Sure, here is how to..."). The model's weights are updated to minimize the loss on these examples, and it generalizes the compliance pattern to the unseen test query.

The chart below shows how attack effectiveness grows with the number of gradient steps and how TTT composes with existing adversarial techniques.

The few-shot threat model minimizes the joint NTP loss over the support set:

Few-Shot TTT (Equation 2)

$$\theta' \approx \arg\min_\theta \sum_{(x^{(i)},\, y^{(i)}) \in \psi} L_\text{NTP}\!\bigl([x^{(i)},\, y^{(i)}];\, \theta\bigr)$$

The support set $\psi = \{(x^{(i)}, y^{(i)})\}_{i=1}^{K}$ contains $K$ input-output pairs where each $y^{(i)}$ is the beginning of a harmful response (not a complete generation). Even with $K=1$, safety alignment is substantially compromised in most configurations.
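The structure of the Equation 2 objective can be sketched as follows. This is a shape-only illustration: the token ids are placeholders, and `uniform_ntp_loss` is a hypothetical stand-in for a real model's NTP loss (it scores every next token as uniformly likely over an assumed vocabulary of size `V`).

```python
import math

def joint_ntp_loss(support_set, ntp_loss):
    """Sum the NTP loss over the concatenation [x, y] of every support pair."""
    return sum(ntp_loss(x + y) for x, y in support_set)

V = 8  # hypothetical vocabulary size

def uniform_ntp_loss(tokens):
    # A uniform model pays log(V) for each of the len(tokens) - 1 next-token predictions.
    return (len(tokens) - 1) * math.log(V)

# K = 2 placeholder pairs: (input token ids, beginning-of-response token ids).
support_set = [
    ([1, 2, 3], [4, 5]),
    ([2, 6], [4, 5, 7]),
]

total = joint_ntp_loss(support_set, uniform_ntp_loss)
print(f"joint loss over K={len(support_set)} pairs: {total:.3f}")
```

Note that each pair is scored as one concatenated sequence, so both the request tokens and the response prefix contribute to the loss, as written in Equation 2.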

Few-Shot Attack: Steps vs. ASR@10

Select LoRA or Full Fine-Tuning mode, then watch how ASR@10 climbs across models as TTT steps increase.

Why this matters

Few-shot TTT doesn't just attack in isolation — it composes with existing adversarial attacks. The paper shows that layering TTT on top of adversarial prompt templates lifts every variant to near 100% ASR@10. The most dramatic jump is at step 1, where TTT on the clean prompt barely moves the needle but combining it with adversarial suffixes reaches 72% immediately.

Chapter 5

The Generation-Phase Attack

By optimizing for a target prefix during generation, the attacker can prime the model to begin with compliance — and continue into harmful content.

In plain English

Imagine a customer service agent trained to politely decline unreasonable requests. Now suppose you hand them a script that begins with "Of course, I'd be happy to help with that!" and train them to memorize just the opening line. Once the agent starts down that path, their training kicks in and they fill in the rest — including the harmful details they were supposed to refuse.

The generation-phase attack works exactly this way. The adversary crafts a short target prefix like "Sure, here is how to synthesize..." and the model adapts to minimize the conditional loss of that prefix given the prompt. This primes the model to start with an affirmative response, bypassing refusal training entirely. Once the model has committed to complying, it continues generating harmful content.

Explore the simulation below to compare all three threat models head-to-head across models.

The generation-phase threat model minimizes the conditional NLL of the target $\psi$ given the prompt $x_{1:n}$:

Generation-Phase TTT (Equation 3)

$$\theta' \approx \arg\min_\theta \; L_\text{NTP}(\psi \mid x_{1:n};\, \theta)$$

The EOS token is masked from the training loss to prevent the model from learning to terminate immediately after the target prefix. The attack is effective because the weight update maximizes the likelihood of an affirmative start, which in turn makes complying with the harmful request the most probable continuation.
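The conditioning and the EOS masking can both be expressed as a per-token loss mask. The sketch below assumes a common convention (mask value 1 means the token contributes to the loss); `build_loss_mask`, `masked_nll`, and the token ids are hypothetical, and the per-token NLL values are placeholders rather than model outputs.

```python
def build_loss_mask(prompt_ids, target_ids, eos_id):
    """Return (input_ids, mask); mask is 1 only on target tokens that are not EOS."""
    input_ids = list(prompt_ids) + list(target_ids)
    mask = [0] * len(prompt_ids) + [0 if t == eos_id else 1 for t in target_ids]
    return input_ids, mask

def masked_nll(token_nlls, mask):
    """Sum per-token NLLs over the unmasked positions only."""
    return sum(nll * m for nll, m in zip(token_nlls, mask))

EOS = 2                                   # hypothetical EOS token id
prompt_ids = [10, 11, 12]                 # conditioning context, never trained on
target_ids = [20, 21, EOS]                # target prefix followed by EOS

input_ids, mask = build_loss_mask(prompt_ids, target_ids, EOS)
loss = masked_nll([1.0] * len(input_ids), mask)   # placeholder NLL of 1.0 per token

print(mask, loss)
```

Zeroing the prompt positions is what makes the loss conditional on $x_{1:n}$, and zeroing the EOS position leaves the model free to keep generating after the prefix.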

All Three Threat Models Head-to-Head

Select a model and adaptation method to compare how each threat model's ASR@10 evolves with TTT steps.


Why this matters

Under LoRA, the generation-phase attack achieves 93% average ASR@10 across models, and individual configurations reach 100%. Even the 70B and 32B models are not safe: Llama3 70B hits 92% and Qwen3 32B reaches 96%. At 100% ASR@10, every prompt is jailbroken in at least one of its 10 generations.

Chapter 6

Transfer to Production APIs

These aren't just lab attacks. The same technique fully jailbreaks a 120B-parameter model behind a real fine-tuning API — for under $2.

The Tinker API exposes LoRA fine-tuning as a service, allowing users to specify standard hyperparameters (learning rate, rank, number of steps) via API calls. The authors directly re-used local LoRA hyperparameters without any API-specific tuning.

Production API Attack Results

Compare attack success across API-deployed models. The chart shows ASR@10 for each threat model at 5 and 10 TTT steps.

Max ASR@10 (GPT-OSS 120B): 100%

Attack Cost: < $2

API-Specific Tuning: None

Why this matters

This demonstrates that the vulnerability is not an artifact of academic settings. It transfers to real-world deployment with no additional optimization. GPT-OSS 120B reaches 100% ASR@10 for few-shot at 10 steps and 98% for the generation-phase attack. Even a 120-billion-parameter model behind a production API can be fully jailbroken via TTT.

Chapter 7

When Judges Get Fooled

TTT can produce degenerate outputs — gibberish that starts with "Sure, here is..." — that standard safety judges classify as successful jailbreaks. The paper fixes this with a validity-aware pipeline.

In plain English

Imagine a plagiarism detector that flags any essay containing the phrase "In conclusion" as plagiarized. A student who writes "In conclusion, blah blah blah gibberish" would be flagged — even though the essay contains no actual stolen content. The detector is fooled by surface-level cues.

That's what happens with TTT-based attacks. Because the model overfits on a single prompt, it sometimes produces degenerate outputs: repetitive text, prompt regurgitation, or gibberish that happens to start with the affirmative prefix learned during adaptation ("Sure, here is..."). Standard safety judges see that opening and say "yep, that's harmful" — even when the actual content is nonsense.

The paper introduces a two-layer fix: first, rule-based filters catch obvious degenerate patterns (repetition, echoing); second, an LLM-based validity judge catches subtler failures. Together, they correct ASR measurements by up to 13 percentage points.
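A rough idea of what the first, rule-based layer might look like is sketched below. The specific heuristics (repeated trigram ratio, word overlap with the prompt) and the thresholds `max_repetition` and `min_overlap` are assumptions chosen for illustration; the text above only says the filters target repetition and echoing.

```python
def repetition_ratio(text, n=3):
    """Fraction of word n-grams that are repeats of an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def echoes_prompt(prompt, response, min_overlap=0.8):
    """True when most response words are copied from the prompt."""
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return True  # treat an empty response as invalid
    copied = sum(w in prompt_words for w in response_words)
    return copied / len(response_words) >= min_overlap

def is_degenerate(prompt, response, max_repetition=0.5):
    return repetition_ratio(response) > max_repetition or echoes_prompt(prompt, response)

looping = "Sure, here is " + "la la la " * 20
normal = "The capital of France is Paris, a city on the Seine."

print(is_degenerate("Tell me something", looping))                 # looping output
print(is_degenerate("What is the capital of France?", normal))     # ordinary answer
```

Cheap rules like these only catch the obvious cases, which is why a second, LLM-based validity judge is still needed for subtler failures.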

The authors augment the JailbreakBench judge benchmark with 50 degenerate generations collected from TTT experiments. All 50 are classified as unsafe by the standard judge despite containing no actual harmful content, dropping judge accuracy from 91% to 78%.
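As a consistency check on those two accuracies (an inference from the reported, rounded percentages, not a figure stated in the text): suppose the original benchmark has $N$ examples, the judge labels $0.91\,N$ of them correctly, and all 50 added generations are misclassified. Then

$$\frac{0.91\,N}{N + 50} = 0.78 \;\Longrightarrow\; 0.13\,N = 39 \;\Longrightarrow\; N = 300$$

so the reported drop matches what one would expect from a benchmark of roughly 300 examples, with $273 / 350 = 0.78$ after augmentation.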

Judge Accuracy With and Without Validity Checks

The chart shows judge accuracy across different validity-checking strategies. Without validity checks, degenerate outputs cause 50 false positives.

Why this matters

Without validity-aware evaluation, TTT attack success rates are inflated by up to 13 percentage points. The truncation-aware LLM validity judge achieves 92.3% accuracy — the highest — while eliminating all 50 invalid false positives. All ASR results in the paper count a generation as a successful jailbreak only if it is both valid and unsafe.
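The "valid and unsafe" rule combines with the at-least-one-of-10 criterion as in the sketch below. The data layout (a list of `(valid, unsafe)` flags per generation) and the function name `asr_at_k` are assumptions for illustration; the flags themselves would come from the validity checks and the safety judge.

```python
def asr_at_k(results, k=10):
    """Fraction of prompts with at least one valid AND unsafe generation among the first k."""
    hits = sum(
        any(valid and unsafe for valid, unsafe in generations[:k])
        for generations in results
    )
    return hits / len(results)

# Four hypothetical prompts, 10 generations each, as (valid, unsafe) flags.
results = [
    [(True, False)] * 9 + [(True, True)],   # one real jailbreak: counts
    [(False, True)] * 10,                   # degenerate outputs judged unsafe: do not count
    [(True, False)] * 10,                   # valid refusals: do not count
    [(True, True)] + [(True, False)] * 9,   # one real jailbreak: counts
]

asr = asr_at_k(results)
naive = sum(any(unsafe for _, unsafe in gens) for gens in results) / len(results)

print(f"validity-aware ASR@10 = {asr:.2f}, naive ASR@10 = {naive:.2f}")
```

The gap between the two numbers in this toy example is the kind of inflation the validity-aware pipeline is meant to remove.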

Chapter 8

The Perplexity Defense

By monitoring how perplexity shifts on a private holdout, providers can detect TTT-based attacks before the response is generated.

In plain English

Imagine a casino that keeps a deck of marked cards hidden behind the counter. When a gambler sits down and starts winning suspiciously, the dealer pulls out the marked deck and asks the gambler to play with it. If the gambler's playing style suddenly changes — they start losing — that confirms they were cheating with their own deck.

The perplexity defense works similarly. The provider maintains a private set of harmful (question, target) pairs that the attacker doesn't know about. Before and after TTT, the provider measures how the model's perplexity on these pairs changes. If the model suddenly finds it much easier to generate harmful content — its perplexity on harmful targets dropped significantly — that's a red flag.

The defense achieves 100% true positive rate (catches every attack) with ≤2% false positive rate (almost never blocks legitimate users). Drag the threshold slider in the simulation below to see how the detection boundary shifts.

The provider maintains two private holdouts: a harmful set $\mathcal{D}_h$ of $N_h=100$ (question, target) pairs from JailbreakBench, and a clean set $\mathcal{D}_c$ of $N_c=100$ questions from GSM8K. For each request, the perplexity change on the harmful holdout is:

Perplexity Shift (Equation 4)

$$\Delta_i^h = \text{PPL}_\theta(t_i \mid q_i) - \text{PPL}_{\theta'}(t_i \mid q_i)$$

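These per-pair shifts are summarized via Cohen's $d$: $d_h = \bar{\Delta}^h / s_{\Delta^h}$, where $\bar{\Delta}^h$ and $s_{\Delta^h}$ are the mean and standard deviation of the shifts over $\mathcal{D}_h$. Positive $d_h$ means harmful-target perplexity dropped after TTT. The request is flagged when $d_h > d_h^*$. The clean-holdout statistic $d_c$ is computed analogously on $\mathcal{D}_c$.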
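The detection rule can be sketched in a few lines. Several things here are assumptions: perplexity is taken as the exponential of the mean token NLL, $s_{\Delta^h}$ is computed as the sample standard deviation, the perplexity values are invented, and the threshold 0.77 simply mirrors the default shown on this page's threshold control. The function names are hypothetical.

```python
import math
import statistics

def perplexity(token_log_probs):
    """exp of the mean negative log-probability of the target tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def cohens_d(ppl_before, ppl_after):
    """Mean perplexity shift (before minus after) divided by its sample std deviation."""
    deltas = [b - a for b, a in zip(ppl_before, ppl_after)]
    spread = statistics.stdev(deltas)
    if spread == 0.0:
        return math.copysign(math.inf, statistics.mean(deltas)) if deltas[0] else 0.0
    return statistics.mean(deltas) / spread

def flag_request(ppl_before, ppl_after, threshold=0.77):
    return cohens_d(ppl_before, ppl_after) > threshold

# Invented perplexities on four private harmful (question, target) pairs.
base = [12.0, 15.0, 9.0, 20.0]
after_attack = [4.0, 5.0, 6.0, 7.0]       # harmful targets became much easier
after_benign = [12.5, 14.0, 9.5, 20.5]    # essentially unchanged

print(flag_request(base, after_attack))   # large positive d_h
print(flag_request(base, after_benign))   # d_h near or below zero
```

Because the statistic is normalized by the spread of the shifts, a consistent drop across the holdout stands out even when absolute perplexities differ a lot between pairs.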

TTT Detection Explorer

Select a model and drag the detection threshold. The scatter plot shows $d_h$ (harmful shift) vs $d_c$ (clean shift) for each request. Red crosses are true attacks; blue circles are benign requests. Points to the right of the threshold line are flagged.

Detection threshold (default): $d_h^* = 0.77$

Why this matters

This is the first concrete defense against TTT-based jailbreaks. It is lightweight (two extra forward passes over the private holdout, one before and one after TTT), requires no model changes, and catches 100% of attacks with ≤2% false positives. However, the defense assumes the attacker doesn't know the private holdout; an adaptive attacker could potentially evade it. The authors advocate for TTT-aware dynamic alignment as the long-term solution.
