An Interactive Reading of

Risk Under Pressure:
Compute-Aware Adversarial
Robustness in Language Models

Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik & Colin Raffel
University of Toronto · Vector Institute · Hugging Face · June 2026

The paper, in plain English

When researchers report that an attack achieves "94% success rate" on a language model, they hide something crucial: how much it actually costs. Two attacks can both score 100% success, but one might burn through a few minutes of GPU time while the other requires days of gradient optimisation. That cost difference is the difference between a script-kiddie exploit and a nation-state-level effort. This paper asks: what if we measure safety not just by whether an attack works, but by how much compute the attacker had to spend?

The authors propose risk-compute curves that plot cumulative FLOPs against attack success rate, and two summary metrics: C@τ (the compute needed to reach a given risk threshold) and Average Efficiency (risk gained per TFLOP). Think of it like measuring a bank vault not by whether it can be cracked, but by how many jackhammer-hours it takes. Placing three very different attack strategies — gradient-based GCG, iterative PAIR, and cheap template-based JailBroken — on a shared compute axis reveals patterns that success-rate numbers alone completely miss.

The headline findings are startling. Scaling a model from 0.5B to 7B parameters makes gradient-based attacks 19.7× less efficient per TFLOP — but cheap template attacks only 2.6× less efficient. Safety alignment training can actually make a model more exploitable under certain attacks. And the compute cost to jailbreak a model varies by up to 5× depending on which type of harmful content you're after, meaning aggregate safety scores mask enormous category-level blind spots.

I

Computational Pressure

Standard ASR treats all attacks as equally costly. Measuring cumulative FLOPs reveals orders-of-magnitude differences in the true effort to jailbreak a model.

II

Scaling's Broken Promise

Bigger models resist expensive gradient attacks (20× cost increase) but stay nearly as vulnerable to cheap template attacks — only a 2.8× cost bump.

III

Safety Can Backfire

Dedicated safety-RL training raises aggregate cost but can make specific attack categories more efficiently exploitable. Alignment is not monotonic.

~ 25 minutes · 8 chapters · 7 interactive simulations

Chapter 1

What's an Attack Worth?

Two bank vaults. Both can be cracked. One takes a locksmith sixty seconds with a stethoscope. The other takes a crew with oxyacetylene torches ten hours. Any security consultant would call these very different risks. But in LLM safety evaluation, both would be scored identically: 100% attack success rate.

In plain English

Imagine you're a security manager at a hotel. You test two door locks: both fail if you try long enough. One gives way after 2 gentle picks; the other requires 10 hours of drilling. Reporting "both locks: 100% fail rate" tells your boss nothing useful. What she needs to know is how long each lock actually slows down an intruder.

That's exactly the gap in LLM safety today. "Attack X achieves Y% success after Z queries" answers how often an attack works, but not at what cost. Two models can both score 100% ASR, yet one might cost an attacker 10× more compute to break. In real security, cost is the variable that determines whether an attack is actually feasible.

The paper's core argument: a vulnerability is only operationally relevant if it's exploitable at a cost justified by its value to the attacker. Drag the slider in the simulation below to see how misleading a single ASR number can be.

The standard metric in LLM safety evaluation is Attack Success Rate (ASR) at a fixed query budget $\lambda$. An attack policy $\pi$ proposes prompts, the target model $M$ responds, and a safety judge $E$ scores each response as safe or unsafe. If the attack succeeds within $\lambda$ queries, that's a success.

The problem: ASR collapses the entire cost structure into a binary outcome. An attack that costs 0.5 TFLOPs per query and one that costs 50 TFLOPs per query are treated as equivalent if they both succeed within the same number of steps. The paper's opening example nails this: two models asked to write a defamatory article. One complies immediately; the other resists nine attempts. Both score 100% ASR — a single metric that erases a 10× difference in adversarial effort.

$$ \hat{R}(M, \pi, \lambda) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\text{trial}_i \text{ succeeds within } \lambda \text{ steps}] $$

Equation 2 — Empirical risk: fraction of trials that succeed within the query budget. Captures how often but not at what cost.
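Equation 2 can be sketched in a few lines of Python. The function name and the convention of recording a failed trial as `None` are assumptions for illustration, not the paper's code. The two hypothetical trials mirror the opening example: one model complies at step 1, the other only at step 10.

```python
from typing import Optional, Sequence


def empirical_risk(first_success: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of trials whose first successful step falls within the budget.

    first_success[i] is the step at which trial i first succeeded,
    or None if it never succeeded (an illustrative convention).
    """
    hits = sum(1 for t in first_success if t is not None and t <= budget)
    return hits / len(first_success)


# Hypothetical trials: immediate compliance vs. success on the tenth attempt.
trials = [1, 10]
print(empirical_risk(trials, budget=10))  # 1.0
print(empirical_risk(trials, budget=1))   # 0.5
```

At a budget of 10 both trials count as successes, so the 10× difference in effort disappears from the metric.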

Why this matters

Recent defences report near-zero ASR against static attacks, yet adaptive attacks bypass 12 recent defences with >90% ASR (Nasr et al., 2025). The core issue is incomplete cost accounting: all queries are treated as equally expensive, obscuring the true adversarial investment required. A defender's goal should not be perfect robustness but raising the adversarial cost floor high enough to deter realistic threat actors.

Next: how the paper measures what an attack actually costs →

Chapter 2

Computational Pressure

The paper's central move: replace "number of queries" with "total floating-point operations" as the measure of adversarial effort. FLOPs are hardware-agnostic, comparable across heterogeneous attack components, and the common quantity from which every operational cost metric (GPU-hours, energy, dollars per breach) is derived.

In plain English

Think of FLOPs like the kilowatt-hours on your electricity bill. Whether you're running a space heater, a gaming PC, or charging an EV, the meter reads in the same units. That makes it possible to compare wildly different appliances on a shared scale.

The paper does the same for AI attacks. A gradient-based attack uses backward passes through the model (expensive). A template-based attack just reformats the prompt (cheap). An iterative attack calls a second AI model (medium). By converting everything to FLOPs, the authors put all three on the same axis — like comparing a blowtorch, a lockpick, and a crowbar by their energy consumption rather than by whether the door eventually opens.

Drag the sliders below to see how forward-pass cost scales with model size and sequence length, and why FLOPs reveal what step counts hide.

The foundation is a simple approximation for the cost of a single forward pass through a transformer:

$$ C_{\text{fwd}} \approx 2\,N\,L $$

Equation 1 — Forward-pass FLOPs. $N$ = parameter count, $L$ = sequence length in tokens. Backward passes are charged at $\approx 2\,C_{\text{fwd}}$.
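As a quick check on Equation 1, here is a minimal sketch of the approximation. The function names are illustrative, and the default backward ratio of 2 is the value quoted in the caption above.

```python
def forward_flops(n_params: float, seq_len: int) -> float:
    """Equation 1: approximate FLOPs for one forward pass, 2 * N * L."""
    return 2.0 * n_params * seq_len


def backward_flops(n_params: float, seq_len: int, ratio: float = 2.0) -> float:
    """Backward pass charged as a multiple of the forward pass (about 2x)."""
    return ratio * forward_flops(n_params, seq_len)


# An 8B-parameter model processing a 200-token sequence.
fwd = forward_flops(8e9, 200)
bwd = backward_flops(8e9, 200)
print(fwd / 1e12)  # 3.2 (TFLOPs)
print(bwd / 1e12)  # 6.4 (TFLOPs)
```

Because the cost is linear in both arguments, doubling either the parameter count or the sequence length doubles the bill.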

Computational pressure is the cumulative FLOPs an attack incurs over its refinement steps. For each query budget $\lambda$, the paper measures the cumulative FLOPs consumed up to that budget, averaged across prompts:

$$ \bar{C}(M, \pi, \lambda) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{\min(\lambda,\, t_i^*)} c_\pi(M, t) $$

Equation 3 — Average cumulative cost at budget $\lambda$. $t_i^*$ is the first-success step for trial $i$ (or $\lambda$ if no success). Early-stopping means successful trials cost less than failed ones.
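A minimal sketch of Equation 3 follows, under one simplifying assumption that is not from the paper: every step costs the same fixed amount, so $c_\pi(M, t)$ is a constant. Failed trials are recorded as `None` and charged for the full budget.

```python
from typing import Optional, Sequence


def average_cumulative_cost(
    first_success: Sequence[Optional[int]],
    budget: int,
    step_cost: float,
) -> float:
    """Mean FLOPs per prompt up to `budget` steps, with early stopping.

    Assumes a constant per-step cost (a simplification for illustration).
    """
    total = 0.0
    for t_star in first_success:
        # A trial stops at its first success, or runs the whole budget.
        steps = budget if t_star is None else min(budget, t_star)
        total += steps * step_cost
    return total / len(first_success)


# Hypothetical trials: success at step 1 and at step 10, 0.5 TFLOPs per step.
print(average_cumulative_cost([1, 10], budget=10, step_cost=0.5))  # 2.75
```

The first trial pays for one step and the second for ten, so the average reflects the effort that a bare success count hides.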

The risk-compute curve plots $(x, y) = (\bar{C}(M, \pi, \lambda),\, \hat{R}(M, \pi, \lambda))$ as $\lambda$ varies from 1 to $\lambda_{\max}$. The curve is summarised by two scalar metrics:

$$ C@\tau(M, \pi) = \min_\lambda \left\{ \bar{C}(M, \pi, \lambda) : \hat{R}(M, \pi, \lambda) \geq \tau \right\} $$

Equation 4 — Compute to $\tau$%-risk: the smallest average cumulative FLOPs at which the attack first reaches the risk threshold. Higher = more robust; if the threshold is never reached within $\lambda_{\max}$, $C@\tau = \infty$.
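Given the points of a risk-compute curve, Equation 4 reduces to a filtered minimum. The sketch below assumes the curve is supplied as two parallel lists (one entry per budget $\lambda$) and returns infinity when the threshold is never reached. The function name and the sample numbers are hypothetical.

```python
import math
from typing import Sequence


def compute_at_threshold(
    costs: Sequence[float], risks: Sequence[float], tau: float
) -> float:
    """Smallest average cumulative cost at which risk reaches tau."""
    reached = [c for c, r in zip(costs, risks) if r >= tau]
    return min(reached) if reached else math.inf


# Hypothetical curve: cost in TFLOPs and risk at budgets 1, 2, 3.
costs = [1.0, 2.0, 3.0]
risks = [0.2, 0.6, 0.8]
print(compute_at_threshold(costs, risks, tau=0.5))  # 2.0
print(compute_at_threshold(costs, risks, tau=0.9))  # inf
```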

$$ \text{AE} = \frac{\text{CAURC}}{\bar{C}_{\max}} \qquad \text{where } \text{CAURC} = \int_1^{\lambda_{\max}} \hat{R}\, d\bar{C} $$

Equation 5 — Average Efficiency: normalised expected risk per FLOP. Higher AE means an attack extracts substantial risk even under tight compute constraints.
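Equation 5 can be approximated numerically once the curve is sampled at each budget. The sketch below uses the trapezoidal rule, which is an assumption: the paper's exact discretisation and normalisation may differ. It follows the equation as stated here, dividing the area under the curve by the largest cost on the curve.

```python
from typing import Sequence


def average_efficiency(costs: Sequence[float], risks: Sequence[float]) -> float:
    """Area under the risk-compute curve divided by the maximum cost.

    `costs` must be sorted in ascending order, one point per budget.
    The trapezoidal rule is an illustrative choice of integration scheme.
    """
    caurc = 0.0
    for i in range(1, len(costs)):
        width = costs[i] - costs[i - 1]
        caurc += 0.5 * (risks[i] + risks[i - 1]) * width
    return caurc / costs[-1]


# Hypothetical curve sampled at budgets 1, 2, 3.
ae = average_efficiency([1.0, 2.0, 3.0], [0.2, 0.6, 0.8])
print(round(ae, 4))
```

A curve that climbs early encloses more area for the same maximum cost, so it scores a higher value than one that climbs late.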

Interactive: FLOPs Calculator

Adjust model size and sequence length. Watch how forward-pass cost scales — and why the same "10 queries" means vastly different things for different attacks. Live updates as you drag.

Model parameters (N): 8.0B (slider range 0.5B to 70B)

Sequence length (L): 200 tokens (slider range 50 to 4,000)

Forward pass: 3.20 TFLOPs

Backward pass (~2× forward): 6.40 TFLOPs

Next: the three attack families and their price tags →

Chapter 3

Three Weapons, Three Price Tags

The paper evaluates three attack strategies that span the full cost spectrum: a cheap template scattershot (JailBroken), a guided iterative refiner (PAIR), and a gradient-optimised brute force (GCG). Their per-step costs differ by orders of magnitude.

In plain English

Imagine three ways to break into a building. JailBroken is like trying every key on a janitor's keychain — cheap, fast, and occasionally one just works. PAIR is like hiring a locksmith who watches which way the tumblers move and adjusts — more skilled, but costs extra labour. GCG is like X-raying the lock, computing exactly which pins to push, and fabricating a custom key — devastatingly effective but enormously expensive in equipment and expertise.

Counted in steps, all three get the same budget: up to 10 attempts. Counted in compute, they're not even close. A single GCG step costs roughly 50× more FLOPs than a JailBroken step on the same model, because GCG requires 128 candidate evaluations plus a full backward pass through the model's parameters.

The simulation below lets you compare the per-step cost of each attack across model sizes. Notice how the gap widens as models grow — GCG's cost scales steeply with parameter count, while JailBroken's barely changes.

GCG

Greedy Coordinate Gradient

White-box attack. Computes gradients w.r.t. token embeddings, identifies top-256 substitutions, evaluates 128 candidates via forward passes, and selects the one minimising cross-entropy toward an affirmative prefix.

PAIR

Prompt Automatic Iterative Refinement

Black-box iterative refinement. An attacker LLM (Qwen2.5-7B) rewrites the jailbreak prompt based on the target's prior response and judge verdict. No gradient access needed — just clever prompt engineering.

JB

JailBroken

Template-based attack. Randomly selects from 8 obfuscation strategies (Base64 encoding, role-play framing, developer mode, etc.) applied to the original harmful prompt. Each step is just one forward pass through the target model, plus a judge evaluation.

$$ c_{\text{GCG}}(M) = \underbrace{(128 + \beta_{\text{bwd}}) \cdot 2 N_M L_{\text{opt}}}_{\text{candidates + gradient}} + \underbrace{2 N_M L_{\text{gen}}}_{\text{generation}} + \underbrace{2 N_J L_J}_{\text{judge fwd}} $$

Equation 8 — GCG per-step cost. $\beta_{\text{bwd}} = 3$ (2:1 backward-to-forward + 50% overhead). Dominated by the 131 effective forward passes over the optimisation sequence: 128 candidate evaluations plus the gradient computation, charged as 3.

$$ c_{\text{PAIR}}(M) = c_{\text{JB}}(M) + \underbrace{2 N_A L_A}_{\text{attacker fwd}} \qquad\qquad c_{\text{JB}}(M) = \underbrace{2 N_M L_{\text{gen}}}_{\text{target fwd}} + \underbrace{2 N_J L_J}_{\text{judge fwd}} $$

Equations 6–7 — JailBroken and PAIR per-step costs. JB requires only one forward pass through the target model plus judge evaluation. PAIR adds one forward pass through the attacker LLM.
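Equations 6 to 8 translate directly into code. In the sketch below the function names, the judge size, and all sequence lengths (200 tokens each) are assumptions chosen for illustration. Only the 128 candidates, $\beta_{\text{bwd}} = 3$, and the 7B attacker size come from the text above.

```python
def fwd(n_params: float, seq_len: int) -> float:
    """Equation 1: forward-pass FLOPs, 2 * N * L."""
    return 2.0 * n_params * seq_len


def cost_jb(n_m: float, l_gen: int, n_j: float, l_j: int) -> float:
    """Equation 6: one target forward pass plus one judge forward pass."""
    return fwd(n_m, l_gen) + fwd(n_j, l_j)


def cost_pair(n_m, l_gen, n_j, l_j, n_a: float, l_a: int) -> float:
    """Equation 7: JailBroken cost plus one attacker-LLM forward pass."""
    return cost_jb(n_m, l_gen, n_j, l_j) + fwd(n_a, l_a)


def cost_gcg(n_m, l_opt: int, l_gen, n_j, l_j,
             candidates: int = 128, beta_bwd: float = 3.0) -> float:
    """Equation 8: candidate evaluations and gradient, then generation and judge."""
    return (candidates + beta_bwd) * fwd(n_m, l_opt) + fwd(n_m, l_gen) + fwd(n_j, l_j)


# Assumed setup: 8B target, 8B judge (hypothetical), 7B attacker, 200-token sequences.
N_M, N_J, N_A, L = 8e9, 8e9, 7e9, 200
jb = cost_jb(N_M, L, N_J, L)
pair = cost_pair(N_M, L, N_J, L, N_A, L)
gcg = cost_gcg(N_M, L, L, N_J, L)
print(f"JB   {jb / 1e12:.1f} TFLOPs")    # JB   6.4 TFLOPs
print(f"PAIR {pair / 1e12:.1f} TFLOPs")  # PAIR 9.2 TFLOPs
print(f"GCG  {gcg / 1e12:.1f} TFLOPs")   # GCG  425.6 TFLOPs
```

The exact ratio between GCG and JailBroken depends on the assumed sequence lengths and judge size, but the ordering holds whenever the optimisation term dominates.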

Interactive: Per-Step Attack Cost Comparison

Drag the model size slider. Watch how GCG's cost scales steeply with parameters while JailBroken stays flat. The cost gap widens as models grow. Live updates as you drag.

Target model size: 8B parameters (slider range 0.5B to 70B)

Why this matters

On a step-count budget of 10 queries, GCG and JailBroken look like peers. On a compute budget, a single GCG step costs as much as dozens of JailBroken steps. This means an attacker with a fixed GPU budget can run far more JailBroken trials than GCG trials, making the cheap attack a far more efficient use of adversarial resources — and a far more realistic threat.

Next: how training stage reshapes the cost landscape →

Chapter 4

The Alignment Paradox

You'd think more safety training makes a model harder to jailbreak. The Tulu3-8B alignment pipeline tells a different story. Base → SFT → DPO → RLVR. Robustness peaks at SFT — then degrades with each additional alignment stage.

In plain English

Imagine a martial arts school that trains students in four stages: raw beginner, basic self-defence, competitive sparring, and championship-level competition. You'd expect each stage to produce a better fighter. But what if the self-defence graduates actually resist attacks better than the competition champions?

That's what happens with Tulu3. After supervised fine-tuning (SFT), the model is remarkably robust — GCG and PAIR can't even reach 50% risk within budget. But after further training with DPO (Direct Preference Optimization) and RLVR (Reinforcement Learning with Verifiable Rewards), the model actually gets easier to jailbreak under certain attacks. The final deployed checkpoint is the weakest of the aligned models.

The compute-aware view reveals the full depth: JailBroken's per-TFLOP exploitability at RLVR is 1.8× that of DPO and 2.1× that of SFT. Standard ASR shows the same trend, but the compute-cost collapse quantifies exactly how much protection was lost. Drag the attack selector below to see each stage's risk-compute curve.

∞

C@0.5 for Tulu3-SFT under GCG — the 50% risk threshold is never reached

521

TFLOPs for Tulu3-DPO under GCG — C@0.5 has collapsed from ∞ at the SFT stage to 521

0.90

ASR for Tulu3-RLVR under JailBroken — 40 percentage points higher than SFT's 0.50

Interactive: Training Stage Risk-Compute Curves

Select an attack strategy to see how risk-compute curves shift across the four training stages of Tulu3-8B. Notice how SFT's curve stays low while RLVR's rises steeply. Click a card to switch attacks.

GCG

Gradient-based · White-box

Most expensive per step. SFT completely blocks it (C@0.5 = ∞).

PAIR

Iterative refinement · Black-box

Attacker LLM rewrites prompts. Also blocked by SFT.

JailBroken

Template-based · Cheapest

Random obfuscation. Even SFT only raises C@0.5 to 52.4 TFLOPs.

Why this matters

The non-monotone trajectory has a clear mechanism. DPO overfits to fixed preference data with limited adversarial coverage (Xiao et al., 2025; Lin et al., 2024). RLVR's binary rewards can inadvertently deprioritise calibrated refusals (Lambert et al., 2025). More alignment is not always better — and the compute axis reveals precisely where and how the regression happens.

Next: why bigger isn't always safer →

Chapter 5

Scaling's Broken Promise

Scale up a model and it gets smarter. Does it also get safer? Yes — but only against expensive attacks. Against cheap ones, scaling barely moves the needle. The 0.5B-to-7B journey tells two completely different stories depending on which attack you use.

In plain English

Think of a castle. Making it bigger — higher walls, thicker gates — stops the army with siege engines. But a spy in a servant's outfit still walks through the front gate. Scaling the castle defends against the expensive attack but does almost nothing about the cheap one.

The numbers are stark. Scaling Qwen2.5 from 0.5B to 7B parameters makes GCG 19.7× less efficient per TFLOP — the attacker has to spend massively more compute for the same risk. But JailBroken's per-TFLOP exploitability drops only 2.6×. At 7B, the model is still 18× more exploitable per TFLOP under JailBroken than under GCG.

Standard ASR barely captures this divergence. Both attacks score near 100% at 0.5B, and both stay above 73% at 7B. The real story is in the cost axis: scaling increases the attacker's bill dramatically for GCG but barely for JailBroken.

20×

Increase in C@0.5 under GCG when scaling from 0.5B to 7B (20.0 → 399.7 TFLOPs)

2.8×

Increase in C@0.5 under JailBroken for the same scaling (8.2 → 22.8 TFLOPs)

18×

Ratio of JB-to-GCG per-TFLOP exploitability at 7B — a vulnerability gap ASR hides
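The fold-increases quoted above follow from the C@0.5 values by simple division. The snippet below reproduces that arithmetic. The dictionary layout and helper name are illustrative, while the four numbers are the ones reported in the cards.

```python
# C@0.5 in TFLOPs, as quoted above for Qwen2.5 at 0.5B and 7B.
c50_gcg = {"0.5B": 20.0, "7B": 399.7}
c50_jb = {"0.5B": 8.2, "7B": 22.8}


def fold_increase(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


gcg_fold = fold_increase(c50_gcg["0.5B"], c50_gcg["7B"])
jb_fold = fold_increase(c50_jb["0.5B"], c50_jb["7B"])
print(f"GCG: {gcg_fold:.1f}x, JailBroken: {jb_fold:.1f}x")
# GCG: 20.0x, JailBroken: 2.8x
```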

Interactive: Scaling Effect on Attack Cost

See how C@0.5 and Average Efficiency change as model size grows from 0.5B to 7B. GCG's cost rises steeply; JailBroken's barely budges. Click a metric tab to switch views.

C@0.5

Compute to 50% risk

Higher = more TFLOPs needed to breach. GCG rises 20×; JB only 2.8×.

Avg. Efficiency

Risk per TFLOP

Lower = less exploitable. GCG drops 19.7×; JB only 2.6×.

ASR @ λ=10

Standard metric

Both attacks stay above 73%. ASR barely distinguishes them.

Why this matters

Scaling from 0.5B to 7B provides strong protection against compute-intensive attacks like GCG, while leaving the model nearly as vulnerable to low-cost attacks like JailBroken. A model that looks robust under GCG may be a sitting duck under JailBroken — and you'd never know from ASR alone.

Next: how attackers cut costs by outsourcing to a small model →

Chapter 6

The Surrogate Shortcut

White-box attacks like GCG require model weights. Closed-weight models like GPT-4 or Claude are safe from them — or so the thinking goes. The paper shows that an attacker can optimise a suffix on a tiny open-weight surrogate and transfer it to a much larger target at a fraction of the cost.

In plain English

You want to pick the lock on a high-security vault, but you can't get near it. So you buy the same brand's cheap padlock from a hardware store, practise your technique on that, and then walk up to the vault with a key you already know works. The padlock isn't identical to the vault, but the underlying mechanism is similar enough that your practised key has a decent chance.

That's the transfer attack. The paper optimises GCG suffixes on Qwen2.5-0.5B (tiny, open-weight) and then applies those suffixes directly to Qwen3-8B (16× larger, treated as a proxy for a closed model). The attack costs a fraction of what direct optimisation on the target would require, yet still achieves ASR = 0.15 — not catastrophic, but non-trivial for a budget operation.

Crucially, risk rises quickly over the first few inference steps and then plateaus. The ceiling is set by suffix quality and target robustness, not by additional compute — a pattern invisible to single-point ASR metrics.
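To see roughly where the saving comes from, the sketch below compares only the optimisation term of the GCG cost formula (Equation 8) on the surrogate and on the target. The step count and sequence length are hypothetical, the parameter counts are the nominal model sizes, and the paper's actual transfer accounting may include terms omitted here.

```python
def gcg_optimisation_flops(
    n_params: float,
    l_opt: int,
    steps: int,
    candidates: int = 128,
    beta_bwd: float = 3.0,
) -> float:
    """Optimisation term of Equation 8, summed over `steps` steps."""
    return steps * (candidates + beta_bwd) * 2.0 * n_params * l_opt


# Assumed: 10 optimisation steps over a 200-token sequence.
surrogate = gcg_optimisation_flops(0.5e9, l_opt=200, steps=10)
direct = gcg_optimisation_flops(8e9, l_opt=200, steps=10)
print(direct / surrogate)  # 16.0
```

Because the term is linear in parameter count, optimising on a model 16× smaller cuts this part of the bill by the same factor. The attacker still pays for inference on the target when the suffix is deployed.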

Interactive: Surrogate-to-Target Transfer

See how risk-compute curves compare between the small surrogate model (0.5B) and the transfer target (8B). The target's curve plateaus early — additional compute doesn't help. Click to explore.

Why this matters

An attacker need not interact directly with the target model. Optimisation can be performed entirely on a cheap surrogate, with the resulting attack deployed against the target at only a fraction of the original cost. The risk-compute curves reveal that the ceiling is governed by suffix quality and target robustness, not additional compute — a ceiling that fixed-budget ASR may miss entirely.

Next: when safety training makes things worse →

Chapter 7

When Safety Backfires

Qwen3-4B-SafeRL is specifically trained for safety using reinforcement learning on adversarial prompts. You'd expect it to be strictly more robust than the base Qwen3-4B. Under JailBroken and PAIR, it mostly is. Under GCG — it's worse.

In plain English

You install a fancy alarm system that's been tested against burglars who pick locks and break windows. It works great against those attacks. But then someone discovers they can just short-circuit the alarm's own wiring — a technique the alarm was never trained to handle — and the fancy system actually makes the short-circuit easier because the alarm's circuits provide a convenient access point.

That's the SafeRL story. The model is RL-trained on natural-language adversarial prompts, learning to refuse them. But GCG uses gradient optimisation to discover token sequences that bypass safeguards at the logit level — often outside the learned distribution. The safety training didn't prepare the model for this attack surface; in fact, the training distribution mismatch makes the model more exploitable.

The numbers: base Qwen3-4B achieves C@0.5 = ∞ under GCG (never breached at 50%). SafeRL drops to 189 TFLOPs. Average Efficiency more than doubles, from $0.9 \times 10^{-3}$ to $2.1 \times 10^{-3}$ risk/TFLOP.

Interactive: Base vs. Safety-Aligned Model

Compare risk-compute curves for Qwen3-4B (base) and Qwen3-4B-SafeRL under each attack. Under GCG, the base model is strictly superior. Click a card to switch attacks.

GCG

Gradient-based · Safety backfires

Base: C@0.5 = ∞. SafeRL: 189 TFLOPs. Safety training made it easier to breach.

PAIR

Iterative · Mixed results

Curves cross: base is stronger at low compute, SafeRL pulls ahead at higher budgets.

JailBroken

Template · Safety helps

SafeRL curve lies strictly below base. Modest but consistent improvement.

Why this matters

This asymmetry reflects a training–distribution mismatch. SafeRL is trained on natural-language adversarial prompts, while GCG discovers token sequences outside that distribution through gradient optimisation. Safety training that isn't adversarially comprehensive can create new attack surfaces even as it closes old ones. More safety data does not guarantee more safety.

Next: which harms are cheapest to exploit →

Chapter 8

Uneven Shields

Aggregate safety scores lump all harm categories together. The paper shows that the compute cost to breach a model varies by up to 5× across categories — and that safety training can actually increase exploitability in specific areas.

In plain English

Imagine a house with six doors. You reinforce three of them with deadbolts. The other three still have the original flimsy locks. A burglar who checks all six doors will find the weak ones every time — and your overall "door security score" of 50% reinforced tells you nothing about which doors are the soft targets.

The paper finds exactly this pattern. The safety-RL training behind Qwen3-4B-SafeRL provides strong protection against harassment and misinformation prompts — the "deadbolted doors." But for cybercrime, chemical weapons, and illegal activities, the training actually increases per-TFLOP exploitability. An attacker with a sustained compute budget can extract harmful outputs more efficiently from the aligned model than from the base model in these categories.

The 5× variation across categories likely reflects imbalances in safety training data: some categories receive over 3× more coverage than others. The interactive chart below lets you explore category-level costs for both models.
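A category-level breakdown amounts to computing C@0.5 separately for each group of prompts. The sketch below assumes a constant per-step cost and records failed trials as `None`. The category labels, trial outcomes, and function names are all hypothetical illustrations rather than the paper's data or code.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def c_at_tau(first_success: List[Optional[int]], step_cost: float,
             max_budget: int, tau: float = 0.5) -> float:
    """Average cumulative cost at the first budget where risk reaches tau."""
    n = len(first_success)
    for budget in range(1, max_budget + 1):
        hits = sum(1 for t in first_success if t is not None and t <= budget)
        if hits / n >= tau:
            # Early stopping: successful trials pay only up to their success step.
            steps = [budget if t is None else min(budget, t) for t in first_success]
            return step_cost * sum(steps) / n
    return math.inf


def per_category(trials: List[Tuple[str, Optional[int]]], step_cost: float,
                 max_budget: int, tau: float = 0.5) -> Dict[str, float]:
    """Group (category, first-success step) pairs and score each group."""
    groups: Dict[str, List[Optional[int]]] = defaultdict(list)
    for category, t_star in trials:
        groups[category].append(t_star)
    return {c: c_at_tau(ts, step_cost, max_budget, tau) for c, ts in groups.items()}


# Hypothetical outcomes: one well-defended category, one weak one.
trials = [("harassment", None), ("harassment", 9),
          ("cybercrime", 1), ("cybercrime", 2)]
result = per_category(trials, step_cost=1.0, max_budget=10)
print(result)  # {'harassment': 9.0, 'cybercrime': 1.0}
```

Pooling these four trials would report a single figure and hide the 9× gap between the two groups.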

Interactive: Category-Level Vulnerability

C@0.5 and Average Efficiency by harm category under JailBroken. Categories ordered by SafeRL's performance. Notice where safety training helps — and where it backfires. Click a metric to toggle.

C@0.5

TFLOPs to 50% risk

Higher = more robust. Spans a ~5× range across categories.

Avg. Efficiency

Risk per TFLOP

Lower = safer. Safety-RL can increase AE in some categories.

Why this matters

Safety training datasets are heavily skewed across harm types — some categories get over 3× more coverage than others (Xie et al., 2025). This produces models that are robust to well-represented categories and vulnerable to underrepresented ones. Aggregate safety metrics mask significant heterogeneity in category-level robustness, and defenders who rely on a single score may be unaware of critical blind spots.

The central lesson of this paper is that how you measure safety determines what you see. When every attack is measured in steps, all attacks look similar. When measured in FLOPs, the landscape fractures: expensive attacks become deterred by scale, cheap ones don't; safety training helps against some attacks but backfires against others; and the most dangerous categories are the ones your training data under-represents. The path forward is not just better models, but better measurement.

Read the original

arXiv:2606.11409 · arxiv.org/abs/2606.11409 · Code: r-three/risk-under-pressure
