Risk Under Pressure: Compute-Aware Adversarial Robustness in Language Models
Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik & Colin Raffel University of Toronto · Vector Institute · Hugging Face · June 2026
The paper, in plain English
When researchers report that an attack achieves "94% success rate" on a language model, they hide something crucial: how much it actually costs. Two attacks can both score 100% success, but one might burn through a few minutes of GPU time while the other requires days of gradient optimisation. That cost difference is the difference between a script-kiddie exploit and a nation-state-level effort. This paper asks: what if we measure safety not just by whether an attack works, but by how much compute the attacker had to spend?
The authors propose risk-compute curves that plot cumulative FLOPs against attack success rate, and two summary metrics: C@τ (the compute needed to reach a given risk threshold) and Average Efficiency (risk gained per TFLOP). Think of it like measuring a bank vault not by whether it can be cracked, but by how many jackhammer-hours it takes. Placing three very different attack strategies — gradient-based GCG, iterative PAIR, and cheap template-based JailBroken — on a shared compute axis reveals patterns that success-rate numbers alone completely miss.
The headline findings are startling. Scaling a model from 0.5B to 7B parameters makes gradient-based attacks 19.7× less efficient per TFLOP, but cheap template attacks only 2.6× less efficient. Safety alignment training can actually make a model more exploitable under certain attacks. And the compute cost to jailbreak a model varies by up to 5× depending on which type of harmful content you're after, meaning aggregate safety scores mask enormous category-level blind spots.
I
Computational Pressure
Standard ASR treats all attacks as equally costly. Measuring cumulative FLOPs reveals orders-of-magnitude differences in the true effort to jailbreak a model.
II
Scaling's Broken Promise
Bigger models resist expensive gradient attacks (20× cost increase) but stay nearly as vulnerable to cheap template attacks — only a 2.8× cost bump.
III
Safety Can Backfire
Dedicated safety-RL training raises aggregate cost but can make specific attack categories more efficiently exploitable. Alignment is not monotonic.
Two bank vaults. Both can be cracked. One takes a locksmith sixty seconds with a stethoscope. The other takes a crew with oxyacetylene torches ten hours. Any security consultant would call these very different risks. But in LLM safety evaluation, both would be scored identically: 100% attack success rate.
The standard metric in LLM safety evaluation is Attack Success Rate (ASR) at a fixed query budget $\lambda$. An attack policy $\pi$ proposes prompts, the target model $M$ responds, and a safety judge $E$ scores each response as safe or unsafe. If the attack succeeds within $\lambda$ queries, that's a success.
The problem: ASR collapses the entire cost structure into a binary outcome. An attack that costs 0.5 TFLOPs per query and one that costs 50 TFLOPs per query are treated as equivalent if they both succeed within the same number of steps. The paper's opening example nails this: two models asked to write a defamatory article. One complies immediately; the other resists nine attempts. Both score 100% ASR — a single metric that erases a 10× difference in adversarial effort.
Equation 2 — Empirical risk: fraction of trials that succeed within the query budget. Captures how often but not at what cost.
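The caption above can be made concrete with a small sketch. It assumes each trial is summarised by the step at which the judge first flags an unsafe response (`None` if that never happens), so that the empirical risk reads roughly as $\hat{R}(M, \pi, \lambda) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[\text{trial } i \text{ succeeds within } \lambda \text{ queries}]$. The function name and data layout are illustrative, not the paper's code.

```python
from typing import Optional, Sequence


def empirical_risk(first_success: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of trials whose first unsafe response arrives within `budget` queries.

    first_success[i] is the 1-indexed step of the first success in trial i,
    or None if the attack never succeeded (an assumed encoding).
    """
    if not first_success:
        raise ValueError("need at least one trial")
    hits = sum(1 for t in first_success if t is not None and t <= budget)
    return hits / len(first_success)


# The opening example: one model complies at once, the other on the tenth attempt.
complies_immediately = [1]
resists_nine_attempts = [10]
print(empirical_risk(complies_immediately, 10))   # 1.0
print(empirical_risk(resists_nine_attempts, 10))  # 1.0
```

Both models score 1.0 at a budget of ten queries, which is exactly the loss of information the section describes.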
Why this matters
Recent defences report near-zero ASR against static attacks, yet adaptive attacks bypass 12 recent defences with >90% ASR (Nasr et al., 2025). The core issue is incomplete cost accounting: all queries are treated as equally expensive, obscuring the true adversarial investment required. A defender's goal should not be perfect robustness but raising the adversarial cost floor high enough to deter realistic threat actors.
The paper's central move: replace "number of queries" with "total floating-point operations" as the measure of adversarial effort. FLOPs are hardware-agnostic, comparable across heterogeneous attack components, and the common quantity from which every operational cost metric (GPU-hours, energy, dollars per breach) can be derived.
The foundation is a simple approximation for the cost of a single forward pass through a transformer:
$$ C_{\text{fwd}} \approx 2\,N\,L $$
Equation 1 — Forward-pass FLOPs. $N$ = parameter count, $L$ = sequence length in tokens. Backward passes are charged at $\approx 2\,C_{\text{fwd}}$.
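As a quick illustration of Equation 1, the sketch below evaluates the approximation for the two model sizes discussed later. The 512-token sequence length is an arbitrary assumption chosen only to produce concrete numbers.

```python
TERA = 1e12


def forward_flops(n_params: float, seq_len: int) -> float:
    """Approximate forward-pass cost: C_fwd ~ 2 * N * L."""
    return 2.0 * n_params * seq_len


def backward_flops(n_params: float, seq_len: int) -> float:
    """Backward pass charged at roughly twice the forward pass."""
    return 2.0 * forward_flops(n_params, seq_len)


SEQ_LEN = 512  # assumed sequence length, not a value from the paper
for n_params in (0.5e9, 7e9):
    tflops = forward_flops(n_params, SEQ_LEN) / TERA
    print(f"{n_params / 1e9:g}B params: {tflops:.3f} TFLOPs per forward pass")
# 0.5B params: 0.512 TFLOPs per forward pass
# 7B params: 7.168 TFLOPs per forward pass
```

Because the cost is linear in both $N$ and $L$, a 14× larger model makes every query 14× more expensive at the same sequence length.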
Computational pressure is the cumulative FLOPs an attack incurs over $\lambda$ refinement steps. For each query budget $\lambda$, the paper averages this cumulative cost across prompts:
Equation 3 — Average cumulative cost at budget $\lambda$. $t_i^*$ is the first-success step for trial $i$ (or $\lambda$ if no success). Early stopping means successful trials cost less than failed ones.
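A minimal sketch of the early-stopping cost described in the caption, assuming a form like $\bar{C}(M, \pi, \lambda) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{t_i^*} c_{i,t}$, where $c_{i,t}$ (a symbol introduced here for illustration) is the FLOPs spent on step $t$ of trial $i$. The per-step costs below are made up.

```python
from typing import Optional, Sequence


def average_cumulative_cost(
    step_costs: Sequence[Sequence[float]],
    first_success: Sequence[Optional[int]],
    budget: int,
) -> float:
    """Mean FLOPs spent per trial up to `budget`, stopping at the first success.

    step_costs[i][t-1] is the cost of step t in trial i (hypothetical layout);
    first_success[i] is the 1-indexed first-success step, or None if none.
    """
    total = 0.0
    for costs, t_star in zip(step_costs, first_success):
        # A failed trial is charged for the whole budget.
        stop = budget if t_star is None else min(t_star, budget)
        total += sum(costs[:stop])
    return total / len(step_costs)


# Two trials at 2 TFLOPs per step: one succeeds at step 3, one never does.
costs = [[2.0] * 10, [2.0] * 10]
print(average_cumulative_cost(costs, [3, None], budget=10))  # 13.0
```

The successful trial contributes 6 TFLOPs and the failed one 20, so the average is 13: failures dominate the bill.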
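The risk-compute curve plots $(x, y) = (\bar{C}(M, \pi, \lambda),\, \hat{R}(M, \pi, \lambda))$ as $\lambda$ varies from 1 to $\lambda_{\max}$. The curve is summarised by two scalar metrics: C@τ, the compute needed to reach a risk threshold τ (infinite if the curve never reaches it), and Average Efficiency: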
Equation 5 — Average Efficiency: normalised expected risk per FLOP. Higher AE means an attack extracts substantial risk even under tight compute constraints.
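The sketch below builds a risk-compute curve from toy data and reads both summary metrics off it. C@τ follows the description in the text (first cumulative cost at which risk reaches τ, infinite otherwise). The Average Efficiency formula here is an assumption, namely the mean of risk divided by cumulative cost across budgets; the paper's exact normalisation may differ. A constant per-step cost is also assumed for simplicity.

```python
import math
from typing import List, Optional, Sequence, Tuple

Curve = List[Tuple[float, float]]  # (average cumulative cost, empirical risk)


def risk_compute_curve(
    first_success: Sequence[Optional[int]], step_cost: float, max_budget: int
) -> Curve:
    """One (cost, risk) point per query budget, with early stopping on success."""
    n = len(first_success)
    curve = []
    for budget in range(1, max_budget + 1):
        risk = sum(1 for t in first_success if t is not None and t <= budget) / n
        steps = sum(budget if t is None else min(t, budget) for t in first_success)
        curve.append((step_cost * steps / n, risk))
    return curve


def compute_at_threshold(curve: Curve, tau: float) -> float:
    """C@tau: the first cumulative cost at which risk reaches tau, else infinity."""
    for cost, risk in curve:
        if risk >= tau:
            return cost
    return math.inf


def average_efficiency(curve: Curve) -> float:
    """Assumed form of AE: mean risk per unit of compute across budgets."""
    return sum(risk / cost for cost, risk in curve) / len(curve)


# Four toy trials at 2 TFLOPs per step: successes at steps 1 and 3, two failures.
curve = risk_compute_curve([1, 3, None, None], step_cost=2.0, max_budget=5)
print(compute_at_threshold(curve, 0.5))   # 5.0
print(compute_at_threshold(curve, 0.75))  # inf
```

The second call returns infinity because only half the trials ever succeed, which is how a result such as C@0.5 = ∞ arises.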
Interactive: FLOPs Calculator
Forward-pass cost as model size and sequence length vary, and why the same "10 queries" means vastly different things for different attacks.
The paper evaluates three attack strategies that span the full cost spectrum: a cheap template scattershot (JailBroken), a guided iterative refiner (PAIR), and a gradient-optimised brute force (GCG). Their per-step costs differ by orders of magnitude.
GCG
Greedy Coordinate Gradient
White-box attack. Computes gradients w.r.t. token embeddings, identifies top-256 substitutions, evaluates 128 candidates via forward passes, and selects the one minimising cross-entropy toward an affirmative prefix.
PAIR
Prompt Automatic Iterative Refinement
Black-box iterative refinement. An attacker LLM (Qwen2.5-7B) rewrites the jailbreak prompt based on the target's prior response and judge verdict. No gradient access needed — just clever prompt engineering.
JB
JailBroken
Template-based attack. Randomly selects from 8 obfuscation strategies (Base64 encoding, role-play framing, developer mode, etc.) applied to the original harmful prompt. Each step costs just one forward pass through the target model, plus the judge's evaluation.
Equations 6–7 — JailBroken and PAIR per-step costs. JB requires only one forward pass through the target model plus judge evaluation. PAIR adds one forward pass through the attacker LLM.
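To make the per-step comparison tangible, here is a rough cost model assembled from the descriptions above and the $2NL$ forward-pass approximation. It is a sketch, not a reproduction of Equations 6 and 7: the judge size, the shared sequence length, and the decision to charge GCG one gradient computation plus 128 candidate forward passes plus one judge call are all assumptions.

```python
def forward_flops(n_params: float, seq_len: int) -> float:
    """Approximate forward-pass cost: 2 * N * L."""
    return 2.0 * n_params * seq_len


def jailbroken_step_flops(target: float, judge: float, seq_len: int) -> float:
    """One target forward pass plus one judge evaluation."""
    return forward_flops(target, seq_len) + forward_flops(judge, seq_len)


def pair_step_flops(target: float, judge: float, attacker: float, seq_len: int) -> float:
    """JailBroken's cost plus one forward pass through the attacker LLM."""
    return jailbroken_step_flops(target, judge, seq_len) + forward_flops(attacker, seq_len)


def gcg_step_flops(target: float, judge: float, seq_len: int, n_candidates: int = 128) -> float:
    """Gradient (forward + backward at 2x forward) plus candidate forward passes."""
    fwd = forward_flops(target, seq_len)
    gradient = fwd + 2.0 * fwd
    return gradient + n_candidates * fwd + forward_flops(judge, seq_len)


TARGET = 7e9    # 7B target model
ATTACKER = 7e9  # Qwen2.5-7B attacker, as described for PAIR
JUDGE = 7e9     # hypothetical judge size
SEQ_LEN = 256   # hypothetical sequence length

jb = jailbroken_step_flops(TARGET, JUDGE, SEQ_LEN)
pair = pair_step_flops(TARGET, JUDGE, ATTACKER, SEQ_LEN)
gcg = gcg_step_flops(TARGET, JUDGE, SEQ_LEN)
print(f"JB {jb / 1e12:.1f}, PAIR {pair / 1e12:.1f}, GCG {gcg / 1e12:.1f} TFLOPs per step")
print(f"One GCG step costs about {gcg / jb:.0f} JailBroken steps")
```

Under these assumptions a single GCG step costs as much as 66 JailBroken steps; the exact ratio shifts with the judge size, but the ordering does not.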
Interactive: Per-Step Attack Cost Comparison
Per-step cost of each attack as model size grows from 0.5B to 70B parameters. GCG's cost climbs steeply while JailBroken's stays comparatively low, so the absolute cost gap widens as models grow.
Why this matters
On a step-count budget of 10 queries, GCG and JailBroken look like peers. On a compute budget, a single GCG step costs as much as dozens of JailBroken steps. This means an attacker with a fixed GPU budget can run far more JailBroken trials than GCG trials, making the cheap attack a far more efficient use of adversarial resources — and a far more realistic threat.
You'd think more safety training makes a model harder to jailbreak. The Tulu3-8B alignment pipeline (Base → SFT → DPO → RLVR) tells a different story: robustness peaks at SFT, then degrades with each additional alignment stage.
∞
C@0.5 for Tulu3-SFT under GCG — the 50% risk threshold is never reached
521
TFLOPs: C@0.5 for Tulu3-DPO under GCG, collapsing from ∞ at the SFT stage to 521
0.90
ASR for Tulu3-RLVR under JailBroken, 40 percentage points higher than SFT's 0.50
Interactive: Training Stage Risk-Compute Curves
Risk-compute curves across the four training stages of Tulu3-8B, one panel per attack strategy. SFT's curve stays low while RLVR's rises steeply.
GCG
Gradient-based · White-box
Most expensive per step. SFT completely blocks it (C@0.5 = ∞).
PAIR
Iterative refinement · Black-box
Attacker LLM rewrites prompts. Also blocked by SFT.
JailBroken
Template-based · Cheapest
Random obfuscation. Even SFT only raises C@0.5 to 52.4 TFLOPs.
Why this matters
The non-monotone trajectory has a clear mechanism. DPO overfits to fixed preference data with limited adversarial coverage (Xiao et al., 2025; Lin et al., 2024). RLVR's binary rewards can inadvertently deprioritise calibrated refusals (Lambert et al., 2025). More alignment is not always better — and the compute axis reveals precisely where and how the regression happens.
Scale up a model and it gets smarter. Does it also get safer? Yes — but only against expensive attacks. Against cheap ones, scaling barely moves the needle. The 0.5B-to-7B journey tells two completely different stories depending on which attack you use.
20×
Increase in C@0.5 under GCG when scaling from 0.5B to 7B (20.0 → 399.7 TFLOPs)
2.8×
Increase in C@0.5 under JailBroken for the same scaling (8.2 → 22.8 TFLOPs)
18×
Ratio of JB-to-GCG per-TFLOP exploitability at 7B — a vulnerability gap ASR hides
Interactive: Scaling Effect on Attack Cost
C@0.5, Average Efficiency, and ASR as model size grows from 0.5B to 7B. GCG's cost rises steeply; JailBroken's barely budges.
C@0.5
Compute to 50% risk
Higher = more TFLOPs needed to breach. GCG rises 20×; JB only 2.8×.
Avg. Efficiency
Risk per TFLOP
Lower = less exploitable. GCG drops 19.7×; JB only 2.6×.
ASR @ λ=10
Standard metric
Both attacks stay above 73%. ASR barely distinguishes them.
Why this matters
Scaling from 0.5B to 7B provides strong protection against compute-intensive attacks like GCG, while leaving the model nearly as vulnerable to low-cost attacks like JailBroken. A model that looks robust under GCG may be a sitting duck under JailBroken — and you'd never know from ASR alone.
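The headline ratios follow directly from the reported C@0.5 values. The short check below uses only the four numbers quoted in this section; the dictionary layout is just for illustration.

```python
# C@0.5 in TFLOPs at 0.5B and 7B parameters, as quoted above.
c_at_half = {
    "GCG": (20.0, 399.7),
    "JailBroken": (8.2, 22.8),
}

# Cost multiplier from scaling the model, per attack.
ratios = {name: large / small for name, (small, large) in c_at_half.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x more compute to reach 50% risk")
# GCG: 20.0x more compute to reach 50% risk
# JailBroken: 2.8x more compute to reach 50% risk

# How much more scaling helps against the expensive attack than the cheap one.
print(f"{ratios['GCG'] / ratios['JailBroken']:.1f}x")  # 7.2x
```

Scaling therefore buys roughly seven times more protection against GCG than against JailBroken, a gap that a shared ASR figure cannot show.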
White-box attacks like GCG require model weights. Closed-weight models like GPT-4 or Claude are safe from them — or so the thinking goes. The paper shows that an attacker can optimise a suffix on a tiny open-weight surrogate and transfer it to a much larger target at a fraction of the cost.
Interactive: Surrogate-to-Target Transfer
Risk-compute curves for the small surrogate model (0.5B) and the transfer target (8B). The target's curve plateaus early: additional compute doesn't help.
Why this matters
An attacker need not interact directly with the target model. Optimisation can be performed entirely on a cheap surrogate, with the resulting attack deployed against the target at only a fraction of the original cost. The risk-compute curves reveal that the ceiling is governed by suffix quality and target robustness, not additional compute — a ceiling that fixed-budget ASR may miss entirely.
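A back-of-the-envelope sketch of why transfer is cheap, using the $2NL$ forward-pass approximation and a GCG step modelled as one gradient computation plus 128 candidate forward passes. The step count and sequence length are hypothetical, judge costs are ignored, and the sketch says nothing about whether the transferred suffix actually succeeds; the paper's own accounting may differ.

```python
def forward_flops(n_params: float, seq_len: int) -> float:
    """Approximate forward-pass cost: 2 * N * L."""
    return 2.0 * n_params * seq_len


def gcg_step_flops(n_params: float, seq_len: int, n_candidates: int = 128) -> float:
    """Forward + backward (2x forward) for the gradient, plus candidate forwards."""
    fwd = forward_flops(n_params, seq_len)
    return 3.0 * fwd + n_candidates * fwd


def transfer_cost(surrogate: float, target: float, seq_len: int, n_steps: int) -> float:
    """Optimise on the surrogate, then spend one forward pass on the target."""
    return n_steps * gcg_step_flops(surrogate, seq_len) + forward_flops(target, seq_len)


def direct_cost(target: float, seq_len: int, n_steps: int) -> float:
    """Run the same number of GCG steps on the target itself (needs its weights)."""
    return n_steps * gcg_step_flops(target, seq_len)


SURROGATE, TARGET = 0.5e9, 8e9  # model sizes from the text
SEQ_LEN, N_STEPS = 256, 100     # hypothetical values

saving = direct_cost(TARGET, SEQ_LEN, N_STEPS) / transfer_cost(SURROGATE, TARGET, SEQ_LEN, N_STEPS)
print(f"Transfer is about {saving:.1f}x cheaper than attacking the target directly")
```

With a 16× size gap between surrogate and target, the saving approaches 16× as the number of optimisation steps grows.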
Qwen3-4B-SafeRL is specifically trained for safety using reinforcement learning on adversarial prompts. You'd expect it to be strictly more robust than the base Qwen3-4B. Under JailBroken and PAIR, it mostly is. Under GCG — it's worse.
Interactive: Base vs. Safety-Aligned Model
Risk-compute curves for Qwen3-4B (base) and Qwen3-4B-SafeRL under each attack. Under GCG, the base model is strictly more robust.
GCG
Gradient-based · Safety backfires
Base: C@0.5 = ∞. SafeRL: 189 TFLOPs. Safety training made it easier to breach.
PAIR
Iterative · Mixed results
Curves cross: base is stronger at low compute, SafeRL pulls ahead at higher budgets.
JailBroken
Template · Safety helps
SafeRL curve lies strictly below base. Modest but consistent improvement.
Why this matters
This asymmetry reflects a training–distribution mismatch. SafeRL is trained on natural-language adversarial prompts, while GCG discovers token sequences outside that distribution through gradient optimisation. Safety training that isn't adversarially comprehensive can create new attack surfaces even as it closes old ones. More safety data does not guarantee more safety.
Aggregate safety scores lump all harm categories together. The paper shows that the compute cost to breach a model varies by up to 5× across categories — and that safety training can actually increase exploitability in specific areas.
Interactive: Category-Level Vulnerability
C@0.5 and Average Efficiency by harm category under JailBroken. Categories are ordered by SafeRL's performance, showing where safety training helps and where it backfires.
C@0.5
TFLOPs to 50% risk
Higher = more robust. Spans a ~5× range across categories.
Avg. Efficiency
Risk per TFLOP
Lower = safer. Safety-RL can increase AE in some categories.
Why this matters
Safety training datasets are heavily skewed across harm types — some categories get over 3× more coverage than others (Xie et al., 2025). This produces models that are robust to well-represented categories and vulnerable to underrepresented ones. Aggregate safety metrics mask significant heterogeneity in category-level robustness, and defenders who rely on a single score may be unaware of critical blind spots.
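The masking effect is easy to reproduce on toy data. The sketch below computes C@0.5 per harm category and for the pooled set; the category names, trial outcomes, and constant 1 TFLOP step cost are invented for illustration and are not results from the paper.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def c_at_tau(first_success: Sequence[Optional[int]], step_cost: float,
             max_budget: int, tau: float = 0.5) -> float:
    """Average cumulative cost at the first budget where risk reaches tau."""
    n = len(first_success)
    for budget in range(1, max_budget + 1):
        risk = sum(1 for t in first_success if t is not None and t <= budget) / n
        if risk >= tau:
            steps = sum(budget if t is None else min(t, budget) for t in first_success)
            return step_cost * steps / n
    return math.inf


def per_category_c_at_tau(trials: Sequence[Tuple[str, Optional[int]]], step_cost: float,
                          max_budget: int, tau: float = 0.5) -> Dict[str, float]:
    """Group (category, first-success step) pairs and compute C@tau per category."""
    by_category: Dict[str, List[Optional[int]]] = defaultdict(list)
    for category, t_star in trials:
        by_category[category].append(t_star)
    return {c: c_at_tau(ts, step_cost, max_budget, tau) for c, ts in by_category.items()}


# Hypothetical trials: (category, first-success step or None).
trials = [("category_a", 1), ("category_a", 2),
          ("category_b", 6), ("category_b", None), ("category_b", 4), ("category_b", 9)]

print(per_category_c_at_tau(trials, step_cost=1.0, max_budget=10))
# {'category_a': 1.0, 'category_b': 5.5}
pooled = c_at_tau([t for _, t in trials], step_cost=1.0, max_budget=10)
print(round(pooled, 2))  # 3.17
```

The pooled figure sits between the two categories and reveals neither the weak one nor the strong one, which is the blind spot the section warns about.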
The central lesson of this paper is that how you measure safety determines what you see. When every attack is measured in steps, all attacks look similar. When they are measured in FLOPs, the landscape fractures: expensive attacks are deterred by scale, cheap ones aren't; safety training helps against some attacks but backfires against others; and the most dangerous categories are the ones your training data under-represents. The path forward is not just better models, but better measurement.