An Interactive Reading of

Measuring Maximum Activations in Open Large Language Models

Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen,
Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin
Shanghai Jiao Tong University · Baidu Inc. · Nankai University · May 2026 · arXiv:2605.15572

The paper, in plain English

When you run a large language model, some of its internal numbers — called activations — can get staggeringly large. If you're trying to compress the model to run on cheaper hardware using fewer bits (quantization), those extreme values become a problem: the quantization scale has to stretch to cover the outliers, wasting precision on the vast majority of normal values. This paper asks a simple but unanswered question: across today's open-source LLMs, just how large do activations get?

The authors run 24 checkpoints from 8 families (including Qwen, Gemma, Ling, and GPT-OSS) through an identical measurement pipeline, with the same 5,000 input texts and the same hooks capturing every layer's activations, and measure the global maximum $M = \max |a|$. Think of it as an MRI scan for each model's numerical nervous system. What they find is startling: at similar parameter counts, the worst-case activation varies by nearly four orders of magnitude depending on which model family you pick. One Qwen3.5 model peaks around 132; Gemma3-27B-it reaches 696,320.

The headline result is that maximum activation is a model property tied to family, architecture, and training, not just size. In two matched pairs, mixture-of-experts (MoE) models have peaks 14–23× lower than their dense counterparts. Instruction tuning mainly compresses late-layer peaks. And a quick INT-8 quantization check confirms that higher peaks translate directly into worse reconstruction quality. The paper argues that $M$ should be reported on every model card, right alongside parameter count and benchmark scores.

I

Four Orders of Magnitude

Global maximum activations range from ~100 (Qwen3.5) to ~700K (Gemma3-27B-it) at comparable model sizes — a range no one had systematically measured.

II

MoE Suppresses Peaks

In two matched pairs, mixture-of-experts checkpoints show 14–23× lower peak activations than dense models at similar total parameter counts, a strong architectural effect.

III

The Residual Stream Carries the Extremes

In 22 of 24 models, the global maximum lives in the residual stream (hidden states), not in attention or MLP outputs, so quantization policies should target this component.

Chapter 1

The Activation Landscape

Twenty-four checkpoints, eight families, one measurement pipeline. The result is a landscape where peak activations vary by nearly four orders of magnitude — and parameter count alone can't predict where any model lands.

In plain English

Imagine measuring the loudest sound each of 24 different speakers can produce when fed the same playlist. Some speakers — like Qwen3.5 models — max out at a conversational volume (around 100). Others — like Gemma3-27B-it — hit the equivalent of a rock concert: 696,320. Same inputs, wildly different peaks.

What makes this surprising is that the speakers have roughly similar power ratings (parameter counts). The difference comes down to design choices — how the manufacturer wired the circuits, which training recipes they used, whether they use sparse routing. Peak activation is a property of the speaker design, not just the wattage.

In the bar chart below, the vertical axis is logarithmic: each gridline represents a 10× jump. Notice how the same family (Qwen) spans from 132 to 35,328.

The primary metric throughout this paper is $M = \max |a|$: the single largest absolute activation observed across all hooked components (embeddings, hidden states, attention outputs, MLP/MoE outputs, SwiGLU gates, final LayerNorm) and all layers during a forward pass over the 5,000-sample evaluation corpus.

$$M = \max_{\text{component},\, \text{layer},\, \text{sample}} |a|$$
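As a rough illustration of the definition above, the sketch below takes the maximum absolute value over every component, layer, sample, and coordinate, and also records where it occurred. The nested-dictionary layout, the function name `global_max_activation`, and the toy numbers are all assumptions made for this example. A real pipeline would collect these values with forward hooks on the model rather than from plain Python lists.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: activations[component][layer] is a list of per-sample
# activation vectors. Plain floats keep the example dependency-free.
Activations = Dict[str, List[List[List[float]]]]


def global_max_activation(activations: Activations) -> Tuple[float, str, int]:
    """Return (M, component, layer index), where M = max |a| over everything."""
    best = (0.0, "", -1)
    for component, layers in activations.items():
        for layer_idx, samples in enumerate(layers):
            for vector in samples:
                for a in vector:
                    if abs(a) > best[0]:
                        best = (abs(a), component, layer_idx)
    return best


# Toy data (not from the paper): two components, two layers, two samples each.
toy = {
    "hidden_states": [[[0.5, -3.0], [1.2, 0.1]], [[-41.0, 2.0], [0.3, 7.5]]],
    "attn_out": [[[0.2, 0.4], [-0.9, 0.1]], [[1.1, -1.3], [0.0, 0.6]]],
}

M, where, layer = global_max_activation(toy)
print(M, where, layer)
```

Returning the location alongside the value makes it easy to ask which component carries the peak, which is the question Chapter 2 takes up.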

Global Maximum Activations Across 24 Checkpoints

Figure: log-scale bar chart of the global maximum $M$ for each checkpoint, grouped by family. The × markers indicate models failing the Sun criterion (defined in Chapter 3).

Why this matters

Cross-family variation in $M$ is much larger than within-family scaling. Qwen3.5 and Gemma3 differ by roughly 5,000× at similar parameter counts. For anyone deploying quantized inference, this means the model family — not just the parameter count — determines whether your 8-bit pipeline will struggle.
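The "roughly 5,000×" and "nearly four orders of magnitude" figures can be checked from the two peak values quoted earlier in the text. This is a small arithmetic sketch; the variable names are illustrative only.

```python
import math

m_low = 132        # Qwen3.5 peak quoted in the text
m_high = 696_320   # Gemma3-27B-it peak quoted in the text

ratio = m_high / m_low       # how many times larger the high peak is
orders = math.log10(ratio)   # the same gap expressed in orders of magnitude

print(f"ratio = {ratio:.0f}x, orders of magnitude = {orders:.2f}")
```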

Next: Where do these peaks form inside the network?


Chapter 2

Where Maximum Activations Form

The peak doesn't live in one universal layer. It accumulates through two distinct depth patterns — a sudden jump-and-plateau in some families, and a gradual build in others. The pattern is a fingerprint of the model family.

The layerwise trajectory describes how peak activation magnitude changes with network depth. The paper identifies two broad patterns. The jump-and-plateau pattern (Qwen2.5, GPT-OSS, Ling) shows a sharp magnitude rise in early or middle layers, followed by a sustained high plateau. The gradual-accumulation pattern (Qwen3.5, Gemma) shows a smoother increase that often peaks in later layers.
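One way to make the two patterns concrete is to summarize a per-layer peak curve by two numbers: the normalized depth of its maximum and its largest layer-to-layer ratio. The sketch below is a hypothetical heuristic, not the paper's procedure. The function names, the `jump_threshold` cutoff of 10×, and the toy curves are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def describe_trajectory(layer_peaks: List[float]) -> Tuple[float, float]:
    """Return (normalized depth of the peak layer, largest layer-to-layer ratio).

    Assumes at least two layers and strictly positive per-layer peaks.
    """
    if len(layer_peaks) < 2:
        raise ValueError("need at least two layers")
    peak_idx = max(range(len(layer_peaks)), key=lambda i: layer_peaks[i])
    depth = peak_idx / (len(layer_peaks) - 1)  # 0 = first layer, 1 = last
    biggest_jump = max(b / a for a, b in zip(layer_peaks, layer_peaks[1:]))
    return depth, biggest_jump


def classify(layer_peaks: List[float], jump_threshold: float = 10.0) -> str:
    """Label a curve by its sharpest single-layer rise (hypothetical cutoff)."""
    _, jump = describe_trajectory(layer_peaks)
    return "jump-and-plateau" if jump >= jump_threshold else "gradual-accumulation"


# Toy curves (not measured values): one sharp rise, one smooth build.
jumpy = [2.0, 3.0, 900.0, 950.0, 940.0, 930.0]
gradual = [2.0, 5.0, 12.0, 30.0, 70.0, 160.0]

print(classify(jumpy), describe_trajectory(jumpy))
print(classify(gradual), describe_trajectory(gradual))
```

A single ratio threshold is crude. It is only meant to show that the two shapes differ in how concentrated the growth is, not to reproduce the paper's family assignments.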

Layerwise Peak Trajectories by Family

Figure: layerwise peak trajectories, one curve per model family. The x-axis is normalized depth (0 = first layer, 1 = last).

Why this matters

Peak depth has no universal location. Even within the same family, maxima can occur in shallow, middle, or deep layers. The residual stream carries the global maximum in 22/24 checkpoints — meaning quantization and scaling policies should inspect hidden-state peaks rather than only attention or MLP outputs.

Next: What counts as a "massive" activation?


Chapter 3

From Binary to Continuous

Prior work asked "does this model have massive activations?" — a yes-or-no question. This paper replaces that binary flag with a continuous measurement, and shows the two views can disagree.

In plain English

Previous researchers defined a "massive activation" as a value that is simultaneously very large in absolute terms (above 100) and locally extreme (1,000× larger than the median of that same token's other values). It's like saying a skyscraper is "truly tall" only if it's both over 100 meters and at least 1,000 times taller than the surrounding buildings.

Four models in this study fail that test. But here's the twist: some failing models are actually easy to quantize (Qwen3.5-0.8B peaks at just 122), while some passing models are nightmares for 8-bit inference (Gemma3-27B-it hits 696,320). The binary test misses the deployment-relevant question: how large is the peak, regardless of its local neighborhood?


The Sun criterion (from Sun et al., 2024) defines a massive activation coordinate $x_i$ as one satisfying both $|x_i| > 100$ and a local ratio condition:

$$\frac{|x_i|}{\operatorname{median}_{j=1}^{d} |x_j|} > 1000$$

A model passes if any hidden layer contains at least one token-feature coordinate satisfying both thresholds. Four checkpoints fail: Qwen2.5-1.5B (ratio ≈ 574, below 1000×), and three Qwen3.5 models whose activation scale is systematically suppressed.
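The two-part test can be sketched for a single token's hidden vector as follows. This is a minimal illustration of the criterion as stated above, not the authors' code. The function name `massive_coordinates`, the handling of a zero median, and the toy vectors are assumptions.

```python
from statistics import median
from typing import List, Sequence


def massive_coordinates(
    x: Sequence[float],
    abs_threshold: float = 100.0,
    ratio_threshold: float = 1000.0,
) -> List[int]:
    """Indices i with |x_i| > abs_threshold and |x_i| / median_j |x_j| > ratio_threshold."""
    mags = [abs(v) for v in x]
    med = median(mags)
    hits = []
    for i, m in enumerate(mags):
        if m <= abs_threshold:
            continue
        # Assumption: a zero median is treated as an unbounded ratio.
        if med == 0.0 or m / med > ratio_threshold:
            hits.append(i)
    return hits


# Toy token vectors (not from the paper).
large_and_isolated = [0.1, -0.2, 0.15, 400.0, 0.05]  # ratio 400 / 0.15, about 2667
large_not_isolated = [1.0, -2.0, 1.5, 500.0, 0.5]    # ratio 500 / 1.5, about 333
isolated_not_large = [0.01, 0.02, 50.0, 0.015, 0.03] # magnitude 50 is below 100

print(massive_coordinates(large_and_isolated))
print(massive_coordinates(large_not_isolated))
print(massive_coordinates(isolated_not_large))
```

The second vector shows how a coordinate can be large in absolute terms and still fail the test, because the ratio condition depends on the rest of the token's values.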

Activation Magnitude vs. Local Ratio Explorer

Figure: each point is a model's representative-layer peak, plotted by absolute magnitude and local ratio. At the default thresholds ($|x_i| > 100$ and local ratio $> 1000\times$), 20 models pass and 4 fail.

Why this matters

The binary massive-activation criterion doesn't capture deployment risk. Qwen3.5-9B fails the Sun test but peaks at 956 — easy to quantize. Gemma3-4B-it passes but peaks at 245,760. The continuous metric $M = \max|a|$ is what actually matters for choosing your quantization scale.

Next: Does bigger model mean bigger activations?


Chapter 4

The Scaling Question

Does a bigger model always produce larger activations? Within a single family, often yes, though not always: Gemma2-9B has lower peaks than Gemma2-2B. Across families, the relationship collapses.

Figure 5 in the paper compares checkpoints of different sizes within the same family. Most families (Qwen2.5, Qwen3.5, Gemma3) show a stable within-family scale effect: $M$ increases with parameter count. Gemma2 is the main non-monotonic exception, with 9B peaking below 2B.

Within-Family Scaling of Global Activation Maxima

Figure: global maximum activation versus model size, one curve per family.

Why this matters

For inference and quantization, family identity and generation are at least as important as parameter count. Assuming bigger model = bigger activations is dangerous. A 9B Gemma2 is easier to quantize than a 2B Gemma2, while a 7B Qwen2.5 has peaks 14× larger than a 9B Qwen3.5.

Next: How does MoE architecture change the picture?


Chapter 5

MoE Versus Dense

Mixture-of-experts models activate fewer parameters per token, and they also produce dramatically lower peak activations. Two matched pairs show 14–23× reductions.

The paper compares two matched MoE-vs-dense pairs at the ~30B scale. For Qwen3, the MoE checkpoint (30B-A3B) has $M = 1{,}512$ versus $M = 35{,}328$ for the dense 32B — a $23.4\times$ reduction. For Qwen3.5, the MoE (35B-A3B) has $M = 132$ versus $M = 1{,}546$ for the dense 27B — a $14.0\times$ reduction (using the global $M$ across all components, not the representative-layer Top-1 from Table 1).

$$\frac{M_{\text{dense}}}{M_{\text{MoE}}} \in [14.0,\; 23.4]$$
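To see why the ratio matters for quantization, the sketch below reproduces the Qwen3 ratio from the two quoted peaks and converts each peak into an INT-8 step size. It assumes per-tensor symmetric quantization with max-abs scaling, where the step is $M / 127$. The variable names are illustrative.

```python
dense_m = 35_328   # Qwen3-32B (dense) peak quoted in the text
moe_m = 1_512      # Qwen3-30B-A3B (MoE) peak quoted in the text

ratio = dense_m / moe_m

# Under symmetric max-abs INT-8 scaling, the largest value maps to code 127,
# so the spacing between representable values is M / 127.
step_dense = dense_m / 127
step_moe = moe_m / 127

print(f"reduction ratio = {ratio:.1f}x")
print(f"INT-8 step: dense = {step_dense:.1f}, MoE = {step_moe:.1f}")
```

Because the step size is proportional to $M$, a 23.4× lower peak means a 23.4× finer grid for every other value in the same tensor under this scaling scheme.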

MoE vs. Dense at Matched Scale

Figure: MoE and dense peaks for two matched pairs at the ~30B scale, Qwen3 (30B-A3B vs 32B) and Qwen3.5 (35B-A3B vs 27B). For the Qwen3 pair, the dense peak is $M = 35{,}328$, the MoE peak is $M = 1{,}512$, and the reduction ratio is 23.4×.

Caveat

This contrast rests on only $n = 2$ matched pairs, both from Qwen. The gap is large and consistent, but with so small a sample it is observational rather than causal: training recipe differences between MoE and dense checkpoints could contribute.

Next: How do activations evolve across model generations?


Chapter 6

Generational Evolution

Newer isn't always better for activation range. Qwen follows an inverted-V trajectory across generations; Gemma shows a sharp increase from Gemma2 to Gemma3.

Figure 7 compares generational trends at similar model sizes. Qwen exhibits an inverted-V: $M$ increases from Qwen2.5 to Qwen3, then drops sharply in Qwen3.5. Gemma shows a monotonic increase from Gemma2 to Gemma3 in both matched size groups. These cross-generation trends are highly family-specific and break any assumption of monotonic scaling with release time.

Matched-Scale Generational Trends

Figure: matched-scale comparison of Qwen and Gemma generations, with points within each family connected to show the trajectory. The scale groups are:

~1-2B scale: Qwen2.5-1.5B → Qwen3-1.7B → Qwen3.5-0.8B · Gemma2-2B → Gemma3-4B-it

~7-9B scale: Qwen2.5-7B → Qwen3-8B → Qwen3.5-9B · Gemma2-9B → Gemma3-4B-it

~27-32B scale: Qwen2.5-32B → Qwen3-32B → Qwen3.5-27B · Gemma2-27B → Gemma3-27B-it

Why this matters

You cannot assume the next generation of a model family will have lower activation peaks. Qwen3.5's dramatic suppression may reflect a deliberate normalization design choice, while Gemma3's increase may stem from instruction-tuning effects. Every new release needs fresh measurement.

Next: What does this mean for actual deployment?


Chapter 7

Training Effects & Quantization

Training stage, instruction tuning, and vision-language adaptation all shift peak magnitudes. A lightweight INT-8 probe confirms that higher peaks mean worse quantization quality.

Training Stage & SFT Effects

Figure: two views, the Ling-mini training-stage trajectory and a Base versus Instruct comparison of layerwise peaks. Across training stages, the Ling-mini peak increases monotonically, from 7,648 at 5T to 10,240 at 20T, a growth factor of 1.34×.

The INT-8 sanity check covers eight representative models. Using per-tensor symmetric quantization with two strategies — max-abs scaling and 99.9% percentile clipping — the paper measures SQNR (signal-to-quantization-noise ratio) at each model's peak hidden layer. Qwen3.5-0.8B, with the lowest $M$, maintains 29.1 dB SQNR. Most high-peak models drop to 0.2–14 dB.

$$\text{SQNR} = 10 \log_{10} \frac{\sum a_i^2}{\sum (a_i - \hat{a}_i)^2} \;\text{dB}$$
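The sketch below applies the SQNR formula to per-tensor symmetric INT-8 quantization under the two scaling strategies mentioned above. It is a toy illustration on synthetic data, not the paper's probe. The helper names, the nearest-rank percentile convention, and the test tensor (4,096 values in $[-1, 1]$ plus one outlier of 300) are assumptions.

```python
import math
from typing import List


def quantize_int8(values: List[float], scale_max: float) -> List[float]:
    """Symmetric per-tensor INT-8: round to a grid of step scale_max / 127, clip codes."""
    step = scale_max / 127.0
    recon = []
    for v in values:
        q = max(-127, min(127, round(v / step)))  # integer code in [-127, 127]
        recon.append(q * step)                    # dequantized value
    return recon


def sqnr_db(values: List[float], recon: List[float]) -> float:
    """SQNR = 10 * log10(sum a_i^2 / sum (a_i - a_hat_i)^2), in dB."""
    signal = sum(v * v for v in values)
    noise = sum((v - r) ** 2 for v, r in zip(values, recon))
    return float("inf") if noise == 0.0 else 10.0 * math.log10(signal / noise)


def max_abs(values: List[float]) -> float:
    return max(abs(v) for v in values)


def percentile_abs(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of |values| (one of several possible conventions)."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct / 100.0 * len(mags)))
    return mags[rank - 1]


# Synthetic tensor: a well-behaved bulk, then the same bulk plus one outlier.
bulk = [math.sin(i) for i in range(4096)]
with_outlier = bulk + [300.0]

sqnr_bulk = sqnr_db(bulk, quantize_int8(bulk, max_abs(bulk)))
sqnr_maxabs = sqnr_db(with_outlier, quantize_int8(with_outlier, max_abs(with_outlier)))
clip = percentile_abs(with_outlier, 99.9)
sqnr_clip = sqnr_db(with_outlier, quantize_int8(with_outlier, clip))

print(f"no outlier, max-abs:     {sqnr_bulk:.1f} dB")
print(f"with outlier, max-abs:   {sqnr_maxabs:.1f} dB")
print(f"with outlier, 99.9% clip: {sqnr_clip:.1f} dB")
```

In this toy, a single outlier stretches the max-abs grid so far that every bulk value rounds to zero, which lowers SQNR sharply. Percentile clipping restores resolution for the bulk but clips the outlier itself, and that one clipping error then dominates the noise sum. How the two strategies compare on real activations depends on the tensor, which is what the paper's probe measures.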

INT-8 Quantization Sanity Check

Figure: SQNR per model. The dashed line marks 20 dB, a reasonable quality threshold. Low-peak Qwen3.5 models maintain high SQNR while high-peak models suffer.

Why this matters

Maximum activation magnitude translates directly into quantization reconstruction error through scale selection. This makes $M$ a practical model-card statistic — not just a descriptive outlier measure. Anyone deploying 8-bit (or lower) inference should know their model's $M$ before choosing a quantization strategy.

Next: What should you take away from all this?


Chapter 8

What to Measure Before You Deploy

The paper's central argument is compact: $M = \max|a|$ is a family- and architecture-dependent model property that should be measured and reported alongside every open-weight release.

The paper's results condense into five findings relevant to deployment:

Five Key Findings

1. Four orders of magnitude. Global maxima range from ~100 (Qwen3.5) to ~700K (Gemma3-27B-it) at comparable parameter counts.

2. MoE peaks are 14–23× lower. In two matched pairs, MoE checkpoints show far smaller activation extremes than their dense counterparts.

3. The residual stream dominates. In 22/24 models, the global maximum lives in hidden states, not attention or MLP outputs.

4. SFT mainly compresses late layers. Instruction tuning reduces final-layer peaks by up to 48% but leaves mid-layer peaks unchanged.

5. Higher peaks → worse quantization. INT-8 SQNR drops from 29.1 dB for low-peak models to near 0 dB for high-peak models under max-abs scaling.

The Bottom Line

$M = \max|a|$ is not predicted by parameter count alone. It is a model property tied to family, architecture, and training stage — and it should be reported alongside any open-weight release before low-bit deployment. The code is available at github.com/clx1415926/Max_act_llm.