An Interactive Reading of

Measuring Maximum Activations in Open Large Language Models

The paper, in plain English

When you run a large language model, some of its internal numbers — called activations — can get staggeringly large. If you're trying to compress the model to run on cheaper hardware using fewer bits (quantization), those extreme values become a problem: the quantization scale has to stretch to cover the outliers, wasting precision on the vast majority of normal values. This paper asks a simple but unanswered question: across today's open-source LLMs, just how large do activations get?

The authors run 24 checkpoints from 8 families (spanning the Qwen, Gemma, Ling, and GPT-OSS lines) through an identical measurement pipeline, with the same 5,000 input texts and the same hooks capturing every layer's activations, and measure the global maximum $M = \max |a|$. Think of it as an MRI scan for each model's numerical nervous system. What they find is startling: at similar parameter counts, the worst-case activation varies by nearly four orders of magnitude depending on which model family you pick. A Qwen3.5 model peaks around 132; Gemma3-27B-it reaches 696,320.

The headline result is that maximum activation is a model property tied to family, architecture, and training, not just size. Mixture-of-experts (MoE) models have peaks 14–23× lower than their dense counterparts. Instruction tuning mainly compresses late-layer peaks. And a quick INT-8 quantization check confirms that higher peaks translate directly into worse reconstruction quality. The paper argues that $M$ should be reported on every model card, right alongside parameter count and benchmark scores.

I
Four Orders of Magnitude
Global maximum activations range from ~100 (Qwen3.5) to ~700K (Gemma3-27B-it) at comparable model sizes — a range no one had systematically measured.
II
MoE Suppresses Peaks
Mixture-of-experts architectures produce 14–23× lower peak activations than dense models at similar total parameter counts, a strong architectural effect.
III
The Residual Stream Carries the Extremes
In 22 of 24 models, the global maximum lives in the residual stream (hidden states), not in attention or MLP outputs, so quantization policies should target the residual stream first.
Chapter 1

The Activation Landscape

Twenty-four checkpoints, eight families, one measurement pipeline. The result is a landscape where peak activations vary by nearly four orders of magnitude — and parameter count alone can't predict where any model lands.

The primary metric throughout this paper is $M = \max |a|$: the single largest absolute activation observed across all hooked components (embeddings, hidden states, attention outputs, MLP/MoE outputs, SwiGLU gates, final LayerNorm) and all layers during a forward pass over the 5,000-sample evaluation corpus.

$$M = \max_{\text{component},\, \text{layer},\, \text{sample}} |a|$$
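
As a minimal sketch of this definition, the snippet below scans a nested container of captured activations and returns $M$ together with the component and layer where it occurs. The data layout (`activations[component][layer]` as a flat list of floats), the function name `global_max`, and the toy values are assumptions made for illustration; in a real run the values would come from forward hooks, and the paper's own pipeline may be organized differently.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: activations[component][layer] is a flat list of floats
# gathered over all samples (in practice, filled in by forward hooks).
Activations = Dict[str, List[List[float]]]


def global_max(activations: Activations) -> Tuple[float, str, int]:
    """Return (M, component, layer), where M is the largest |a| seen anywhere."""
    best = (0.0, "", -1)
    for component, layers in activations.items():
        for layer_idx, values in enumerate(layers):
            if not values:
                continue
            layer_peak = max(abs(v) for v in values)
            if layer_peak > best[0]:
                best = (layer_peak, component, layer_idx)
    return best


# Toy values, made up for illustration only.
toy = {
    "hidden_states": [[0.5, -1.2, 3.0], [0.8, -240.0, 2.1]],
    "attention_out": [[0.1, 0.4, -0.3], [1.5, -6.0, 0.2]],
    "mlp_out": [[0.2, -0.9, 0.7], [12.0, -3.3, 0.6]],
}
M, component, layer = global_max(toy)
print(M, component, layer)  # 240.0 hidden_states 1
```

Returning the location alongside the value makes it easy to check which component carries the peak, which is the question the later chapters keep coming back to.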

Global Maximum Activations Across 24 Checkpoints

In the chart, × markers indicate models that fail the Sun criterion (defined in Chapter 3).

Why this matters
Cross-family variation in $M$ is much larger than within-family scaling. Qwen3.5 and Gemma3 differ by roughly 5,000× at similar parameter counts. For anyone deploying quantized inference, this means the model family — not just the parameter count — determines whether your 8-bit pipeline will struggle.
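
To make the "roughly 5,000×" figure concrete, here is the arithmetic using the two extreme peaks quoted in the overview (about 132 for a Qwen3.5 model and 696,320 for Gemma3-27B-it). The variable names are illustrative only.

```python
import math

# Peak values quoted in this article's overview.
m_low = 132.0        # a Qwen3.5 model
m_high = 696_320.0   # Gemma3-27B-it

ratio = m_high / m_low
orders = math.log10(ratio)
print(f"ratio = {ratio:.0f}x, about {orders:.1f} orders of magnitude")
# ratio = 5275x, about 3.7 orders of magnitude
```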
Chapter 2

Where Maximum Activations Form

The peak doesn't live in one universal layer. It accumulates through two distinct depth patterns — a sudden jump-and-plateau in some families, and a gradual build in others. The pattern is a fingerprint of the model family.

The layerwise trajectory describes how peak activation magnitude changes with network depth. The paper identifies two broad patterns. The jump-and-plateau pattern (Qwen2.5, GPT-OSS, Ling) shows a sharp magnitude rise in early or middle layers, followed by a sustained high plateau. The gradual-accumulation pattern (Qwen3.5, Gemma) shows a smoother increase that often peaks in later layers.
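
One way to tell the two patterns apart numerically is sketched below. This is an illustrative heuristic, not the paper's method: it labels a trajectory "jump-and-plateau" when a single layer-to-layer step accounts for at least half of the total log-scale rise up to the maximum. The function name `describe_trajectory`, the `jump_share` threshold, and the synthetic trajectories are all assumptions.

```python
import math
from typing import List


def describe_trajectory(layer_peaks: List[float], jump_share: float = 0.5) -> dict:
    """Summarize a layerwise peak trajectory (assumes strictly positive peaks).

    Illustrative heuristic, not from the paper: "jump-and-plateau" means one
    layer-to-layer step covers at least `jump_share` of the total log10 rise
    from the first layer to the maximum.
    """
    n = len(layer_peaks)
    if n < 2:
        raise ValueError("need at least two layers")
    logs = [math.log10(p) for p in layer_peaks]
    peak_idx = max(range(n), key=lambda i: layer_peaks[i])
    total_rise = logs[peak_idx] - logs[0]
    biggest_step = max(logs[i + 1] - logs[i] for i in range(n - 1))
    if total_rise <= 0:
        pattern = "flat-or-decreasing"
    elif biggest_step / total_rise >= jump_share:
        pattern = "jump-and-plateau"
    else:
        pattern = "gradual-accumulation"
    # Normalized depth: 0 = first layer, 1 = last layer.
    return {"pattern": pattern, "peak_depth": peak_idx / (n - 1)}


# Synthetic trajectories, made up for illustration only.
jump = [2.0, 3.0, 900.0, 960.0, 950.0, 940.0]
gradual = [2.0, 6.0, 18.0, 50.0, 160.0, 500.0]
print(describe_trajectory(jump))     # {'pattern': 'jump-and-plateau', 'peak_depth': 0.6}
print(describe_trajectory(gradual))  # {'pattern': 'gradual-accumulation', 'peak_depth': 1.0}
```

Working in log space keeps a jump from 3 to 900 from being drowned out by later small wiggles around the plateau.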

Layerwise Peak Trajectories by Family

The x-axis of the chart is normalized depth (0 = first layer, 1 = last).

Why this matters
Peak depth has no universal location. Even within the same family, maxima can occur in shallow, middle, or deep layers. The residual stream carries the global maximum in 22/24 checkpoints — meaning quantization and scaling policies should inspect hidden-state peaks rather than only attention or MLP outputs.
Chapter 3

From Binary to Continuous

Prior work asked "does this model have massive activations?" — a yes-or-no question. This paper replaces that binary flag with a continuous measurement, and shows the two views can disagree.

The Sun criterion (from Sun et al., 2024) defines a massive activation coordinate $x_i$ as one satisfying both $|x_i| > 100$ and a local ratio condition:

$$\frac{|x_i|}{\operatorname{median}_{j=1}^{d} |x_j|} > 1000$$

A model passes if any hidden layer contains at least one token-feature coordinate satisfying both thresholds. Four checkpoints fail: Qwen2.5-1.5B (ratio ≈ 574, below 1000×), and three Qwen3.5 models whose activation scale is systematically suppressed.
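
The two conditions can be checked on a single hidden-state vector as sketched below. The function name `massive_coordinates` and the toy vectors are assumptions; the second vector is built so that its ratio equals 574, mirroring the Qwen2.5-1.5B case above, but the individual values are synthetic.

```python
from statistics import median
from typing import List, Sequence


def massive_coordinates(x: Sequence[float],
                        abs_threshold: float = 100.0,
                        ratio_threshold: float = 1000.0) -> List[int]:
    """Indices i with |x_i| > abs_threshold and |x_i| / median_j |x_j| > ratio_threshold."""
    mags = [abs(v) for v in x]
    med = median(mags)
    hits = []
    for i, m in enumerate(mags):
        if m <= abs_threshold:
            continue
        # A zero median makes the ratio unbounded; treat the ratio test as met.
        if med == 0 or m / med > ratio_threshold:
            hits.append(i)
    return hits


# Synthetic vectors: sixteen small coordinates plus one large one.
bulk = [0.5, -0.5] * 8
x_pass = bulk + [900.0]   # ratio = 900 / 0.5 = 1800
x_fail = bulk + [287.0]   # ratio = 287 / 0.5 = 574, above 100 but below 1000x

print(massive_coordinates(x_pass))  # [16]
print(massive_coordinates(x_fail))  # []
```

Under the rule described above, a model would pass if this function returns a non-empty list for at least one token vector in at least one hidden layer.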

Activation Magnitude vs. Local Ratio Explorer

Each point in the explorer is a model's representative-layer peak. At the default thresholds (absolute magnitude 100, local ratio 1000×), 20 models pass and 4 fail.
Why this matters
The binary massive-activation criterion doesn't capture deployment risk. Qwen3.5-9B fails the Sun test but peaks at 956 — easy to quantize. Gemma3-4B-it passes but peaks at 245,760. The continuous metric $M = \max|a|$ is what actually matters for choosing your quantization scale.
Chapter 4

The Scaling Question

Does a bigger model always produce larger activations? Within a single family, often yes, though not always: Gemma2-9B has lower peaks than Gemma2-2B. Across families, the relationship breaks down entirely.

Figure 5 in the paper compares checkpoints of different sizes within the same family. Most families (Qwen2.5, Qwen3.5, Gemma3) show a stable within-family scale effect: $M$ increases with parameter count. Gemma2 is the main non-monotonic exception, with 9B peaking below 2B.

Within-Family Scaling of Global Activation Maxima


Why this matters
For inference and quantization, family identity and generation are at least as important as parameter count. Assuming bigger model = bigger activations is dangerous. A 9B Gemma2 is easier to quantize than a 2B Gemma2, while a 7B Qwen2.5 has peaks 14× larger than a 9B Qwen3.5.
Chapter 5

MoE Versus Dense

Mixture-of-experts models activate fewer parameters per token, and they also produce dramatically lower peak activations. Two matched pairs show 14–23× reductions.

The paper compares two matched MoE-vs-dense pairs at the ~30B scale. For Qwen3, the MoE checkpoint (30B-A3B) has $M = 1{,}512$ versus $M = 35{,}328$ for the dense 32B — a $23.4\times$ reduction. For Qwen3.5, the MoE (35B-A3B) has $M = 132$ versus $M = 1{,}546$ for the dense 27B — a $14.0\times$ reduction (using the global $M$ across all components, not the representative-layer Top-1 from Table 1).

$$\frac{M_{\text{dense}}}{M_{\text{MoE}}} \in [14.0,\; 23.4]$$
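
As a worked example for the Qwen3 pair, the snippet below reproduces the reduction ratio from the two peaks quoted above and shows what it would mean for an 8-bit step size. The step-size formula assumes per-tensor symmetric INT-8 with max-abs scaling (step $= M / 127$); that choice is an assumption for illustration, not a statement about how any particular runtime quantizes these models.

```python
# Peaks quoted above for the Qwen3 pair.
m_dense = 35_328.0   # Qwen3 dense 32B
m_moe = 1_512.0      # Qwen3 MoE 30B-A3B

reduction = m_dense / m_moe

# Assumption: symmetric int8 with max-abs scaling, so one step equals M / 127.
step_dense = m_dense / 127
step_moe = m_moe / 127

print(f"reduction = {reduction:.1f}x")                                  # reduction = 23.4x
print(f"int8 step: dense = {step_dense:.1f}, MoE = {step_moe:.1f}")     # dense = 278.2, MoE = 11.9
```

Under this scaling, any activation smaller than half a step rounds to zero, so the dense checkpoint would lose everything below roughly 139 while the MoE checkpoint would lose only values below roughly 6.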

MoE vs. Dense at Matched Scale

Matched pairs (both MoE vs dense at ~30B scale): Qwen3 30B-A3B vs 32B, and Qwen3.5 35B-A3B vs 27B.

Qwen3 readout: dense peak $M$ = 35,328; MoE peak $M$ = 1,512; reduction ratio 23.4×.
Caveat
This contrast rests on only $n = 2$ matched pairs, both from the Qwen line. The gap is large and consistent, but the evidence is observational rather than causal: with so few pairs, training-recipe differences between the MoE and dense checkpoints could contribute.
Chapter 6

Generational Evolution

Newer isn't always better for activation range. Qwen follows an inverted-V trajectory across generations; Gemma shows a sharp increase from Gemma2 to Gemma3.

Figure 7 compares generational trends at similar model sizes. Qwen exhibits an inverted-V: $M$ increases from Qwen2.5 to Qwen3, then drops sharply in Qwen3.5. Gemma shows a monotonic increase from Gemma2 to Gemma3 in both matched size groups. These cross-generation trends are highly family-specific and break any assumption of monotonic scaling with release time.

Matched-Scale Generational Trends

Points within each family are connected to show the trajectory across generations. The chart covers three scale groups:

~1-2B scale: Qwen2.5-1.5B → Qwen3-1.7B → Qwen3.5-0.8B · Gemma2-2B → Gemma3-4B-it
~7-9B scale: Qwen2.5-7B → Qwen3-8B → Qwen3.5-9B · Gemma2-9B → Gemma3-4B-it
~27-32B scale: Qwen2.5-32B → Qwen3-32B → Qwen3.5-27B · Gemma2-27B → Gemma3-27B-it
Why this matters
You cannot assume the next generation of a model family will have lower activation peaks. Qwen3.5's dramatic suppression may reflect a deliberate normalization design choice, while Gemma3's increase may stem from instruction-tuning effects. Every new release needs fresh measurement.
Chapter 7

Training Effects & Quantization

Training stage, instruction tuning, and vision-language adaptation all shift peak magnitudes. A lightweight INT-8 probe confirms that higher peaks mean worse quantization quality.

Training Stage & SFT Effects

The chart has two views. Training stage: the Ling-mini trajectory shows a monotonic increase, with the peak growing from 7,648 at the 5T checkpoint to 10,240 at the 20T checkpoint, a growth factor of 1.34×. SFT: Base and Instruct layerwise peaks are compared side by side.

The INT-8 sanity check covers eight representative models. Using per-tensor symmetric quantization with two strategies — max-abs scaling and 99.9% percentile clipping — the paper measures SQNR (signal-to-quantization-noise ratio) at each model's peak hidden layer. Qwen3.5-0.8B, with the lowest $M$, maintains 29.1 dB SQNR. Most high-peak models drop to 0.2–14 dB.

$$\text{SQNR} = 10 \log_{10} \frac{\sum a_i^2}{\sum (a_i - \hat{a}_i)^2} \;\text{dB}$$
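
The probe can be imitated on synthetic data with the sketch below. It applies per-tensor symmetric INT-8 quantization with both strategies and evaluates the SQNR formula above. Everything here is an assumption for illustration: the helper names, the nearest-rank percentile convention, and the tensor itself (20,000 unit-scale Gaussian values, with or without one outlier of magnitude 200). It is not the paper's code and it will not reproduce the paper's numbers.

```python
import math
import random
from typing import List


def quantize_int8(values: List[float], clip: float) -> List[float]:
    """Symmetric int8 round trip: clamp to [-clip, clip], round, dequantize (clip > 0)."""
    scale = clip / 127.0
    return [round(max(-clip, min(clip, v)) / scale) * scale for v in values]


def sqnr_db(values: List[float], recon: List[float]) -> float:
    """SQNR in dB: 10 * log10(sum a_i^2 / sum (a_i - a_hat_i)^2)."""
    signal = sum(v * v for v in values)
    noise = sum((v - r) ** 2 for v, r in zip(values, recon))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)


def percentile_abs(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of |values| (one simple convention among several)."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct / 100.0 * len(mags)))
    return mags[rank - 1]


random.seed(0)
bulk = [random.gauss(0.0, 1.0) for _ in range(20_000)]  # synthetic low-peak tensor
spiky = bulk + [200.0]                                  # same tensor plus one outlier

sq_low = sqnr_db(bulk, quantize_int8(bulk, max(abs(v) for v in bulk)))
sq_high_maxabs = sqnr_db(spiky, quantize_int8(spiky, max(abs(v) for v in spiky)))
sq_high_clip = sqnr_db(spiky, quantize_int8(spiky, percentile_abs(spiky, 99.9)))

print(f"low-peak, max-abs:   {sq_low:.1f} dB")
print(f"high-peak, max-abs:  {sq_high_maxabs:.1f} dB")
print(f"high-peak, p99.9:    {sq_high_clip:.1f} dB")
```

In this toy setup, max-abs scaling keeps the outlier but coarsens every ordinary value, while percentile clipping keeps the ordinary values sharp but pays a large error on the clipped outlier. Both land well below the low-peak baseline, which is the qualitative effect the probe is meant to expose.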

INT-8 Quantization Sanity Check

The dashed line in the chart marks 20 dB, a reasonable quality threshold. Low-peak Qwen3.5 models maintain high SQNR while high-peak models suffer.

Why this matters
Maximum activation magnitude translates directly into quantization reconstruction error through scale selection. This makes $M$ a practical model-card statistic — not just a descriptive outlier measure. Anyone deploying 8-bit (or lower) inference should know their model's $M$ before choosing a quantization strategy.
Chapter 8

What to Measure Before You Deploy

The paper's central argument is compact: $M = \max|a|$ is a family- and architecture-dependent model property that should be measured and reported alongside every open-weight release.

The paper's results condense into five findings, each with direct consequences for deployment:

Five Key Findings

1. Four orders of magnitude. Global maxima range from ~100 (Qwen3.5) to ~700K (Gemma3-27B-it) at comparable parameter counts.
2. MoE peaks are 14–23× lower. In two matched pairs, MoE checkpoints show far smaller activation extremes than their dense counterparts, though the comparison is observational.
3. The residual stream dominates. In 22/24 models, the global maximum lives in hidden states, not attention or MLP outputs.
4. SFT compresses late layers only. Instruction tuning reduces final-layer peaks by up to 48% but leaves mid-layer peaks unchanged.
5. Higher peaks → worse quantization. INT-8 SQNR drops from 29.1 dB for low-peak models to near 0 dB for high-peak models under max-abs scaling.
The Bottom Line
$M = \max|a|$ is not predicted by parameter count alone. It is a model property tied to family, architecture, and training stage — and it should be reported alongside any open-weight release before low-bit deployment. The code is available at github.com/clx1415926/Max_act_llm.