When you run a large language model, some of its internal numbers — called activations — can get staggeringly large. If you're trying to compress the model to run on cheaper hardware using fewer bits (quantization), those extreme values become a problem: the quantization scale has to stretch to cover the outliers, wasting precision on the vast majority of normal values. This paper asks a simple but unanswered question: across today's open-source LLMs, just how large do activations get?
The authors run 27 models from 8 families (including Qwen, Gemma, Ling, and GPT-OSS lineages) through an identical measurement pipeline, with the same 5,000 input texts and the same hooks capturing every layer's activations, and measure the global maximum $M = \max |a|$. Think of it as an MRI scan for each model's numerical nervous system. What they find is startling: at similar parameter counts, the worst-case activation varies by nearly four orders of magnitude depending on which model family you pick. A Qwen3.5 model peaks around 132; Gemma3-27B-it reaches 696,320.
The headline result is that maximum activation is a model property tied to family, architecture, and training, not just size. Mixture-of-experts (MoE) models have peaks 14–23× lower than dense counterparts. Instruction tuning mainly compresses late-layer peaks. And a quick INT-8 quantization check confirms that higher peaks translate directly into worse reconstruction quality. The paper argues that $M$ should be reported on every model card, right alongside parameter count and benchmark scores.
Twenty-four checkpoints, eight families, one measurement pipeline. The result is a landscape where peak activations vary by nearly four orders of magnitude — and parameter count alone can't predict where any model lands.
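As a quick sanity check on the "nearly four orders of magnitude" figure, the spread between two peaks can be expressed as a base-10 logarithm of their ratio. The helper name `orders_of_magnitude` is made up for this sketch; the two inputs are the Qwen3.5 and Gemma3-27B-it peaks quoted in the introduction.

```python
import math


def orders_of_magnitude(low: float, high: float) -> float:
    """Base-10 log of the ratio between two positive peak values."""
    if low <= 0 or high <= 0:
        raise ValueError("peaks must be positive")
    return math.log10(high / low)


# Peaks quoted in the text: a Qwen3.5 model (about 132) and Gemma3-27B-it (696,320).
spread = orders_of_magnitude(132, 696_320)
print(f"ratio = {696_320 / 132:,.0f}x, spread = {spread:.2f} orders of magnitude")
```

The ratio comes out a little above 5,000×, which is why "nearly four" is the accurate phrasing rather than a full four orders.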
The primary metric throughout this paper is $M = \max |a|$: the single largest absolute activation observed across all hooked components (embeddings, hidden states, attention outputs, MLP/MoE outputs, SwiGLU gates, final LayerNorm) and all layers during a forward pass over the 5,000-sample evaluation corpus.
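One way to picture this metric is as a running maximum that every hooked component reports into. The sketch below is only an illustration in plain Python: the class `PeakTracker`, its method names, and the component labels are assumptions for this example, and the numbers are synthetic rather than measurements from any model. In a real pipeline the `update` call would presumably sit inside a framework-level forward hook.

```python
from typing import Dict, Iterable, Optional, Tuple


class PeakTracker:
    """Keeps the running maximum of |a| per (component, layer) key."""

    def __init__(self) -> None:
        self.peaks: Dict[Tuple[str, int], float] = {}

    def update(self, component: str, layer: int, values: Iterable[float]) -> None:
        """Fold one batch of activations into the running per-key peak."""
        batch_peak = max((abs(v) for v in values), default=0.0)
        key = (component, layer)
        self.peaks[key] = max(self.peaks.get(key, 0.0), batch_peak)

    def global_max(self) -> float:
        """M: the largest peak over every component and layer seen so far."""
        return max(self.peaks.values(), default=0.0)

    def argmax(self) -> Optional[Tuple[str, int]]:
        """The (component, layer) key where the global maximum was observed."""
        return max(self.peaks, key=self.peaks.get) if self.peaks else None


tracker = PeakTracker()
# Synthetic activations for illustration only.
tracker.update("hidden_state", 0, [0.3, -1.2, 0.8])
tracker.update("mlp_out", 1, [4.0, -250.0, 2.5])
tracker.update("hidden_state", 0, [-3.1, 0.2])

M = tracker.global_max()
where = tracker.argmax()
print(M, where)
```

Keeping per-key peaks rather than a single scalar means the same pass also yields the layerwise view, since the global $M$ is just the maximum over the stored keys.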
Chart: per-model values shown as bars, grouped by family. The × markers indicate models failing the Sun criterion.
The peak doesn't live in one universal layer. It accumulates through two distinct depth patterns — a sudden jump-and-plateau in some families, and a gradual build in others. The pattern is a fingerprint of the model family.
The layerwise trajectory describes how peak activation magnitude changes with network depth. The paper identifies two broad patterns. The jump-and-plateau pattern (Qwen2.5, GPT-OSS, Ling) shows a sharp magnitude rise in early or middle layers, followed by a sustained high plateau. The gradual-accumulation pattern (Qwen3.5, Gemma) shows a smoother increase that often peaks in later layers.
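To make the two shapes concrete, the sketch below maps layer indices to normalized depth and locates the largest layer-to-layer rise on a log scale. This heuristic is an assumption made for illustration, not the paper's classification procedure, and both trajectories are synthetic.

```python
import math
from typing import List, Tuple


def normalized_depth(num_layers: int) -> List[float]:
    """Map layer indices 0..num_layers-1 onto [0, 1]."""
    if num_layers < 2:
        return [0.0] * num_layers
    return [i / (num_layers - 1) for i in range(num_layers)]


def largest_jump(peaks: List[float]) -> Tuple[float, float]:
    """Return (normalized depth reached by the jump, jump size in log10 units)
    for the largest increase between consecutive layers."""
    if len(peaks) < 2 or min(peaks) <= 0:
        raise ValueError("need at least two positive per-layer peaks")
    depths = normalized_depth(len(peaks))
    logs = [math.log10(p) for p in peaks]
    steps = [logs[i + 1] - logs[i] for i in range(len(logs) - 1)]
    i = max(range(len(steps)), key=steps.__getitem__)
    return depths[i + 1], steps[i]


# Synthetic per-layer peaks, chosen only to mimic the two described shapes.
jump_plateau = [2.0, 3.0, 900.0, 950.0, 1000.0, 980.0]
gradual = [2.0, 6.0, 18.0, 55.0, 160.0, 500.0]

depth_j, size_j = largest_jump(jump_plateau)
depth_g, size_g = largest_jump(gradual)
print(f"jump-and-plateau: biggest rise of {size_j:.2f} log10 units at depth {depth_j:.2f}")
print(f"gradual:          biggest rise of {size_g:.2f} log10 units at depth {depth_g:.2f}")
```

In the jump-and-plateau example a single step accounts for more than two orders of magnitude, while in the gradual example no single step exceeds about half an order.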
Chart: layerwise peak trajectories, overlaid per model family. The x-axis is normalized depth (0 = first layer, 1 = last).
Prior work asked "does this model have massive activations?" — a yes-or-no question. This paper replaces that binary flag with a continuous measurement, and shows the two views can disagree.
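The Sun criterion (from Sun et al., 2024) defines a massive activation coordinate $x_i$ as one satisfying both an absolute threshold, $|x_i| > 100$, and a local ratio condition relative to the median magnitude of the same hidden state:

$$|x_i| \ge 1000 \cdot \operatorname{median}_j |x_j|$$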
A model passes if any hidden layer contains at least one token-feature coordinate satisfying both thresholds. Four checkpoints fail: Qwen2.5-1.5B (ratio ≈ 574, below 1000×), and three Qwen3.5 models whose activation scale is systematically suppressed.
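The pass/fail rule can be sketched as a small predicate. Two assumptions are worth flagging: the local reference is taken to be the median magnitude of the same hidden-state vector (as Sun et al. describe it), and the ratio comparison is written as "at least 1000×". The function names and the example vectors are hypothetical; the second vector is constructed so its ratio is 574, echoing the failing case mentioned above.

```python
from statistics import median
from typing import Iterable, List, Sequence


def massive_coordinates(
    hidden: Sequence[float],
    abs_threshold: float = 100.0,
    ratio_threshold: float = 1000.0,
) -> List[int]:
    """Indices whose magnitude clears both the absolute and the local-ratio test.

    Assumption: the local reference is the median |x| of this hidden state.
    """
    mags = [abs(x) for x in hidden]
    med = median(mags)
    return [
        i
        for i, m in enumerate(mags)
        if m > abs_threshold and m >= ratio_threshold * med
    ]


def passes_sun_criterion(hidden_states: Iterable[Sequence[float]]) -> bool:
    """True if any hidden state has at least one coordinate passing both tests."""
    return any(massive_coordinates(h) for h in hidden_states)


# Synthetic hidden states; the median magnitude of each is 0.5.
clear_pass = [0.5, -0.4, 0.6, 800.0, -0.3]   # ratio 1600: passes both tests
ratio_fail = [0.5, -0.4, 0.6, 287.0, -0.3]   # ratio 574: above 100, below 1000x

print(massive_coordinates(clear_pass))
print(massive_coordinates(ratio_fail))
print(passes_sun_criterion([ratio_fail, clear_pass]))
```

The second vector shows how a model can carry activations well above 100 and still be labeled as having no massive activations, which is the kind of disagreement the continuous measurement is meant to expose.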
Chart: pass/fail classification as the absolute threshold and the local-ratio threshold vary. Each point is a model's representative-layer peak.
Does a bigger model always produce larger activations? Within a single family, usually yes, though Gemma2-9B has lower peaks than Gemma2-2B. Across families, parameter count stops being predictive.
Figure 5 in the paper compares checkpoints of different sizes within the same family. Most families (Qwen2.5, Qwen3.5, Gemma3) show a stable within-family scale effect: $M$ increases with parameter count. Gemma2 is the main non-monotonic exception, with 9B peaking below 2B.
Chart: global maximum activation versus model size, per family.
Mixture-of-experts models activate fewer parameters per token, and they also produce dramatically lower peak activations. Two matched pairs show 14–23× reductions.
The paper compares two matched MoE-vs-dense pairs at the ~30B scale. For Qwen3, the MoE checkpoint (30B-A3B) has $M = 1{,}512$ versus $M = 35{,}328$ for the dense 32B — a $23.4\times$ reduction. For Qwen3.5, the MoE (35B-A3B) has $M = 132$ versus $M = 1{,}546$ for the dense 27B — a $14.0\times$ reduction (using the global $M$ across all components, not the representative-layer Top-1 from Table 1).
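The reduction factor is simply the dense peak divided by the MoE peak. As a worked case, the snippet below uses the Qwen3 values quoted above; the function name `reduction_factor` is made up for this sketch.

```python
def reduction_factor(dense_peak: float, moe_peak: float) -> float:
    """How many times lower the MoE peak is than the dense peak."""
    if moe_peak <= 0:
        raise ValueError("moe_peak must be positive")
    return dense_peak / moe_peak


# Qwen3 pair from the text: dense 32B (M = 35,328) vs MoE 30B-A3B (M = 1,512).
qwen3 = reduction_factor(dense_peak=35_328, moe_peak=1_512)
print(f"Qwen3 dense/MoE reduction: {qwen3:.1f}x")
```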
Chart: MoE versus dense peaks for each matched pair, with the reduction ratio.
Newer isn't always better for activation range. Qwen follows an inverted-V trajectory across generations; Gemma shows a sharp increase from Gemma2 to Gemma3.
Figure 7 compares generational trends at similar model sizes. Qwen exhibits an inverted-V: $M$ increases from Qwen2.5 to Qwen3, then drops sharply in Qwen3.5. Gemma shows a monotonic increase from Gemma2 to Gemma3 in both matched size groups. These cross-generation trends are highly family-specific and break any assumption of monotonic scaling with release time.
Chart: Qwen and Gemma generations compared within each scale group, with points connected per family to show the trajectory.
Training stage, instruction tuning, and vision-language adaptation all shift peak magnitudes. A lightweight INT-8 probe confirms that higher peaks mean worse quantization quality.
Two views are covered. For training stage, the Ling-mini trajectory shows a monotonic increase. For SFT, Base and Instruct layerwise peaks are compared.
The INT-8 sanity check covers eight representative models. Using per-tensor symmetric quantization with two strategies — max-abs scaling and 99.9% percentile clipping — the paper measures SQNR (signal-to-quantization-noise ratio) at each model's peak hidden layer. Qwen3.5-0.8B, with the lowest $M$, maintains 29.1 dB SQNR. Most high-peak models drop to 0.2–14 dB.
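The mechanics of such a probe can be sketched in plain Python. Everything below is an assumption-laden illustration rather than the paper's code: it uses 127 symmetric levels, a nearest-rank percentile, SQNR computed over the whole tensor, and a synthetic tensor (Gaussian values plus one injected extreme value) in place of real hidden states.

```python
import math
import random
from typing import List, Sequence, Tuple


def fake_quant_int8(x: Sequence[float], clip: float) -> List[float]:
    """Per-tensor symmetric INT8: scale by clip/127, round, clamp, dequantize."""
    scale = clip / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in x]


def sqnr_db(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Signal-to-quantization-noise ratio in decibels."""
    signal = sum(v * v for v in x)
    noise = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)


def abs_percentile(x: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of |x| (one of several common definitions)."""
    mags = sorted(abs(v) for v in x)
    rank = math.ceil(pct / 100.0 * len(mags))
    return mags[min(len(mags), max(rank, 1)) - 1]


def report(x: Sequence[float]) -> Tuple[float, float]:
    """SQNR under max-abs scaling and under 99.9th-percentile clipping."""
    max_abs = max(abs(v) for v in x)
    p999 = abs_percentile(x, 99.9)
    return (
        sqnr_db(x, fake_quant_int8(x, max_abs)),
        sqnr_db(x, fake_quant_int8(x, p999)),
    )


rng = random.Random(0)
bulk = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic typical values
with_outlier = bulk + [100.0]                        # one synthetic extreme value

sqnr_clean_maxabs, sqnr_clean_clip = report(bulk)
sqnr_outlier_maxabs, sqnr_outlier_clip = report(with_outlier)

print(f"no outlier:   max-abs {sqnr_clean_maxabs:.1f} dB, clipped {sqnr_clean_clip:.1f} dB")
print(f"with outlier: max-abs {sqnr_outlier_maxabs:.1f} dB, clipped {sqnr_outlier_clip:.1f} dB")
```

On this synthetic tensor, a single large value lowers the max-abs SQNR because the step size grows with the peak. Percentile clipping keeps the step small, but the clipped extreme value then contributes a large reconstruction error of its own, so neither strategy fully escapes the outlier.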
Chart: SQNR per model. The dashed line marks 20 dB, a reasonable quality threshold. Low-peak Qwen3.5 models maintain high SQNR while high-peak models suffer.
The paper's central argument is compact: $M = \max|a|$ is a family- and architecture-dependent model property that should be measured and reported alongside every open-weight release.
The paper establishes three deployment takeaways: