An Interactive Reading of

DeepSeek-V4: Towards
Highly Efficient Million-Token
Context Intelligence

DeepSeek-AI
2025 · HuggingFace

The paper, in plain English

Large language models hit a wall when you ask them to think over truly long contexts. Every extra token makes the whole computation more expensive, because the attention mechanism (the core of how the model reads and connects information) compares each token with every other token, so its cost grows quadratically with context length. DeepSeek-V4 asks a simple question: what if we could make attention nearly free, even at a million tokens?

The answer comes from a hybrid attention design that aggressively compresses old tokens while keeping recent ones sharp. Imagine reading a long novel where you summarize each chapter into a single paragraph once you've moved on, but keep the current chapter word-for-word. Add to this a new way of connecting layers that keeps signal strength stable through 61 layers, and a mathematically grounded optimizer that converges faster, and you get models that are both smarter and dramatically cheaper to run.

The headline numbers: at 1M tokens, DeepSeek-V4-Pro needs only 27% of the inference FLOPs and 10% of the KV cache of its predecessor, DeepSeek-V3.2, while matching or beating frontier closed models on reasoning benchmarks. The smaller DeepSeek-V4-Flash goes further, at 10% of the FLOPs and 7% of the KV cache, which makes million-token context far cheaper to serve. DeepSeek-V4-Pro-Max is also the first open model to match GPT-5.4 on Codeforces, with a rating of 3206.

I

Hybrid Attention (CSA + HCA)

Compressed Sparse Attention and Heavily Compressed Attention cut the KV cache to ~2% of a standard BF16 GQA8 baseline at 1M tokens, making million-length contexts practical.

II

Manifold-Constrained Hyper-Connections

Residual connections are constrained to doubly stochastic matrices (the Birkhoff polytope), guaranteeing signal stability across 61+ layers.

III

Muon Optimizer

Matrix orthogonalization via Newton-Schulz iterations replaces Adam for most parameters, delivering faster convergence with fewer instabilities.

Chapter 1

The Architecture Blueprint

DeepSeek-V4 inherits the DeepSeekMoE skeleton but outfits it with three new subsystems — mHC, hybrid attention, and the Muon optimizer — that together break the long-context efficiency barrier.

In plain English

Think of a DeepSeek-V4 model as a massive consulting firm. Most employees (experts) sit idle on any given project, but each client query activates a small, hand-picked team of six. The firm has 384 consultants on staff (Pro) but only dispatches six per question — that is Mixture-of-Experts in a nutshell.

What is new in V4 is how information flows between consulting teams (layers). Previous models used a simple "pass the baton" relay; V4 uses a system called Manifold-Constrained Hyper-Connections that carefully controls how much of the previous team's output to preserve, blend, or transform — keeping the signal strong even after 61 handoffs. Meanwhile, the firm's filing system (attention) has been overhauled so it can look up facts from a million-page archive in seconds, not hours.


Model configurations (Table from paper)

DeepSeek-V4-Flash: 284B total, 13B activated, 43 layers, 256 routed experts

DeepSeek-V4-Pro: 1.6T total, 49B activated, 61 layers, 384 routed experts

Both activate 6 experts per token via $\sqrt{\text{Softplus}(\cdot)}$ affinity scoring.
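The routing rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `route_tokens`, the random logits, and the choice to normalize the selected affinities into gate weights are assumptions made for the example. Only the $\sqrt{\text{Softplus}}$ affinity and the top-6 selection come from the text.

```python
import numpy as np

def route_tokens(logits: np.ndarray, k: int = 6):
    """Pick the top-k experts per token from sqrt(softplus(logit)) affinities."""
    # Numerically stable softplus: log(1 + exp(x)).
    softplus = np.logaddexp(0.0, logits)
    affinity = np.sqrt(softplus)
    # Indices of the k largest affinities per token (unordered within the top-k).
    top_idx = np.argpartition(-affinity, k - 1, axis=-1)[:, :k]
    top_aff = np.take_along_axis(affinity, top_idx, axis=-1)
    # Assumption: gate weights are the selected affinities rescaled to sum to 1.
    gates = top_aff / top_aff.sum(axis=-1, keepdims=True)
    return top_idx, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 256))  # 4 tokens, 256 routed experts (Flash)
experts, gates = route_tokens(logits)
print(experts.shape, gates.shape)   # (4, 6) (4, 6)
```

Each row of `experts` holds the six expert indices chosen for one token, and the matching row of `gates` holds the weights used to combine their outputs.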


The key insight: only about 3% of parameters activate per token (49B of 1.6T for Pro). This sparsity is what makes a trillion-parameter model run at practical speed: you pay for the experts you use, not the ones sitting on the bench.
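The activation ratio is simple arithmetic on the two configurations listed in this chapter. A small sketch (the dictionary layout and function name are just for illustration):

```python
# Parameter counts in billions, taken from the model configurations above.
configs = {
    "DeepSeek-V4-Flash": {"total_b": 284, "activated_b": 13},
    "DeepSeek-V4-Pro": {"total_b": 1600, "activated_b": 49},
}

def activated_fraction(total_b: float, activated_b: float) -> float:
    """Share of all parameters that take part in processing one token."""
    return activated_b / total_b

fractions = {name: activated_fraction(**cfg) for name, cfg in configs.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
# DeepSeek-V4-Flash: 4.6% of parameters active per token
# DeepSeek-V4-Pro: 3.1% of parameters active per token
```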


Chapter 2

Manifold-Constrained Hyper-Connections

Standard residual connections let each Transformer layer add its output to the input. mHC upgrades this: it expands the residual stream, dynamically mixes contributions, and constrains the mixing matrix to stay numerically stable.

In plain English

Imagine a relay race where each runner hands off a baton. In standard Transformers, the baton is simply passed along — each runner adds their contribution and passes the sum. This works for 20-30 layers, but signal degradation creeps in as you stack 60+ layers.

mHC is like giving each runner a mixing board with sliders. The baton expands from one lane to four ($n_{hc} = 4$), and each runner dynamically adjusts how much of each lane to preserve, blend, or feed into their run. The crucial trick: the mixing matrix is constrained to be a doubly stochastic matrix, meaning every row sums to 1, every column sums to 1, and all entries are non-negative. Such a matrix cannot amplify the signal, and because its columns sum to 1 the total across lanes passes through the residual path unchanged, so the signal neither explodes nor vanishes.


Core update rule (Equation 1)

$$X_{l+1} = B_l\, X_l + C_l\, F_l(A_l\, X_l)$$

$X_l \in \mathbb{R}^{n_{hc} \times d}$ is the expanded residual state, $A_l \in \mathbb{R}^{1 \times n_{hc}}$ is the input mapping, $B_l \in \mathbb{R}^{n_{hc} \times n_{hc}}$ is the residual mapping (constrained to doubly stochastic matrices), $C_l \in \mathbb{R}^{n_{hc} \times 1}$ is the output mapping, $F_l$ is the layer function (MoE or attention).
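To make the shapes in Equation 1 concrete, here is a minimal NumPy sketch of one mHC step. It is an illustration under simplifying assumptions: in the model the mappings $A_l$, $B_l$, $C_l$ are produced dynamically, whereas here they are fixed hand-picked matrices, and `np.tanh` stands in for the real layer function $F_l$.

```python
import numpy as np

def mhc_step(X, A, B, C, layer_fn):
    """One update X_{l+1} = B X + C F(A X) with the shapes of Equation 1."""
    layer_in = A @ X                # (1, d): mix the n_hc lanes into one layer input
    layer_out = layer_fn(layer_in)  # (1, d): the attention or MoE block would go here
    return B @ X + C @ layer_out    # (n_hc, d): mixed residual plus broadcast output

n_hc, d = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n_hc, d))

A = np.full((1, n_hc), 1.0 / n_hc)  # hypothetical input mapping: average the lanes
# Hypothetical residual mapping: a blend of the identity and the uniform matrix,
# which is doubly stochastic (rows and columns sum to 1, entries non-negative).
B = 0.7 * np.eye(n_hc) + 0.3 * np.full((n_hc, n_hc), 1.0 / n_hc)
C = np.ones((n_hc, 1))              # hypothetical output mapping: write to every lane

X_next = mhc_step(X, A, B, C, np.tanh)
print(X_next.shape)  # (4, 8)
```

With the layer output switched off, the column sums of `B @ X` equal those of `X`, which is the lane-total preservation that the doubly stochastic constraint provides.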

Doubly stochastic constraint (Equation 2)

$$B_l \in \mathcal{M} \coloneqq \{M \in \mathbb{R}^{n \times n} \mid M\mathbf{1}_n = \mathbf{1}_n,\; \mathbf{1}_n^T M = \mathbf{1}_n^T,\; M \geq 0\}$$

Interactive plot: signal magnitude through layers, constrained (green) vs unconstrained (red). Controls: max eigenvalue of B (default 1.00) and number of layers (default 61). At these defaults both readouts, "Signal at layer 61" and "Unconstrained at layer 61", show 1.00.

Why this matters: the Birkhoff polytope constraint ensures $\|B_l\|_2 \leq 1$ (non-expansive), and the set is closed under multiplication, so the product of the residual mappings of any number of layers is again doubly stochastic and non-expansive. The Sinkhorn-Knopp algorithm, which alternately normalizes rows and columns, maps a matrix onto this manifold in $t_{max} = 20$ iterations.
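Both properties are easy to check numerically. The sketch below runs a plain Sinkhorn-Knopp normalization for 20 iterations on random 4x4 matrices, multiplies 61 of the results together, and measures the spectral norm of the product. The exponential parameterization and the random inputs are assumptions for the demonstration; how the model actually produces the pre-normalization matrix is not covered here.

```python
from functools import reduce

import numpy as np

def sinkhorn_knopp(logits: np.ndarray, t_max: int = 20) -> np.ndarray:
    """Alternately normalize rows and columns of a positive matrix."""
    M = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(t_max):
        M = M / M.sum(axis=1, keepdims=True)  # make every row sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make every column sum to 1
    return M

rng = np.random.default_rng(0)
# One approximately doubly stochastic 4x4 matrix per layer, 61 layers.
mats = [sinkhorn_knopp(rng.normal(scale=0.5, size=(4, 4))) for _ in range(61)]

product = reduce(np.matmul, mats)           # combined residual mapping of all layers
spectral_norm = np.linalg.norm(product, 2)  # largest singular value

print(product.sum(axis=0))  # column sums of the product, all close to 1
print(product.sum(axis=1))  # row sums of the product, all close to 1
print(spectral_norm)        # close to 1: the stacked mapping does not amplify
```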


Chapter 3

Hybrid Attention: CSA & HCA

The dominant cost of long-context inference is attention — quadratic in sequence length. DeepSeek-V4 attacks this with two complementary compression strategies: CSA for selective retrieval and HCA for extreme summarization.

In plain English

Think of reading the entire internet to answer a question. Standard attention would require comparing your question against every single word ever written — clearly impossible at a million tokens. DeepSeek-V4 uses two strategies working in tandem.

CSA (Compressed Sparse Attention) is like having a research assistant who first summarizes every group of 4 pages into a single bullet point, then uses an index to pick out only the most relevant bullet points for your question. HCA (Heavily Compressed Attention) goes further — it summarizes every 128 pages into a single sentence, keeping only the broadest themes. Together, they let the model remember the forest and the trees without paying for every leaf.


CSA compressed KV entry (Equation 12)

$$C_i^{\text{Comp}} = \sum_{j=mi}^{m(i+1)-1} S_j^a \odot C_j^a + \sum_{j=m(i-1)}^{mi-1} S_j^b \odot C_j^b$$

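Each compressed entry merges $2m$ KV entries with learned weights $S$ (softmaxed): the $m$ entries of its own block plus the $m$ entries of the preceding block. Because consecutive windows overlap by $m$ entries, one compressed entry is still produced per $m$ tokens, so the compressed sequence is $1/m$ of the original length.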
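A loop-based NumPy sketch of Equation 12 may help make the overlapping windows concrete. Several details are assumptions, since the text does not spell them out: the softmax is taken jointly over all entries of a window, separately per feature channel; the first window has no preceding block, so it uses only its own $m$ entries; and the names `csa_compress`, `Za`, and `Zb` (the unnormalized weight logits) are invented for the example.

```python
import numpy as np

def csa_compress(Ca, Cb, Za, Zb, m):
    """Merge KV entries into one compressed entry per block of m tokens.

    Ca, Cb: (n, d) the two KV streams C^a and C^b.
    Za, Zb: (n, d) unnormalized logits that become the weights S^a and S^b.
    """
    n, d = Ca.shape
    assert n % m == 0, "sketch assumes the length is a multiple of m"
    out = np.zeros((n // m, d))
    for i in range(n // m):
        cur = slice(m * i, m * (i + 1))       # j = mi .. m(i+1)-1, from stream a
        values, logits = [Ca[cur]], [Za[cur]]
        if i > 0:
            prev = slice(m * (i - 1), m * i)  # j = m(i-1) .. mi-1, from stream b
            values.append(Cb[prev])
            logits.append(Zb[prev])
        values = np.concatenate(values, axis=0)
        logits = np.concatenate(logits, axis=0)
        # Softmax over the window entries, independently for each channel.
        weights = np.exp(logits - logits.max(axis=0, keepdims=True))
        weights /= weights.sum(axis=0, keepdims=True)
        out[i] = (weights * values).sum(axis=0)
    return out

n, d, m = 16, 4, 4
rng = np.random.default_rng(0)
Ca, Cb = rng.normal(size=(n, d)), rng.normal(size=(n, d))
Za, Zb = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = csa_compress(Ca, Cb, Za, Zb, m)
print(out.shape)  # (4, 4): sixteen entries compressed to four
```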

Lightning Indexer scoring (Equation 16)

$$I_{t,s} = \sum_{h=1}^{n_h^I} w_{t,h}^I \cdot \text{ReLU}\!\left(q_{t,h}^I \cdot K_s^{I,\text{Comp}}\right)$$
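Equation 16 for a single query token $t$ translates almost directly into NumPy. In this sketch the sizes, the random inputs, and the helper `select_top_k` are hypothetical; only the weighted sum of ReLU dot products over indexer heads, followed by keeping the top-scoring compressed entries, reflects the text.

```python
import numpy as np

def indexer_scores(q, w, K_comp):
    """Scores I_{t,s} of one query token against every compressed key.

    q: (n_heads, d_index) indexer queries of the token, one per head.
    w: (n_heads,) per-head weights w_{t,h}.
    K_comp: (n_comp, d_index) compressed indexer keys.
    """
    dots = K_comp @ q.T                 # (n_comp, n_heads) dot products
    return np.maximum(dots, 0.0) @ w    # ReLU, then weighted sum over heads

def select_top_k(scores, k):
    """Indices of the k highest-scoring compressed entries, in position order."""
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]
    return np.sort(top)

rng = np.random.default_rng(0)
n_heads, d_index, n_comp = 4, 16, 64    # hypothetical sizes
q = rng.normal(size=(n_heads, d_index))
w = rng.random(n_heads)
K_comp = rng.normal(size=(n_comp, d_index))

scores = indexer_scores(q, w, K_comp)
selected = select_top_k(scores, k=8)
print(scores.shape, selected.shape)     # (64,) (8,)
```

The scoring uses only small dot products and a ReLU, so it is cheap to run against every compressed entry before the full attention is computed on the selected ones.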

Interactive controls (defaults): CSA compression rate $m$ = 4, HCA compression rate $m'$ = 128, sequence length 1024K tokens, CSA top-k = 1024. Readouts at these defaults: KV cache vs baseline 2%, FLOPs reduction 27%.

The efficiency breakthrough: compared to a standard BF16 GQA8 baseline, DeepSeek-V4's KV cache at 1M tokens is roughly 2% of the baseline size. Even against the already-efficient DeepSeek-V3.2, V4-Pro uses only 10% of the KV cache. This is what makes million-token context economically viable.
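To get a feel for the sizes involved, the short calculation below counts entries at this chapter's default settings ($m = 4$, $m' = 128$, top-k 1024, 1024K tokens). It assumes 1024K means 1024 x 1024 tokens, and it counts entries only. It does not reproduce the 2% figure, which also depends on per-entry sizes and on how layers are split between CSA and HCA, neither of which is given here.

```python
def compressed_entries(seq_len: int, rate: int) -> int:
    """Number of compressed KV entries when every `rate` tokens share one entry."""
    return seq_len // rate

seq_len = 1024 * 1024                            # assumption: 1024K = 1,048,576 tokens
csa_entries = compressed_entries(seq_len, 4)     # CSA, m = 4
hca_entries = compressed_entries(seq_len, 128)   # HCA, m' = 128
csa_attended = min(1024, csa_entries)            # CSA reads only its top-k entries

print(csa_entries)   # 262144
print(hca_entries)   # 8192
print(csa_attended)  # 1024
```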


Chapter 4

The Muon Optimizer

Training a trillion-parameter model efficiently requires rethinking the optimizer itself. For most parameters, DeepSeek-V4 replaces Adam with Muon, a momentum-based optimizer that orthogonalizes the update matrix via Newton-Schulz iterations.

In plain English

Imagine you are hiking down a mountain in fog. Adam, the standard optimizer, takes careful steps downhill, adjusting stride per dimension. Muon does something fundamentally different: it takes the gradient, applies momentum (remembering past directions), then orthogonalizes the resulting update matrix, keeping its directions but making all of its singular values roughly equal so that no single direction dominates the step.

This orthogonalization is done by the Newton-Schulz iteration, a mathematical trick that pushes a matrix toward its nearest orthogonal form using only matrix multiplications. The effect is like a hiker who puts equal effort into every useful direction instead of lunging along the steepest one, covering more ground with each step. DeepSeek-V4 uses a hybrid version: 8 "fast convergence" iterations followed by 2 "precision" iterations.


Newton-Schulz iteration (Equation 28)

$$M_k = a\,M_{k-1} + b\,(M_{k-1}M_{k-1}^T)\,M_{k-1} + c\,(M_{k-1}M_{k-1}^T)^2\,M_{k-1}$$

Stage 1 (steps 1-8): $(a,b,c) = (3.4445, -4.7750, 2.0315)$ — rapid convergence.
Stage 2 (steps 9-10): $(a,b,c) = (2, -1.5, 0.5)$ — precision stabilization.

Full Muon update (Algorithm 1)

$$M_t = \mu\, M_{t-1} + G_t, \quad O_t' = \text{HybridNewtonSchulz}(\mu\, M_t + G_t)$$

$$O_t = O_t' \cdot \sqrt{\max(n,m)} \cdot \gamma, \quad W_t = W_{t-1}(1 - \eta\lambda) - \eta\, O_t$$
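The two-stage iteration and the update formulas fit in a short NumPy sketch. The coefficients, the 8 + 2 schedule, and the update equations come from the text. Everything else is an assumption for illustration: dividing by the Frobenius norm before iterating and working on the wide orientation of the matrix are conventions borrowed from public Muon implementations, $n$ and $m$ are taken to be the dimensions of the weight matrix, and the learning rate and $\gamma$ values in the example call are arbitrary.

```python
import numpy as np

STAGE1 = (3.4445, -4.7750, 2.0315)  # steps 1-8: rapid convergence
STAGE2 = (2.0, -1.5, 0.5)           # steps 9-10: precision stabilization

def hybrid_newton_schulz(G, fast_steps=8, precise_steps=2, eps=1e-7):
    """Approximately orthogonalize G with the two-stage iteration of Equation 28."""
    # Assumption: scale by the Frobenius norm so every singular value is at most 1.
    M = G / (np.linalg.norm(G) + eps)
    # Assumption: iterate on the wide orientation so M @ M.T is the smaller Gram matrix.
    transposed = M.shape[0] > M.shape[1]
    if transposed:
        M = M.T
    for step in range(fast_steps + precise_steps):
        a, b, c = STAGE1 if step < fast_steps else STAGE2
        gram = M @ M.T
        M = a * M + b * (gram @ M) + c * (gram @ gram @ M)
    return M.T if transposed else M

def muon_step(W, M_prev, G, lr, gamma, mu=0.95, weight_decay=0.10):
    """One update following the Algorithm 1 formulas; returns (W_t, M_t)."""
    M = mu * M_prev + G                         # momentum buffer M_t
    O_prime = hybrid_newton_schulz(mu * M + G)  # orthogonalize the look-ahead update
    n, m = W.shape                              # assumption: n, m are W's dimensions
    O = O_prime * np.sqrt(max(n, m)) * gamma
    W_new = W * (1.0 - lr * weight_decay) - lr * O
    return W_new, M

rng = np.random.default_rng(0)
W = np.zeros((16, 8))
G = rng.normal(size=W.shape)                    # stand-in gradient
W_new, M_new = muon_step(W, np.zeros_like(W), G, lr=0.01, gamma=0.2)

singular_values = np.linalg.svd(W_new, compute_uv=False)
print(singular_values)  # all close to lr * gamma * sqrt(16) = 0.008
```

Starting from zero weights, the new weights are just the scaled orthogonalized update, so their singular values all land near the same number. That equalization is what the Newton-Schulz step is for.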

Interactive plot: simulated convergence of Muon vs Adam on a quadratic loss. Controls (defaults): momentum $\mu$ = 0.95, Newton-Schulz iterations = 10, weight decay $\lambda$ = 0.10.

Practical impact: Muon converges faster than Adam on the same data, meaning DeepSeek-V4 reaches its final performance with fewer training tokens. The hybrid Newton-Schulz schedule (8 fast + 2 precise) ensures updates are well-conditioned throughout training.


Chapter 5

Training at Scale

Training 1.6T parameters on 33 trillion tokens demands infrastructure breakthroughs — from fused MoE kernels to FP4 quantization-aware training to deterministic reproducibility.

In plain English

Training a frontier model is not unlike running a global logistics network. You have thousands of GPUs spread across racks, each processing different parts of the data. At this scale the bottleneck is often communication rather than computation: GPUs can spend more time waiting for data to arrive from other GPUs than doing actual math.

DeepSeek-V4 solves this with wave-based expert parallelism: instead of waiting for all experts to finish before starting communication, experts are grouped into "waves." While wave 1 computes, wave 2's data is already being transferred, and wave 0's results are being sent back — a fine-grained pipeline that keeps both compute and communication busy at all times.
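A rough feel for why waves help can be had from a toy timing model. The sketch assumes that dispatch, compute, and combine behave like three pipeline stages that can run concurrently on different waves, that work splits evenly across waves, and that waves add no overhead. None of this is taken from the paper, and the timings are made-up units, so the result will not match the figures quoted on this page.

```python
def naive_time(t_dispatch: float, t_compute: float, t_combine: float) -> float:
    """No overlap: send everything, compute everything, send everything back."""
    return t_dispatch + t_compute + t_combine

def wave_time(t_dispatch: float, t_compute: float, t_combine: float,
              n_waves: int) -> float:
    """Three-stage pipeline over equal waves.

    The first wave passes through all three stages; each later wave then
    finishes one slowest-stage interval after the wave before it.
    """
    stages = [t / n_waves for t in (t_dispatch, t_compute, t_combine)]
    return sum(stages) + (n_waves - 1) * max(stages)

# Hypothetical workload: communication in each direction costs half the compute.
baseline = naive_time(1.0, 2.0, 1.0)
pipelined = wave_time(1.0, 2.0, 1.0, n_waves=4)
speedup = baseline / pipelined
print(baseline, pipelined, speedup)  # 4.0 2.5 1.6
```

In this toy model the speedup is capped by the slowest stage: with more waves the total time approaches the compute time alone, which is the sense in which communication gets hidden.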


Communication-Computation Balance Condition

$$\frac{C}{B} \leq 2d = 6144 \;\text{FLOPs/Byte}$$

$C$ = peak compute throughput, $B$ = interconnect bandwidth, $d$ = hidden dimension (3072 for Pro). Each GB/s of bandwidth can hide communication for 6.1 TFLOP/s of compute.
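The balance condition is easy to evaluate for a given machine. The helper names below are made up for the example, and the two hardware configurations in the usage lines are hypothetical numbers chosen only to show one case on each side of the threshold.

```python
def max_flops_per_byte(hidden_dim: int) -> int:
    """Largest compute-to-bandwidth ratio C/B that still hides communication (2d)."""
    return 2 * hidden_dim

def hideable_tflops(bandwidth_gb_s: float, hidden_dim: int) -> float:
    """Compute throughput (TFLOP/s) whose communication the bandwidth can hide."""
    flops_per_s = max_flops_per_byte(hidden_dim) * bandwidth_gb_s * 1e9
    return flops_per_s / 1e12

def communication_hidden(peak_tflops: float, bandwidth_gb_s: float,
                         hidden_dim: int) -> bool:
    """Check the condition C / B <= 2d for a given machine."""
    ratio = (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)  # FLOPs per byte
    return ratio <= max_flops_per_byte(hidden_dim)

d = 3072  # hidden dimension for Pro, from the text
print(max_flops_per_byte(d))                   # 6144
print(hideable_tflops(1.0, d))                 # about 6.144 TFLOP/s per GB/s
print(communication_hidden(1000.0, 200.0, d))  # True: 5000 FLOPs/byte is within budget
print(communication_hidden(1000.0, 100.0, d))  # False: 10000 FLOPs/byte exceeds it
```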

Interactive plot: expert parallelism speedup by wave scheduling strategy. Controls (defaults): compute and communication overlap 0.90, number of expert waves 4, FP4 weight compression on. Readouts at these defaults: speedup over naive 1.92x, memory savings (FP4) ~50%.

Infrastructure highlights: the fused mega-kernel (MegaMoE) achieves 1.50-1.73x speedup for general inference and up to 1.96x for RL rollouts. FP4 quantization-aware training for MoE expert weights is lossless when dequantizing to FP8, because FP8's extra exponent bits absorb the scale variations.
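The lossless claim can be illustrated by brute force, under assumptions the text does not state: that FP4 means the E2M1 layout (one mantissa bit), that FP8 means the E4M3 layout with a maximum of 448, and that block scales are powers of two. With those assumptions, the sketch enumerates every representable value and checks that each scaled FP4 value is also an exact FP8 value.

```python
def fp4_e2m1_values():
    """All values of a 4-bit float with 2 exponent bits and 1 mantissa bit (assumed)."""
    magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    return sorted({sign * v for v in magnitudes for sign in (1.0, -1.0)})

def fp8_e4m3_values():
    """All finite values of an 8-bit float with 4 exponent and 3 mantissa bits (assumed)."""
    values = set()
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # this bit pattern encodes NaN in the assumed layout
            if e == 0:
                v = (m / 8) * 2.0 ** -6            # subnormal numbers
            else:
                v = (1 + m / 8) * 2.0 ** (e - 7)   # normal numbers, bias 7
            values.update((v, -v))
    return values

fp4 = fp4_e2m1_values()
fp8 = fp8_e4m3_values()

def exactly_representable(scale: float) -> bool:
    """True if every FP4 value times `scale` is itself an FP8 value."""
    return all(v * scale in fp8 for v in fp4)

lossless_exponents = [k for k in range(-12, 13) if exactly_representable(2.0 ** k)]
print(min(lossless_exponents), max(lossless_exponents))  # -8 6
```

A power-of-two scale only shifts the exponent, and the single FP4 mantissa bit fits inside FP8's three, so the conversion is exact as long as the shifted exponent stays within FP8's range.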


Chapter 6

The Benchmark Arena

DeepSeek-V4-Pro-Max redefines the open-source frontier: state-of-the-art on knowledge, competitive with closed models on reasoning, and the first open model to match GPT-5.4 on competitive programming.

In plain English

Every few months, a new AI model claims to be "the best." What makes DeepSeek-V4 different is where it wins. On knowledge benchmarks (SimpleQA, Chinese-SimpleQA), it beats every open-source model by 20+ percentage points. On competitive coding (Codeforces), it achieves a rating of 3206 — the first time any open model has matched a leading closed model on this task.

But the real story is reasoning effort scaling. The same model has three modes: Non-Think (fast, intuitive), Think-High (deliberate), and Think-Max (maximum reasoning). Going from Non-Think to Think-Max on HLE (Humanity's Last Exam) lifts accuracy from 7.7% to 37.7%, a nearly 5x improvement from the same weights, just by thinking longer.

Model configurations compared in the benchmark chart:

DeepSeek-V4-Pro-Max: 1.6T total, 49B activated

DeepSeek-V4-Flash-Max: 284B total, 13B activated

DeepSeek-V3.2: 671B total, 37B activated

Standout results: V4-Pro-Max scores 90.2% on Apex Shortlist (vs GPT-5.4 at 78.1%), 93.5% on LiveCodeBench (vs Gemini-3.1-Pro at 91.7%), and a Codeforces rating of 3206. It outperforms Gemini-3.1-Pro on MRCR 1M retrieval (83.5% vs 76.3%).


Chapter 7

The Million-Token Horizon

The whole point of DeepSeek-V4's architecture is making million-token context practical. Here we see the payoff: inference cost that scales sub-linearly with context length, and retrieval accuracy that holds up far better than competitors.

Interactive plot: inference FLOPs and KV cache vs context length (default context length: 1024K tokens).

The bottom line: DeepSeek-V4 makes million-token context routinely supportable — not as a stunt, but as a production capability. This unlocks long-horizon agent tasks, massive document analysis, and the next frontier of test-time scaling. The KV cache at 1M tokens is just ~2% of a standard BF16 GQA8 baseline.
