An Interactive Reading of

DeepSeek-V4: Towards
Highly Efficient Million-Token
Context Intelligence

The paper, in plain English

Large language models hit a wall when asked to reason over truly long contexts. Every new token must attend to every token before it, so the total cost of attention, the mechanism the model uses to read and connect information, grows quadratically with context length. DeepSeek-V4 asks a simple question: what if attention could be made nearly free, even at a million tokens?

The answer comes from a hybrid attention design that aggressively compresses old tokens while keeping recent ones sharp. Imagine reading a long novel where you summarize each chapter into a single paragraph once you've moved on, but keep the current chapter word-for-word. Add a new way of connecting layers that keeps signal strength stable through 61 layers, plus a mathematically grounded optimizer that converges faster, and you get models that are both smarter and dramatically cheaper to run.

The headline numbers: at 1M tokens, DeepSeek-V4-Pro needs only 27% of the inference FLOPs and 10% of the KV cache of its predecessor, DeepSeek-V3.2, while matching or beating frontier closed models on reasoning benchmarks. The smaller DeepSeek-V4-Flash goes further, to 10% of the FLOPs and 7% of the KV cache, making million-token context practically free. DeepSeek-V4-Pro-Max is also the first open model to achieve a Codeforces rating of 3206, matching GPT-5.4.

I
Hybrid Attention (CSA + HCA)
Compressed Sparse Attention and Heavily Compressed Attention slash KV cache to ~2% of baseline at 1M tokens, making million-length contexts practical.
II
Manifold-Constrained Hyper-Connections
Residual connections are constrained to doubly stochastic matrices (the Birkhoff polytope), guaranteeing signal stability across 61+ layers.
III
Muon Optimizer
Matrix orthogonalization via Newton-Schulz iterations replaces Adam for most parameters, delivering faster convergence with fewer instabilities.
Chapter 1

The Architecture Blueprint

DeepSeek-V4 inherits the DeepSeekMoE skeleton but outfits it with three new subsystems — mHC, hybrid attention, and the Muon optimizer — that together break the long-context efficiency barrier.

Model configurations (Table from paper)

DeepSeek-V4-Flash: 284B total, 13B activated, 43 layers, 256 routed experts

DeepSeek-V4-Pro: 1.6T total, 49B activated, 61 layers, 384 routed experts

Both activate 6 experts per token via $\sqrt{\text{Softplus}(\cdot)}$ affinity scoring.
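A minimal sketch of top-6 expert selection under the $\sqrt{\text{Softplus}(\cdot)}$ affinity, in plain Python. The router logits, the helper names, and the renormalization of gate weights over the selected experts are illustrative assumptions, not details taken from the paper.

```python
import math

def sqrt_softplus(x: float) -> float:
    """Affinity score sqrt(softplus(x)), where softplus(x) = log(1 + e^x)."""
    # Numerically stable softplus: max(x, 0) + log1p(exp(-|x|)).
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return math.sqrt(softplus)

def route(logits, k=6):
    """Return (expert_index, gate_weight) pairs for the k highest-affinity experts."""
    scores = [sqrt_softplus(x) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Assumption: gate weights are renormalized over the selected experts.
    return [(i, scores[i] / total) for i in top]

# Hypothetical router logits for 256 routed experts (the Flash configuration).
logits = [3.0 * math.sin(0.37 * i) for i in range(256)]
chosen = route(logits)
```

Because the affinity is always positive, the gate weights stay positive without a softmax across all experts. Only the six selected experts run for the token.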


The key insight: only about 3% of parameters activate per token (49B / 1.6T for Pro). This sparsity is what makes a trillion-parameter model run at practical speed: you pay for the experts you use, not the ones sitting on the bench.
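The activation ratio can be checked directly from the parameter counts listed in the configuration table. This is plain arithmetic on those figures. The dictionary layout and function name are only for illustration.

```python
# Parameter counts from the configuration table, in billions.
configs = {
    "DeepSeek-V4-Flash": {"total_b": 284, "activated_b": 13},
    "DeepSeek-V4-Pro": {"total_b": 1600, "activated_b": 49},
}

def activated_fraction(total_b: float, activated_b: float) -> float:
    """Share of all parameters that participate in one token's forward pass."""
    return activated_b / total_b

fractions = {name: activated_fraction(**cfg) for name, cfg in configs.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

This gives roughly 3.1% for Pro and 4.6% for Flash.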
Chapter 2

Manifold-Constrained Hyper-Connections

Standard residual connections let each Transformer layer add its output to the input. mHC upgrades this: it expands the residual stream, dynamically mixes contributions, and constrains the mixing matrix to stay numerically stable.

Core update rule (Equation 1)
$$X_{l+1} = B_l\, X_l + C_l\, F_l(A_l\, X_l)$$

$X_l \in \mathbb{R}^{n_{hc} \times d}$ is the expanded residual state, $A_l \in \mathbb{R}^{1 \times n_{hc}}$ is the input mapping, $B_l \in \mathbb{R}^{n_{hc} \times n_{hc}}$ is the residual mapping (constrained to doubly stochastic matrices), $C_l \in \mathbb{R}^{n_{hc} \times 1}$ is the output mapping, $F_l$ is the layer function (MoE or attention).
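To make the shapes in Equation 1 concrete, here is a small sketch of one update step using plain Python lists. The toy values ($n_{hc} = 2$, $d = 3$), the fixed mappings, and the doubling function standing in for $F_l$ are assumptions for illustration. In the model the mappings are produced dynamically.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mhc_step(X, A, B, C, layer_fn):
    """One update X_{l+1} = B X_l + C F(A X_l).

    X: n_hc x d residual state, A: 1 x n_hc, B: n_hc x n_hc, C: n_hc x 1.
    layer_fn maps a 1 x d input to a 1 x d output.
    """
    layer_in = matmul(A, X)          # 1 x d: mix the streams into one layer input
    layer_out = layer_fn(layer_in)   # 1 x d: stand-in for attention or MoE
    return add(matmul(B, X), matmul(C, layer_out))  # n_hc x d

# Toy values (assumptions): two residual streams of width three.
X = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
A = [[0.5, 0.5]]
B = [[0.7, 0.3], [0.3, 0.7]]         # rows and columns each sum to 1
C = [[1.0], [1.0]]

def double(h):
    """Placeholder for the layer function F_l."""
    return [[2.0 * v for v in row] for row in h]

X_next = mhc_step(X, A, B, C, double)
```

The residual state keeps its $n_{hc} \times d$ shape from layer to layer, while the layer itself only ever sees a single $1 \times d$ input.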

Doubly stochastic constraint (Equation 2)
$$B_l \in \mathcal{M} \coloneqq \{M \in \mathbb{R}^{n \times n} \mid M\mathbf{1}_n = \mathbf{1}_n,\; \mathbf{1}_n^T M = \mathbf{1}_n^T,\; M \geq 0\}$$

Chart: signal magnitude through layers, constrained (green) vs unconstrained (red), with readouts for both curves at layer 61.
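Why this matters: every matrix in the Birkhoff polytope satisfies $\|B_l\|_2 \leq 1$, so the residual mapping is non-expansive. The set is also closed under multiplication, so the product of the residual mappings across any number of layers is itself doubly stochastic and non-expansive, which keeps the signal bounded at any depth. The Sinkhorn-Knopp algorithm projects onto this manifold in $t_{max} = 20$ iterations.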
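The sketch below illustrates both claims with a basic Sinkhorn-Knopp loop: alternating row and column normalization drives a positive matrix toward the doubly stochastic set, and the product of two such matrices stays in the set. Exponentiating raw logits to get a positive starting matrix is an assumption here, as are the toy inputs.

```python
import math

def sinkhorn_knopp(logits, t_max=20):
    """Alternately normalize rows and columns of exp(logits)."""
    m = [[math.exp(v) for v in row] for row in logits]  # strictly positive start
    n = len(m)
    for _ in range(t_max):
        for row in m:                       # scale each row to sum to 1
            s = sum(row)
            row[:] = [v / s for v in row]
        for j in range(n):                  # scale each column to sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Toy unconstrained parameters (assumptions for illustration).
B1 = sinkhorn_knopp([[0.2, -1.0, 0.5], [1.0, 0.0, -0.7], [-0.4, 0.9, 0.1]])
B2 = sinkhorn_knopp([[1.0, 0.0, 0.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.5]])
P = matmul(B1, B2)

row_err = max(abs(sum(row) - 1.0) for row in P)
col_err = max(abs(sum(col) - 1.0) for col in zip(*P))
```

Both `row_err` and `col_err` come out close to zero, so the product `P` is again (approximately) doubly stochastic. A finite number of iterations gives an approximation rather than an exact projection.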
Chapter 3

Hybrid Attention: CSA & HCA

The dominant cost of long-context inference is attention — quadratic in sequence length. DeepSeek-V4 attacks this with two complementary compression strategies: CSA for selective retrieval and HCA for extreme summarization.

CSA compressed KV entry (Equation 12)
$$C_i^{\text{Comp}} = \sum_{j=mi}^{m(i+1)-1} S_j^a \odot C_j^a + \sum_{j=m(i-1)}^{mi-1} S_j^b \odot C_j^b$$

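Each compressed entry merges $2m$ KV entries, the current block of $m$ and the previous block of $m$, using learned, softmax-normalized weights $S$. Because consecutive windows overlap by $m$ entries, one compressed entry is emitted per $m$ original entries, so the compressed sequence is $1/m$ times the original length.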
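A rough sketch of Equation 12 in plain Python. It assumes the softmax is taken per channel over the candidate entries of each window, and that the first window, which has no previous block, simply normalizes over the entries it does have. Both are assumptions, as are the function and argument names.

```python
import math

def compress_kv(Ca, Cb, Za, Zb, m):
    """Compress n KV entries into n // m entries.

    Ca, Cb: the two KV streams, n vectors of width c each.
    Za, Zb: unnormalized weights with the same shapes as Ca and Cb.
    """
    n, c = len(Ca), len(Ca[0])
    out = []
    for i in range(n // m):
        # Current block from stream a, previous block from stream b.
        pairs = [(Za[j], Ca[j]) for j in range(m * i, m * (i + 1))]
        pairs += [(Zb[j], Cb[j]) for j in range(max(m * (i - 1), 0), m * i)]
        entry = []
        for ch in range(c):
            # Softmax over the (up to) 2m candidates, separately per channel.
            peak = max(z[ch] for z, _ in pairs)
            w = [math.exp(z[ch] - peak) for z, _ in pairs]
            total = sum(w)
            entry.append(sum(wi / total * v[ch] for wi, (_, v) in zip(w, pairs)))
        out.append(entry)
    return out

# Toy input: n = 8 entries of width 2, compression factor m = 4.
Ca = [[float(j), -float(j)] for j in range(8)]
Cb = [[10.0 * j, 0.0] for j in range(8)]
Za = [[0.0, 0.0] for _ in range(8)]   # equal weights, so softmax is a plain mean
Zb = [[0.0, 0.0] for _ in range(8)]
comp = compress_kv(Ca, Cb, Za, Zb, m=4)
```

Eight input entries become two compressed entries, and the second one draws on both its own block and the block before it.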

Lightning Indexer scoring (Equation 16)
$$I_{t,s} = \sum_{h=1}^{n_h^I} w_{t,h}^I \cdot \text{ReLU}\!\left(q_{t,h}^I \cdot K_s^{I,\text{Comp}}\right)$$
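Equation 16 for a single query token can be sketched as follows. Using the scores to keep only the top-$k$ compressed entries is an assumption about how the index is consumed. The toy vectors and names are illustrative.

```python
def indexer_scores(q, w, K):
    """Compute I_{t,s} for one query token t against every compressed key s.

    q: one query vector per indexer head, w: one weight per head,
    K: compressed indexer keys, one vector per compressed entry.
    """
    scores = []
    for k_s in K:
        total = 0.0
        for q_h, w_h in zip(q, w):
            dot = sum(a * b for a, b in zip(q_h, k_s))
            total += w_h * max(dot, 0.0)   # ReLU, then weight by head
        scores.append(total)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring entries, returned in position order."""
    best = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)[:k]
    return sorted(best)

# Toy example: two indexer heads, four compressed entries.
q = [[1.0, 0.0], [0.0, 1.0]]
w = [0.5, 2.0]
K = [[1.0, 1.0], [-1.0, 2.0], [3.0, -1.0], [0.0, 0.0]]
scores = indexer_scores(q, w, K)
selected = top_k(scores, k=2)
```

The ReLU discards negative matches per head before the heads are combined, so a strongly negative head cannot cancel a positive one.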

Chart: KV cache and inference FLOPs relative to baseline. Readouts at 1M tokens: KV cache at about 2% of the baseline, inference FLOPs at 27% of the predecessor's.
The efficiency breakthrough: compared to a standard BF16 GQA8 baseline, DeepSeek-V4's KV cache at 1M tokens is roughly 2% of the baseline size. Even against the already-efficient DeepSeek-V3.2, V4-Pro uses only 10% of the KV cache. This is what makes million-token context economically viable.
Chapter 4

The Muon Optimizer

Training a trillion-parameter model efficiently requires rethinking the optimizer itself. For most parameters, DeepSeek-V4 replaces Adam with Muon, a momentum-based optimizer that orthogonalizes the update matrix via Newton-Schulz iterations.

Newton-Schulz iteration (Equation 28)
$$M_k = a\,M_{k-1} + b\,(M_{k-1}M_{k-1}^T)\,M_{k-1} + c\,(M_{k-1}M_{k-1}^T)^2\,M_{k-1}$$

Stage 1 (steps 1-8): $(a,b,c) = (3.4445, -4.7750, 2.0315)$ — rapid convergence.
Stage 2 (steps 9-10): $(a,b,c) = (2, -1.5, 0.5)$ — precision stabilization.

Full Muon update (Algorithm 1)
$$M_t = \mu\, M_{t-1} + G_t, \quad O_t' = \text{HybridNewtonSchulz}(\mu\, M_t + G_t)$$
$$O_t = O_t' \cdot \sqrt{\max(n,m)} \cdot \gamma, \quad W_t = W_{t-1}(1 - \eta\lambda) - \eta\, O_t$$
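The two-stage iteration and the update rule can be sketched together in plain Python on a tiny matrix. Scaling the input by its Frobenius norm before iterating, and the values chosen for $\mu$, $\gamma$, $\eta$ and $\lambda$, are assumptions for illustration. The source gives only the coefficients and the update equations.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def combine(terms):
    """Weighted sum of equally shaped matrices, given (coeff, matrix) pairs."""
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(c * mat[i][j] for c, mat in terms) for j in range(cols)]
            for i in range(rows)]

STAGE1 = (3.4445, -4.7750, 2.0315)   # steps 1-8
STAGE2 = (2.0, -1.5, 0.5)            # steps 9-10

def hybrid_newton_schulz(M):
    # Assumption: scale by the Frobenius norm so all singular values are <= 1.
    norm = math.sqrt(sum(v * v for row in M for v in row)) + 1e-12
    X = [[v / norm for v in row] for row in M]
    for a, b, c in [STAGE1] * 8 + [STAGE2] * 2:
        G = matmul(X, transpose(X))   # X X^T
        GX = matmul(G, X)             # (X X^T) X
        GGX = matmul(G, GX)           # (X X^T)^2 X
        X = combine([(a, X), (b, GX), (c, GGX)])
    return X

def muon_step(W, M_prev, G, lr, mu=0.95, weight_decay=0.0, gamma=0.2):
    """One Muon update. Hyperparameter defaults are placeholders."""
    M = combine([(mu, M_prev), (1.0, G)])                    # momentum buffer
    O = hybrid_newton_schulz(combine([(mu, M), (1.0, G)]))   # Nesterov-style input
    scale = math.sqrt(max(len(W), len(W[0]))) * gamma
    W_new = combine([(1.0 - lr * weight_decay, W), (-lr * scale, O)])
    return W_new, M

# Toy 2 x 3 gradient, zero-initialized weights and momentum.
G = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
zeros = [[0.0] * 3 for _ in range(2)]
O = hybrid_newton_schulz(G)
OOt = matmul(O, transpose(O))        # close to the 2 x 2 identity
W_new, M_new = muon_step(zeros, zeros, G, lr=0.1)
```

For this wide matrix, `OOt` lands near the identity: the first eight steps pull every singular value into a band around 1, and the last two steps tighten it.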

Simulated convergence of Muon vs Adam on a quadratic loss

Practical impact: Muon converges faster than Adam on the same data, meaning DeepSeek-V4 reaches its final performance with fewer training tokens. The hybrid Newton-Schulz schedule (8 fast + 2 precise) ensures updates are well-conditioned throughout training.
Chapter 5

Training at Scale

Training 1.6T parameters on 33 trillion tokens demands infrastructure breakthroughs — from fused MoE kernels to FP4 quantization-aware training to deterministic reproducibility.

Communication-Computation Balance Condition
$$\frac{C}{B} \leq 2d = 6144 \;\text{FLOPs/Byte}$$

$C$ = peak compute throughput, $B$ = interconnect bandwidth, $d$ = hidden dimension (3072 for Pro). Each GB/s of bandwidth can hide communication for 6.1 TFLOP/s of compute.
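The unit conversion behind the 6.1 TFLOP/s figure is easy to verify. The helper names below and the example compute and bandwidth numbers are hypothetical and do not describe any particular hardware.

```python
def max_hidden_compute_tflops(bandwidth_gb_s: float, d: int = 3072) -> float:
    """Largest peak compute (TFLOP/s) that satisfies C / B <= 2d."""
    flops_per_byte = 2 * d                       # 6144 for d = 3072
    return flops_per_byte * bandwidth_gb_s * 1e9 / 1e12

def is_balanced(compute_tflops: float, bandwidth_gb_s: float, d: int = 3072) -> bool:
    """True when communication can be fully hidden behind computation."""
    ratio = compute_tflops * 1e12 / (bandwidth_gb_s * 1e9)   # FLOPs per byte
    return ratio <= 2 * d

per_gb_s = max_hidden_compute_tflops(1.0)        # TFLOP/s covered by 1 GB/s
ok = is_balanced(1000.0, 400.0)                  # hypothetical: 2500 FLOPs/Byte
too_fast = is_balanced(3000.0, 400.0)            # hypothetical: 7500 FLOPs/Byte
```

With $d = 3072$, one GB/s of bandwidth covers about 6.144 TFLOP/s of compute. A configuration whose ratio exceeds 6144 FLOPs/Byte becomes communication-bound.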

Chart: expert parallelism speedup by wave scheduling strategy. Readouts: speedup over naive 1.92x, memory savings (FP4) ~50%.
Infrastructure highlights: the fused mega-kernel (MegaMoE) achieves a 1.50-1.73x speedup for general inference and up to 1.96x for RL rollouts. With FP4 quantization-aware training of the MoE expert weights, dequantizing from FP4 to FP8 is lossless, because FP8's extra exponent bits absorb the scale variations.
Chapter 6

The Benchmark Arena

DeepSeek-V4-Pro-Max redefines the open-source frontier: state-of-the-art on knowledge, competitive with closed models on reasoning, and the first open model to match GPT-5.4 on competitive programming.

DeepSeek-V4-Pro-Max: 1.6T total, 49B activated

DeepSeek-V4-Flash-Max: 284B total, 13B activated

DeepSeek-V3.2: 671B total, 37B activated
Standout results: V4-Pro-Max scores 90.2% on Apex Shortlist (vs GPT-5.4 at 78.1%), 93.5% on LiveCodeBench (vs Gemini-3.1-Pro at 91.7%), and a Codeforces rating of 3206. It outperforms Gemini-3.1-Pro on MRCR 1M retrieval (83.5% vs 76.3%).
Chapter 7

The Million-Token Horizon

The whole point of DeepSeek-V4's architecture is making million-token context practical. Here we see the payoff: inference cost that scales sub-linearly with context length, and retrieval accuracy that holds up far better than competitors.

Inference FLOPs and KV cache vs context length

The bottom line: DeepSeek-V4 makes million-token context routinely supportable — not as a stunt, but as a production capability. This unlocks long-horizon agent tasks, massive document analysis, and the next frontier of test-time scaling. The KV cache at 1M tokens is just ~2% of a standard BF16 GQA8 baseline.