DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI 2025 · HuggingFace
The paper, in plain English
Large language models hit a wall when asked to reason over truly long contexts. The attention mechanism, the core of how the model reads and connects information, scales quadratically with sequence length, so doubling the context roughly quadruples the attention cost. DeepSeek-V4 asks a simple question: what if we could make attention nearly free, even at a million tokens?
The answer comes from a hybrid attention design that aggressively compresses old tokens while keeping recent ones sharp. Imagine reading a long novel where you summarize each chapter into a single paragraph once you've moved on, but keep the current chapter word-for-word. Add a new way of connecting layers that keeps signal strength stable through 61 layers and a mathematically grounded optimizer that converges faster, and you get models that are both smarter and dramatically cheaper to run.
The headline numbers: at 1M tokens, DeepSeek-V4-Pro needs only 27% of the inference FLOPs and 10% of the KV cache of its predecessor, DeepSeek-V3.2, while matching or beating frontier closed models on reasoning benchmarks. The smaller DeepSeek-V4-Flash goes further, to 10% of the FLOPs and 7% of the KV cache, which makes million-token context far cheaper to serve. V4-Pro-Max is also the first open model to match GPT-5.4 on competitive programming, with a Codeforces rating of 3206.
I. Hybrid Attention (CSA + HCA)
Compressed Sparse Attention and Heavily Compressed Attention slash KV cache to ~2% of baseline at 1M tokens, making million-length contexts practical.
II. Manifold-Constrained Hyper-Connections
Residual connections are constrained to doubly stochastic matrices (the Birkhoff polytope), guaranteeing signal stability across 61+ layers.
III. Muon Optimizer
Matrix orthogonalization via Newton-Schulz iterations replaces Adam for most parameters, delivering faster convergence with fewer instabilities.
Chapter 1
The Architecture Blueprint
DeepSeek-V4 inherits the DeepSeekMoE skeleton but outfits it with three new subsystems — mHC, hybrid attention, and the Muon optimizer — that together break the long-context efficiency barrier.
Both V4 models (Pro and Flash) activate 6 experts per token via $\sqrt{\text{Softplus}(\cdot)}$ affinity scoring.
The key insight: only 3% of parameters activate per token (49B / 1.6T for Pro). This sparsity is what makes a trillion-parameter model run at practical speed — you pay for the experts you use, not the ones sitting on the bench.
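To make the routing step concrete, here is a minimal NumPy sketch of top-6 expert selection with $\sqrt{\text{Softplus}(\cdot)}$ affinities. The expert count (32), the toy width (16), the name `route_tokens`, and the normalisation of gates over the selected experts are all assumptions made for illustration; the text above only fixes the affinity function and the number of active experts.

```python
import numpy as np

def route_tokens(hidden, expert_centroids, top_k=6):
    """Pick top_k experts per token using sqrt(softplus(logit)) affinities."""
    logits = hidden @ expert_centroids.T                  # (tokens, experts)
    affinity = np.sqrt(np.logaddexp(0.0, logits))         # softplus(x) = log(1 + e^x) > 0
    top_idx = np.argsort(-affinity, axis=-1)[:, :top_k]   # indices of the k largest affinities
    top_aff = np.take_along_axis(affinity, top_idx, axis=-1)
    # Assumption: gates are renormalised over the selected experts only.
    gates = top_aff / top_aff.sum(axis=-1, keepdims=True)
    return top_idx, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 16))       # 4 tokens, toy width 16
centroids = rng.standard_normal((32, 16))   # 32 hypothetical experts
top_idx, gates = route_tokens(hidden, centroids)
```

Only the six selected experts run for each token, which is where the small activated-parameter fraction comes from.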
Standard residual connections let each Transformer layer add its output to the input. mHC upgrades this: it expands the residual stream, dynamically mixes contributions, and constrains the mixing matrix to stay numerically stable.
Core update rule (Equation 1)
$$X_{l+1} = B_l\, X_l + C_l\, F_l(A_l\, X_l)$$
$X_l \in \mathbb{R}^{n_{hc} \times d}$ is the expanded residual state,
$A_l \in \mathbb{R}^{1 \times n_{hc}}$ is the input mapping,
$B_l \in \mathbb{R}^{n_{hc} \times n_{hc}}$ is the residual mapping (constrained to doubly stochastic matrices),
$C_l \in \mathbb{R}^{n_{hc} \times 1}$ is the output mapping,
$F_l$ is the layer function (MoE or attention).
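The shapes in the update rule are easier to follow in code. The sketch below applies one step of $X_{l+1} = B_l X_l + C_l F_l(A_l X_l)$ with toy sizes. The values $n_{hc} = 4$ and $d = 8$, the uniform placeholder matrices, and the use of `tanh` in place of $F_l$ are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mhc_layer(X, A, B, C, F):
    """One hyper-connection step: X_next = B X + C F(A X)."""
    layer_in = A @ X               # (1, d): mix the n_hc streams into one layer input
    layer_out = F(layer_in)        # (1, d): attention or MoE block in the real model
    return B @ X + C @ layer_out   # (n_hc, d): remix streams, then write the output back

n_hc, d = 4, 8                            # hypothetical toy sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((n_hc, d))
A = np.full((1, n_hc), 1.0 / n_hc)        # placeholder input mapping
B = np.full((n_hc, n_hc), 1.0 / n_hc)     # uniform matrix: a simple doubly stochastic choice
C = np.ones((n_hc, 1))                    # placeholder output mapping
X_next = mhc_layer(X, A, B, C, np.tanh)   # tanh stands in for F_l
```

Because the columns of a doubly stochastic $B_l$ sum to one, the residual term $B_l X_l$ preserves the sum of the streams, which is one way to see that it cannot amplify the signal on its own.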
Doubly stochastic constraint (Equation 2)
$$B_l \in \mathcal{M} \coloneqq \{M \in \mathbb{R}^{n \times n} \mid M\mathbf{1}_n = \mathbf{1}_n,\; \mathbf{1}_n^T M = \mathbf{1}_n^T,\; M \geq 0\}$$
Figure: signal magnitude through layers, constrained (green) vs unconstrained (red).
Why this matters: the Birkhoff polytope constraint ensures $\|B_l\|_2 \leq 1$ (non-expansive), and the set is closed under multiplication — so stability compounds across layers. The Sinkhorn-Knopp algorithm projects onto this manifold in $t_{max} = 20$ iterations.
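The Sinkhorn-Knopp step can be sketched in a few lines: exponentiate unconstrained parameters to get a positive matrix, then alternately normalise rows and columns. The parametrisation through `exp`, the 4x4 size, and the function name are assumptions for illustration; only the iteration count $t_{max} = 20$ comes from the text.

```python
import numpy as np

def sinkhorn_knopp(logits, t_max=20):
    """Alternately normalise rows and columns of a strictly positive matrix."""
    M = np.exp(logits - logits.max())            # strictly positive entries
    for _ in range(t_max):
        M = M / M.sum(axis=1, keepdims=True)     # make each row sum to 1
        M = M / M.sum(axis=0, keepdims=True)     # make each column sum to 1
    return M

rng = np.random.default_rng(0)
B = sinkhorn_knopp(0.5 * rng.standard_normal((4, 4)))
row_err = np.abs(B.sum(axis=1) - 1).max()   # residual after the final column pass
col_err = np.abs(B.sum(axis=0) - 1).max()
spectral_norm = np.linalg.norm(B, 2)        # largest singular value
```

The non-expansive property follows from a standard bound: $\|B\|_2 \leq \sqrt{\|B\|_1 \, \|B\|_\infty}$, and both the maximum column sum and the maximum row sum of a doubly stochastic matrix equal 1.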
The dominant cost of long-context inference is attention — quadratic in sequence length. DeepSeek-V4 attacks this with two complementary compression strategies: CSA for selective retrieval and HCA for extreme summarization.
Each compressed entry merges $2m$ KV entries using learned weights $S$ that are normalized with a softmax. Because consecutive windows overlap (each window advances by $m$ positions), the compressed sequence is $1/m$ of the original length rather than $1/(2m)$.
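A toy version of this overlapping compression is sketched below. It assumes a window of $2m$ entries that advances by $m$ positions, one scalar score per token standing in for the learned weights $S$, and a shorter first window because no earlier block exists. These layout details are assumptions for illustration and may differ from the paper's exact scheme.

```python
import numpy as np

def compress_kv(kv, scores, m):
    """Merge overlapping windows of 2m entries (stride m) into one entry each."""
    n, d = kv.shape
    assert n % m == 0, "toy version: length must be a multiple of m"
    out = np.empty((n // m, d))
    for i in range(n // m):
        start = max(0, (i - 1) * m)     # previous block (absent for i = 0)
        end = (i + 1) * m               # end of the current block
        s = scores[start:end]
        w = np.exp(s - s.max())
        w /= w.sum()                    # softmax over the window
        out[i] = w @ kv[start:end]      # convex combination of the window's entries
    return out

rng = np.random.default_rng(0)
kv = rng.standard_normal((32, 8))    # 32 tokens, toy head dimension 8
scores = rng.standard_normal(32)     # stand-in for the learned weights S
compressed = compress_kv(kv, scores, m=4)   # 32 entries become 8
```

Every token contributes to two compressed entries, so information near a block boundary is not split between summaries, while the cache still shrinks by a factor of $m$.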
The efficiency breakthrough: compared to a standard BF16 GQA8 baseline, DeepSeek-V4's KV cache at 1M tokens is roughly 2% of the baseline size. Even against the already-efficient DeepSeek-V3.2, V4-Pro uses only 10% of the KV cache. This is what makes million-token context economically viable.
Training a trillion-parameter model efficiently requires rethinking the optimizer itself. DeepSeek-V4 replaces Adam with Muon — a momentum-based optimizer that orthogonalizes the update matrix via Newton-Schulz iterations.
Figure: simulated convergence of Muon vs Adam on a quadratic loss.
Practical impact: Muon converges faster than Adam on the same data, meaning DeepSeek-V4 reaches its final performance with fewer training tokens. The hybrid Newton-Schulz schedule (8 fast + 2 precise) ensures updates are well-conditioned throughout training.
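The orthogonalization step can be illustrated with the classical cubic Newton-Schulz iteration, $X \leftarrow 1.5X - 0.5\,XX^{T}X$, which drives every singular value of a suitably scaled matrix toward 1. This is a swapped-in textbook variant: the coefficients of the hybrid 8 + 2 schedule are not given here, so the sketch does not reproduce them. The learning rate, momentum coefficient, and function names are likewise hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10, eps=1e-7):
    """Approximate the orthogonal polar factor of G with cubic Newton-Schulz steps."""
    X = G / (np.linalg.norm(G) + eps)        # Frobenius scaling puts singular values in (0, 1]
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # work in the wide orientation so X @ X.T is small
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X    # pushes each singular value toward 1
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """Simplified Muon-style update: momentum buffer, then an orthogonalized step."""
    momentum = beta * momentum + grad
    return W - lr * newton_schulz_orthogonalize(momentum), momentum

G = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])              # singular values 2 and 1
Q = newton_schulz_orthogonalize(G)           # both singular values end up near 1
```

The update direction keeps the singular vectors of the momentum matrix but equalizes its singular values, so no single direction dominates the step.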
Training 1.6T parameters on 33 trillion tokens demands infrastructure breakthroughs — from fused MoE kernels to FP4 quantization-aware training to deterministic reproducibility.
$C$ = peak compute throughput, $B$ = interconnect bandwidth, $d$ = hidden dimension (3072 for Pro). Each GB/s of bandwidth can hide communication for 6.1 TFLOP/s of compute.
Figure: expert parallelism speedup by wave scheduling strategy. Readouts shown: speedup over naive 1.92x; memory savings (FP4) ~50%.
Infrastructure highlights: the fused mega-kernel (MegaMoE) achieves 1.50-1.73x speedup for general inference and up to 1.96x for RL rollouts. FP4 quantization-aware training for MoE expert weights is lossless when dequantizing to FP8, because FP8's extra exponent bits absorb the scale variations.
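The lossless FP4-to-FP8 claim can be checked numerically under some assumptions that the text does not spell out: FP4 is taken to be the E2M1 format, FP8 the E4M3 format (maximum 448, 3 mantissa bits, subnormals in steps of $2^{-9}$), and block scales are taken to be powers of two. The helper names below are hypothetical.

```python
import math

FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def is_e4m3_representable(x):
    """True if |x| is exactly representable in FP8 E4M3 (assumed format)."""
    x = abs(x)
    if x == 0.0:
        return True
    if x > 448.0:                       # largest finite E4M3 value
        return False
    mant, exp = math.frexp(x)           # x = mant * 2**exp with 0.5 <= mant < 1
    exponent = exp - 1                  # rewrite as (2 * mant) * 2**exponent
    if exponent < -6:                   # subnormal range: multiples of 2**-9
        return (x / 2.0 ** -9).is_integer()
    return (mant * 16).is_integer()     # 2 * mant must be a multiple of 1/8

def lossless_scales(scale_exponents):
    """Exponents s such that every FP4 value times 2**s is exact in E4M3."""
    return [s for s in scale_exponents
            if all(is_e4m3_representable(v * 2.0 ** s) for v in FP4_E2M1_VALUES)]

ok = lossless_scales(range(-12, 12))
```

Under these assumptions every FP4 value survives unchanged for scale exponents from -8 through 6, a window of 15 consecutive powers of two that FP8's wider exponent range provides.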
DeepSeek-V4-Pro-Max redefines the open-source frontier: state-of-the-art on knowledge, competitive with closed models on reasoning, and the first open model to match GPT-5.4 on competitive programming.
DeepSeek-V4-Pro-Max: 1.6T total / 49B activated
DeepSeek-V4-Flash-Max: 284B total / 13B activated
DeepSeek-V3.2: 671B total / 37B activated
Standout results: V4-Pro-Max scores 90.2% on Apex Shortlist (vs GPT-5.4 at 78.1%), 93.5% on LiveCodeBench (vs Gemini-3.1-Pro at 91.7%), and a Codeforces rating of 3206. It outperforms Gemini-3.1-Pro on MRCR 1M retrieval (83.5% vs 76.3%).
The whole point of DeepSeek-V4's architecture is making million-token context practical. Here we see the payoff: inference cost that scales sub-linearly with context length, and retrieval accuracy that holds up far better than competitors.
Figure: inference FLOPs and KV cache vs context length.
The bottom line: DeepSeek-V4 makes million-token context routinely supportable — not as a stunt, but as a production capability. This unlocks long-horizon agent tasks, massive document analysis, and the next frontier of test-time scaling. The KV cache at 1M tokens is just ~2% of a standard BF16 GQA8 baseline.