An Interactive Reading of

Dynamic Chunking for
Diffusion Language Models

The paper, in plain English

Block diffusion language models, which generate text by iteratively denoising masked tokens one block at a time, have a structural problem: they split sentences into fixed-size blocks of, say, 8 tokens, regardless of what those tokens actually say. A verb and its direct object can land in different blocks. A pronoun and its antecedent get separated. The model is forced to denoise semantically unrelated neighbours together while breaking apart the tokens that actually depend on each other.

DCDM fixes this by letting the model learn its own block boundaries. Instead of position-based splitting, a differentiable "chunking attention" layer projects every token onto K learned subspaces — think of each subspace as a theme — and groups tokens by which theme fits best. The result is a partition where tokens that belong together get denoised together, even if they are far apart in the sequence. This is strictly more general than fixed blocks: if the model learns subspaces that happen to align with contiguous positions, you recover the old block design exactly.

At both 0.5B and 1.5B parameters, DCDM beats every diffusion baseline on 9 downstream benchmarks — general reasoning, math, and code. It reaches BDLM's final accuracy faster, and the gap holds across every scale tested. The key number: +9.6 points on HellaSwag over unstructured diffusion at the 0.5B scale, and the advantage is visible early in training and never closes.

I. Chunking Attention
A differentiable routing layer that assigns tokens to K learned subspaces, producing semantic chunks instead of fixed positional blocks. The bilinear affinity gate activates only when both tokens align with the same subspace.

II. End-to-End Training
The chunking geometry is shaped by the diffusion objective itself, not a detached auxiliary module. Gradients flow through the soft aggregation path back to the cluster centroids.

III. Load Balancing
A two-timescale defense against cluster collapse: a per-sequence Gumbel-softmax auxiliary loss prevents starvation within each example, while a running bias correction stabilises load across batches.

Chapter 1

The Block Structure Bottleneck

Block diffusion language models split every sentence into fixed-size chunks of B tokens. The splits have nothing to do with what the words mean.

Block Diffusion Language Models (BDLMs) combine autoregressive and diffusion modelling by partitioning a sequence of $L$ tokens into $K$ contiguous blocks of fixed length $B$ (with $L = K \cdot B$). The likelihood factorizes as:

$$\log p_\theta(\mathbf{x}) = \sum_{b=1}^{K} \log p_\theta(\mathbf{x}_b \mid \mathbf{x}_{<b})$$

By tuning $B$, BDLMs interpolate between pure diffusion ($B = L$, $K = 1$) and autoregressive models ($B = 1$, $K = L$). But the partition is always positional — independent of content.
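The positional partition is simple enough to write down directly. The sketch below is illustrative only: `positional_blocks` is a hypothetical helper name, positions are 0-based, and the sequence length is assumed to be a multiple of $B$. It shows the two extremes mentioned above, $B = L$ giving one block and $B = 1$ giving $L$ blocks.

```python
def positional_blocks(L, B):
    """Split positions 0..L-1 into contiguous blocks of fixed length B."""
    if L % B != 0:
        raise ValueError("this sketch assumes L is a multiple of B")
    return [list(range(start, start + B)) for start in range(0, L, B)]


L = 12
for B in (1, 4, 12):
    blocks = positional_blocks(L, B)
    # K = L / B: B=1 gives 12 blocks, B=4 gives 3, B=12 gives 1.
    print(f"B={B}  K={len(blocks)}  blocks={blocks}")
```

Nothing in this function looks at the tokens themselves, which is exactly the limitation the chapter describes.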

Positional vs. Semantic Chunking

[Interactive figure: a slider sets the block size $B$. The display contrasts the positional split with the semantic split DCDM aims for, and reports the number of blocks $K$ and the number of semantic splits broken.]
Why this matters
The rigid positional partition wastes structure already present in the sequence. An entity may govern distant mentions, a mathematical derivation may depend on earlier premises, and a code token may be constrained by scope or syntax several lines away. Positional partitioning can separate tokens that should be denoised jointly, while placing weakly related neighbouring tokens in the same diffusion process.
Next: how subspace clustering fixes this
Chapter 2

Clustering in Subspace

A single point centroid in high-dimensional space is a fragile way to group tokens. DCDM promotes each cluster to a low-dimensional subspace instead.

Each cluster is parameterized by a learnable matrix $\mu_k \in \mathbb{R}^{d \times h}$, where $d$ is the model dimension and $h$ is the subspace dimension. A token $\mathbf{x}_\ell$ is projected onto the $k$-th subspace:

$$\mathbf{p}_{k,\ell} = \mu_k^\top \mathbf{x}_\ell \in \mathbb{R}^h$$

The column span $\mathcal{S}_k := \mathrm{col}(\mu_k) \subset \mathbb{R}^d$ is an $h$-dimensional subspace. Tokens are assigned to clusters by their alignment $\|\mathbf{p}_{k,\ell}\|$ with each subspace:

$$c_\ell = \arg\max_k \|\mu_k^\top \mathbf{x}_\ell\|$$
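As a rough illustration of the assignment rule, the sketch below projects a few toy tokens onto two hand-picked subspaces and picks the one with the largest projection norm. The matrices, the sizes ($d = 4$, $h = 2$, $K = 2$), and the helper names `project` and `assign` are assumptions made up for this example, not values from the paper; cluster indices are 0-based.

```python
import math


def project(mu_k, x):
    """Return p = mu_k^T x, with mu_k stored as d rows of length h."""
    d, h = len(mu_k), len(mu_k[0])
    return [sum(mu_k[i][j] * x[i] for i in range(d)) for j in range(h)]


def assign(mus, x):
    """Hard assignment c = argmax_k ||mu_k^T x|| (Euclidean norm)."""
    scores = [math.sqrt(sum(p * p for p in project(mu_k, x))) for mu_k in mus]
    return max(range(len(mus)), key=lambda k: scores[k])


# Hypothetical subspaces: cluster 0 spans the first two axes, cluster 1 the last two.
mus = [
    [[1, 0], [0, 1], [0, 0], [0, 0]],
    [[0, 0], [0, 0], [1, 0], [0, 1]],
]
tokens = [
    [0.9, 0.3, 0.1, 0.0],
    [0.0, 0.2, 0.5, 0.8],
    [-0.7, 0.6, 0.1, 0.1],
]
print([assign(mus, x) for x in tokens])  # [0, 1, 0]
```

The third token points in a different direction from the first but still lands in cluster 0, because the norm only measures how much of the token lies in the subspace, not where in the subspace it sits.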

Point vs. Subspace Clustering

[Interactive figure: sliders set the subspace dimension $h$ and the number of clusters $K$. The display reports cluster violation and the minimum and maximum cluster share. Low $h$ corresponds to point centroids (fragile), high $h$ to subspace centroids (stable).]
Why this matters
The subspace parameterization is not an incremental improvement that merely settles at a slightly better minimum. The difference is qualitative: point-based clustering actively degenerates even with load balancing, while subspace clustering keeps routing well balanced throughout training. The paper reports a diffusion loss of 2.304 (subspace) vs. 2.544 (point) at 200k steps, a gap that opens within the first 25k steps.
Next: the full chunking attention mechanism
Chapter 3

Chunking Attention

The soft path trains the cluster geometry; the hard path builds the attention mask. Two paths, one set of parameters.

For each cluster $k$, the module computes a pairwise affinity matrix $\mathbf{A}_k$ whose entries are inner products of the projected tokens:

$$[\mathbf{A}_k]_{\ell,m} = \frac{1}{\sqrt{h}} \mathbf{p}_{k,\ell}^\top \mathbf{p}_{k,m} = \frac{1}{\sqrt{h}} \mathbf{x}_\ell^\top \mu_k \mu_k^\top \mathbf{x}_m$$

The factor $1/\sqrt{h}$ is standard dot-product scaling. Sharing $\mu_k$ across query and key sides turns the affinity into a bilinear gate: $[\mathbf{A}_k]_{\ell,m}$ activates only when $\mathbf{x}_\ell$ and $\mathbf{x}_m$ are simultaneously aligned with $\mathcal{S}_k$. The soft output aggregates all $K$ clusters:

$$\mathbf{Y} = \mathbf{W}_O \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \mathbf{T}_k \mathbf{W}_V \mathbf{H}$$

where $\mathbf{T}_k = \mathrm{softmax}(\mathbf{A}_k)$ and $\mathbf{W}_V, \mathbf{W}_O$ are learnable projections. The chunk-causal mask for downstream layers is built from the hard assignments:

$$M_{\ell,m}^{\text{chunk}} = \mathbb{I}[c_m \leq c_\ell]$$
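A small sketch can make the two paths concrete. It computes the affinity $\mathbf{A}_k$ for one toy subspace, turns it into soft weights, and separately builds the chunk-causal mask from hard cluster ids. The toy matrix, the tokens, the cluster ids, and all function names are assumptions for illustration; the softmax is taken row-wise, which is a guess at the intended normalisation, and the value and output projections are left out.

```python
import math


def project(mu_k, x):
    """Return p = mu_k^T x, with mu_k stored as d rows of length h."""
    d, h = len(mu_k), len(mu_k[0])
    return [sum(mu_k[i][j] * x[i] for i in range(d)) for j in range(h)]


def affinity(mu_k, X):
    """[A_k]_{l,m} = (p_{k,l} . p_{k,m}) / sqrt(h)."""
    h = len(mu_k[0])
    P = [project(mu_k, x) for x in X]
    return [[sum(a * b for a, b in zip(pl, pm)) / math.sqrt(h) for pm in P] for pl in P]


def softmax_rows(A):
    """Row-wise softmax (assumed normalisation for T_k)."""
    out = []
    for row in A:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def chunk_causal_mask(c):
    """M[l][m] = 1 if c[m] <= c[l], else 0."""
    return [[1 if c_m <= c_l else 0 for c_m in c] for c_l in c]


mu_0 = [[1, 0], [0, 1], [0, 0], [0, 0]]  # hypothetical subspace: first two axes
X = [[0.9, 0.3, 0.1, 0.0], [0.0, 0.2, 0.5, 0.8], [0.7, 0.6, 0.1, 0.1]]

A_0 = affinity(mu_0, X)   # soft path: tokens 0 and 2 score high together
T_0 = softmax_rows(A_0)
c = [0, 1, 0]             # hard path: assumed cluster ids for the three tokens
print(chunk_causal_mask(c))  # [[1, 0, 1], [1, 1, 1], [1, 0, 1]]
```

Tokens 0 and 2 both lie mostly in the toy subspace, so their mutual affinity is much larger than either one's affinity with token 1. In the mask, the token in chunk 1 sees everything, while the tokens in chunk 0 see only each other.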

Affinity and Mask Visualization

[Interactive figure: a slider sets the number of clusters $K$. The display shows the resulting affinity matrix and chunk-causal mask.]
Why this matters
Gradients to $\{\mu_k\}$ flow exclusively through the soft path. The hard routing reads off the same alignment scores but carries no gradient itself. This design ensures the cluster structure is shaped by the diffusion objective end-to-end — not by a detached heuristic.
Next: the full DCDM architecture
Chapter 4

The DCDM Architecture

A chunking stage emits per-token cluster IDs; a denoising stage uses them to factorize the likelihood autoregressively over semantic chunks.

The cluster identifiers define semantic chunks:

$$\mathcal{B}_k = \{\ell \in \{1, \ldots, L\} : c_\ell = k\}, \quad k = 1, \ldots, K$$
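The set definition translates into a simple grouping step. In this sketch the cluster ids are invented for illustration, `chunks_from_ids` is a hypothetical helper, and both positions and cluster ids are 0-based, unlike the 1-based notation in the formula.

```python
def chunks_from_ids(c, K):
    """B_k = {l : c[l] == k}, returned as a list of K position lists."""
    chunks = [[] for _ in range(K)]
    for position, cluster in enumerate(c):
        chunks[cluster].append(position)
    return chunks


c = [0, 2, 0, 1, 1, 2, 0, 1]  # hypothetical cluster ids for 8 tokens
print(chunks_from_ids(c, K=3))  # [[0, 2, 6], [3, 4, 7], [1, 5]]
```

The resulting chunks are non-contiguous and differ in size, which a positional split cannot produce.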

The likelihood factorizes autoregressively over these chunks:

$$p_\theta(\mathbf{x}) = \prod_{k=1}^{K} p_\theta\!\left(\mathbf{x}^{(k)} \mid \mathbf{x}^{(<k)}\right)$$

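Here $\mathbf{x}^{(k)}$ denotes the tokens indexed by $\mathcal{B}_k$. This mirrors BDLM's Eq. 2 but with content-defined chunks. The noise mask prevents information leakage during training. Reading the formula, $\nu_\ell \in \{0, 1\}$ appears to flag whether token $\ell$ is noised ($\nu_\ell = 1$) or clean ($\nu_\ell = 0$), so a query may attend to any clean token and, if it is itself noised, to its own position: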

$$M_{\ell,m}^{\text{noise}} = \mathbb{I}[\nu_m = 0] \lor \mathbb{I}[\nu_\ell = 1 \land \ell = m]$$
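The noise mask can be built entry by entry from the formula. The sketch assumes that $\nu_\ell = 1$ marks a noised token and $\nu_\ell = 0$ a clean one, which is how the formula reads but is not stated explicitly here; `noise_mask` and the example flags are hypothetical.

```python
def noise_mask(nu):
    """M[l][m] = 1 if nu[m] == 0, or if l == m and nu[l] == 1; else 0."""
    L = len(nu)
    return [
        [1 if (nu[m] == 0 or (nu[l] == 1 and l == m)) else 0 for m in range(L)]
        for l in range(L)
    ]


nu = [0, 1, 0, 1]  # assumed flags: positions 1 and 3 are noised
for row in noise_mask(nu):
    print(row)
# [1, 0, 1, 0]
# [1, 1, 1, 0]
# [1, 0, 1, 0]
# [1, 0, 1, 1]
```

Every row allows the clean columns 0 and 2. A noised column is allowed only on the diagonal, so a noised token never becomes visible to other positions.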

Chunk-Causal Information Flow

[Interactive figure: a slider sets the number of clusters $K$. The grid shows which attention entries are allowed (white cells) and which are blocked (dark cells).]
Why this matters
DCDM preserves the full computational structure of block diffusion — KV caching across chunks, parallel sampling within chunks, flexible-length generation — while making the unit of parallel denoising content-adaptive. The chunks may be non-contiguous, variable in size, and sequence-dependent. Positional block diffusion is recovered as a special case when learned chunks happen to coincide with fixed contiguous blocks.
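The special-case claim can be checked on a toy example. The sketch assumes the positional baseline uses a block-causal mask in which each position sees its own block and all earlier blocks; `block_causal_mask` is a hypothetical helper written under that assumption, and positions are 0-based.

```python
def chunk_causal_mask(c):
    """M[l][m] = 1 if c[m] <= c[l], else 0."""
    return [[1 if c_m <= c_l else 0 for c_m in c] for c_l in c]


def block_causal_mask(L, B):
    """Assumed positional baseline: position l sees its own and earlier blocks."""
    return [[1 if m // B <= l // B else 0 for m in range(L)] for l in range(L)]


L, B = 8, 4
contiguous_ids = [pos // B for pos in range(L)]  # [0, 0, 0, 0, 1, 1, 1, 1]
scattered_ids = [0, 1, 0, 1, 0, 1, 0, 1]         # a non-contiguous alternative

print(chunk_causal_mask(contiguous_ids) == block_causal_mask(L, B))  # True
print(chunk_causal_mask(scattered_ids) == block_causal_mask(L, B))   # False
```

When the cluster ids coincide with block indices, the two masks are identical. Any other assignment produces a mask that fixed blocks cannot express.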
Next: keeping clusters balanced
Chapter 5

Keeping Clusters Alive

Without intervention, a few clusters hog all the tokens and the rest starve. Two mechanisms at two timescales prevent this collapse.

The per-sequence load balancing loss uses a Gumbel-softmax straight-through estimator to produce a differentiable hard sample:

$$\mathcal{L}_{\text{chunk}} = -\frac{1}{BK} \sum_{b=1}^{B} \sum_{k=1}^{K} \log(f_{b,k} + \varepsilon)$$

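Here $B$ counts the sequences in the batch (not the block length of Chapter 1), and $f_{b,k}$ is the fraction of tokens in sequence $b$ that the sample routes to cluster $k$. The loss attains its minimum when every sequence distributes its tokens uniformly across the $K$ clusters. The global-batch bias correction adds a non-trainable bias $b_k$ to the hard-assignment step, lowering it for clusters that receive more than their $1/K$ share of the $N$ tokens in the global batch and raising it for those that receive less: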

$$b_k \leftarrow b_k - \eta_b \left(\frac{N_k}{N} - \frac{1}{K}\right)$$
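Both mechanisms are easy to try on toy numbers. The sketch below evaluates the auxiliary loss on a uniform and a skewed routing, then applies one bias update. It takes the routing fractions as given rather than sampling them with a Gumbel-softmax estimator, and the function names, the value of $\varepsilon$, and the step size are assumptions for illustration.

```python
import math


def chunk_loss(f, eps=1e-8):
    """L_chunk = -(1 / (B * K)) * sum_b sum_k log(f[b][k] + eps)."""
    B, K = len(f), len(f[0])
    return -sum(math.log(v + eps) for row in f for v in row) / (B * K)


def update_bias(bias, counts, lr):
    """b_k <- b_k - lr * (N_k / N - 1 / K)."""
    N, K = sum(counts), len(counts)
    return [b - lr * (n / N - 1.0 / K) for b, n in zip(bias, counts)]


uniform = [[0.25, 0.25, 0.25, 0.25]]  # one sequence, evenly routed
skewed = [[0.70, 0.10, 0.10, 0.10]]   # one sequence, one dominant cluster
print(chunk_loss(uniform) < chunk_loss(skewed))  # True

# One update with hypothetical global counts N_k and step size 0.1:
# the overloaded cluster's bias drops, the others rise slightly.
new_bias = update_bias([0.0] * 4, counts=[70, 10, 10, 10], lr=0.1)
print(new_bias)  # roughly [-0.045, 0.015, 0.015, 0.015]
```

The loss penalises the skewed routing because the logarithm punishes small fractions more than it rewards large ones. The bias update sums to zero across clusters, so it shifts load without changing the overall scale of the scores.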

Routing Collapse Simulation

[Interactive figure: a toggle enables load balancing and a slider sets the number of simulated training steps. For $K = 8$ clusters, the display reports the maximum cluster share, the minimum cluster share, and the cluster violation.]
Why this matters
Without load balancing, the point-clustering ablation never manages to push violation below 2.0 during the entire run, and it rises past 125k steps to 3.33, the signature of progressive centroid collapse. With subspace clustering plus load balancing, violation drops to near zero within 10k steps and stays flat. The two mechanisms operate at different timescales: the auxiliary loss acts per forward pass, while the bias correction acts per update interval.
Next: how many chunks should we use?
Chapter 6

How Many Chunks?

Too few clusters under-partition the sequence; too many over-fragment it. The sweet spot is $K = 16$.

Cluster Count Ablation (Table 3)

[Interactive bar chart: data from the paper's 0.5B-scale ablation.] Best $K$: 16, with the best average of 40.58. Worst $K$: 8.
Why this matters
The trend is non-monotonic with a clear interior optimum: average performance peaks at $K = 16$ and degrades at both extremes. This inverted-U shape reflects the two failure modes that bracket the choice of $K$. All configurations tested still improve over the positional-block baseline, which suggests that content-adaptive partitioning beats fixed blocks across the tested range of $K$.
Next: full benchmark results
Chapter 7

Results at Scale

DCDM beats every diffusion baseline at both 0.5B and 1.5B parameters, on 9 downstream benchmarks spanning reasoning, math, and code.

Benchmark Comparison (Table 1)

[Interactive bar chart with a scale toggle; all data from the paper.] For the scale shown, the DCDM average is 33.68 against a BDLM average of 32.92, a DCDM advantage of +0.76.

Training Efficiency (Figure 3b)

DCDM reaches BDLM's final accuracy noticeably earlier. Reconstructed from paper data.

Why this matters
The ordering DCDM > BDLM > MDLM emerges early in training and remains stable. DCDM matches MDLM's final accuracy well before MDLM finishes, and reaches BDLM's final accuracy noticeably earlier — genuinely faster optimization, not convergence to the same point along a different path. Routing tokens into content-defined blocks supplies denoising targets that are coherent at the level of the partition, so gradient signal concentrates on within-block dependencies rather than being diluted across unrelated positions.
Next: what this means for the field
Chapter 8

What This Means

DCDM generalises block diffusion from a positional hack to a learned, content-adaptive partition — and the numbers confirm it works.

Key takeaways

Limitation
DCDM treats $K$ as a fixed hyperparameter shared across all sequences and all training stages. Sequences with little structural variety may be over-partitioned, while topically rich ones may be under-partitioned. Learning $K$ per sequence — or maintaining a distribution over $K$ that can be marginalised at inference — is the natural next step.
The bottom line
Block diffusion works because it groups related tokens together for denoising. But it guesses the groups by position. DCDM learns the groups by content, and the results speak for themselves: consistent gains at every scale, on every task category, with no exotic ingredients.