An Interactive Reading of

Dynamic Chunking for
Diffusion Language Models

Yichen Zhu, Xiaoming Shi, Peng Zhao, Weiyu Chen, Debing Zhang, James Kwok
HKUST · Xiaohongshu · Alibaba · CityUHK · May 2026 · arXiv:2605.15676

The paper, in plain English

Modern language models that generate text via diffusion — iteratively denoising a sequence of masked tokens — have a structural problem: they split sentences into fixed-size blocks of 8 tokens, regardless of what those tokens actually say. A verb and its direct object can land in different blocks. A pronoun and its antecedent get separated. The model is forced to denoise semantically unrelated neighbours together while breaking apart the tokens that actually depend on each other.

DCDM fixes this by letting the model learn its own block boundaries. Instead of position-based splitting, a differentiable "chunking attention" layer projects every token onto K learned subspaces — think of each subspace as a theme — and groups tokens by which theme fits best. The result is a partition where tokens that belong together get denoised together, even if they are far apart in the sequence. This is strictly more general than fixed blocks: if the model learns subspaces that happen to align with contiguous positions, you recover the old block design exactly.

At both 0.5B and 1.5B parameters, DCDM beats every diffusion baseline on 9 downstream benchmarks — general reasoning, math, and code. It reaches BDLM's final accuracy faster, and the gap holds across every scale tested. The key number: +9.6 points on HellaSwag over unstructured diffusion at the 0.5B scale, and the advantage is visible early in training and never closes.

I

Chunking Attention

A differentiable routing layer that assigns tokens to K learned subspaces, producing semantic chunks instead of fixed positional blocks. The bilinear affinity gate activates only when both tokens align with the same subspace.

II

End-to-End Training

The chunking geometry is shaped by the diffusion objective itself, not a detached auxiliary module. Gradients flow through the soft aggregation path back to the cluster centroids.

III

Load Balancing

A two-timescale defense against cluster collapse: a per-sequence Gumbel-softmax auxiliary loss prevents starvation within each example, while a running bias correction stabilises load across batches.

Chapter 1

The Block Structure Bottleneck

Block diffusion language models split every sentence into fixed-size chunks of B tokens. The splits have nothing to do with what the words mean.

In plain English

Imagine typesetting a novel by feeding exactly 40 characters at a time to the printer, line breaks be damned. A chapter heading might get severed mid-word, while the climax of a sentence bleeds into the next page. Block diffusion does exactly this to language: it partitions a sequence into contiguous blocks of 8 tokens by position, not meaning.

The consequence is that semantically related tokens — a subject and its verb, a function name and its arguments — routinely end up in different blocks and must be denoised independently. Meanwhile, unrelated neighbours that happen to sit next to each other are forced through the same denoising pass. The model inherits the block-autoregressive factorization but applies it at a granularity not matched to the semantic structure of the sequence.

Drag the block-size slider below and watch how the same sentence gets carved up. Small blocks fragment every phrase; large ones lump everything together. Neither extreme captures the sentence's natural structure.

Block Diffusion Language Models (BDLMs) combine autoregressive and diffusion modelling by partitioning a sequence of $L$ tokens into $K$ contiguous blocks of fixed length $B$ (with $L = K \cdot B$). The likelihood factorizes as:

$$\log p_\theta(\mathbf{x}) = \sum_{b=1}^{K} \log p_\theta(\mathbf{x}_b \mid \mathbf{x}_{<b})$$

By tuning $B$, BDLMs interpolate between pure diffusion ($B = L$, $K = 1$) and autoregressive models ($B = 1$, $K = L$). But the partition is always positional — independent of content.
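To make the positional partition concrete, here is a minimal sketch of fixed-size blocking. The function name `positional_blocks` and the example sentence are illustrative assumptions, not taken from the paper. It only shows that the split depends on position and that the two extremes of $B$ recover a single block or one block per token.

```python
def positional_blocks(tokens, block_size):
    """Split tokens into contiguous blocks of `block_size`, by position only."""
    if block_size <= 0 or len(tokens) % block_size != 0:
        raise ValueError("this sketch assumes L = K * B exactly")
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


# Hypothetical 8-token sentence, chosen only for illustration.
tokens = ["the", "cat", "that", "I", "fed", "sat", "down", "."]

blocks_b4 = positional_blocks(tokens, 4)              # K = 2
blocks_diffusion = positional_blocks(tokens, 8)       # B = L, so K = 1
blocks_autoregressive = positional_blocks(tokens, 1)  # B = 1, so K = L

for block in blocks_b4:
    print(block)
```

With $B = 4$ the subject "cat" and its verb "sat" fall into different blocks, even though nothing about the content asked for that split.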

Positional vs. Semantic Chunking

Drag the slider to change block size. Watch how positional splits ignore sentence structure.

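Interactive panel: a block size (B) slider; the same sentence shown under a positional split and under a semantic split (what DCDM aims for); readouts for the number of blocks (K) and the number of semantic splits broken.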

Why this matters

The rigid positional partition wastes structure already present in the sequence. An entity may govern distant mentions, a mathematical derivation may depend on earlier premises, and a code token may be constrained by scope or syntax several lines away. Positional partitioning can separate tokens that should be denoised jointly, while placing weakly related neighbouring tokens in the same diffusion process.


Chapter 2

Clustering in Subspace

A single point centroid in high-dimensional space is a fragile way to group tokens. DCDM promotes each cluster to a low-dimensional subspace instead.

In plain English

Think of a restaurant that groups diners by their distance to a single landmark — "closer to the fountain" versus "closer to the bar." If two foodies sit near the bar but love different cuisines, that single reference point can't distinguish them. Now imagine each table defines its own menu axis: one table captures "spicy preference," another "wine preference." A diner is seated at the table whose menu best matches their taste profile — even if they sit far from that table physically.

DCDM does exactly this. Each cluster is defined not by a single point in the token embedding space, but by a subspace — a set of learned directions. A token is assigned to whichever subspace it aligns with best. This is far more stable in the high-dimensional spaces (1024 or 2048 dimensions) used by modern language models.

Drag the subspace dimension slider in the simulation below and watch how cluster quality changes. At $h = 1$ (point centroid), clusters collapse. At higher dimensions, they spread out evenly.

Each cluster is parameterized by a learnable matrix $\mu_k \in \mathbb{R}^{d \times h}$, where $d$ is the model dimension and $h$ is the subspace dimension. A token $\mathbf{x}_\ell$ is projected onto the $k$-th subspace:

$$\mathbf{p}_{k,\ell} = \mu_k^\top \mathbf{x}_\ell \in \mathbb{R}^h$$

The column span $\mathcal{S}_k := \mathrm{col}(\mu_k) \subset \mathbb{R}^d$ is an $h$-dimensional subspace. Tokens are assigned to clusters by their alignment $\|\mathbf{p}_{k,\ell}\|$ with each subspace:

$$c_\ell = \arg\max_k \|\mu_k^\top \mathbf{x}_\ell\|$$
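The projection and assignment rule above can be sketched in a few lines. This is a toy illustration under stated assumptions: the dimensions ($d = 4$, $h = 2$, $K = 2$), the hand-picked matrices, and the helper names `project` and `assign` are all hypothetical, and real $\mu_k$ would be learned rather than axis-aligned.

```python
import math


def project(mu_k, x):
    """Compute p = mu_k^T x, with mu_k a d x h matrix stored as d rows of length h."""
    h = len(mu_k[0])
    return [sum(mu_k[i][j] * x[i] for i in range(len(x))) for j in range(h)]


def assign(mus, x):
    """Return (argmax_k ||mu_k^T x||, list of per-cluster alignment norms)."""
    norms = [math.sqrt(sum(p * p for p in project(mu_k, x))) for mu_k in mus]
    best = max(range(len(mus)), key=norms.__getitem__)
    return best, norms


# Assumed toy subspaces: cluster 0 spans coordinates 0-1, cluster 1 spans 2-3.
mu_0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
mu_1 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mus = [mu_0, mu_1]

x_a = [3.0, 4.0, 0.0, 1.0]  # mostly lies in the first subspace
x_b = [0.5, 0.0, 2.0, 2.0]  # mostly lies in the second subspace

c_a, norms_a = assign(mus, x_a)
c_b, norms_b = assign(mus, x_b)
print(c_a, norms_a)
print(c_b, norms_b)
```

The assignment depends on how much of a token lies inside each subspace, not on its distance to any single point.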

Point vs. Subspace Clustering

Drag subspace dimension $h$ to see how cluster quality changes. Low $h$ = point centroids (fragile). High $h$ = subspace centroids (stable).

Interactive panel: sliders for subspace dimension (h) and number of clusters (K); readouts for cluster violation, min cluster share, and max cluster share.

Why this matters

The gain from the subspace parameterization is not a matter of degree, as it would be if a slightly better variant simply settled at a marginally better minimum. It is qualitative: point-based clustering actively degenerates even with load balancing, while subspace clustering keeps routing well balanced throughout training. The paper reports a diffusion loss of 2.304 (subspace) vs. 2.544 (point) at 200k steps, a gap that opens within the first 25k steps.


Chapter 3

Chunking Attention

The soft path trains the cluster geometry; the hard path builds the attention mask. Two paths, one set of parameters.

In plain English

Think of Chunking Attention as a two-lane highway. Lane one (the soft path) is paved and carries gradients smoothly back to the cluster centroids — it uses real-valued attention weights to mix tokens across clusters, so the model can learn which clusters make sense. Lane two (the hard path) is a dirt road used only for building the attention mask — it snaps each token to a single cluster and builds a hard partition.

The trick is that both lanes read from the same gauges. The soft path computes how aligned each token is with each subspace, and those same alignment scores become the hard cluster assignments. So the mask the denoiser uses is shaped by the same geometry the diffusion objective is optimizing.

In the simulation below, watch the affinity matrix and resulting attention mask as you change the number of clusters. The block-diagonal structure is what enables parallel denoising within each chunk.

For each cluster $k$, the module computes a pairwise affinity matrix $\mathbf{A}_k$ whose entries are inner products of the projected tokens:

$$[\mathbf{A}_k]_{\ell,m} = \frac{1}{\sqrt{h}} \mathbf{p}_{k,\ell}^\top \mathbf{p}_{k,m} = \frac{1}{\sqrt{h}} \mathbf{x}_\ell^\top \mu_k \mu_k^\top \mathbf{x}_m$$

The factor $1/\sqrt{h}$ is standard dot-product scaling. Sharing $\mu_k$ across query and key sides turns the affinity into a bilinear gate: $[\mathbf{A}_k]_{\ell,m}$ activates only when $\mathbf{x}_\ell$ and $\mathbf{x}_m$ are simultaneously aligned with $\mathcal{S}_k$. The soft output aggregates all $K$ clusters:

$$\mathbf{Y} = \mathbf{W}_O \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \mathbf{T}_k \mathbf{W}_V \mathbf{H}$$

where $\mathbf{T}_k = \mathrm{softmax}(\mathbf{A}_k)$, $\mathbf{H}$ stacks the token representations $\mathbf{x}_\ell$, and $\mathbf{W}_V, \mathbf{W}_O$ are learnable projections. The chunk-causal mask for downstream layers is built from the hard assignments:

$$M_{\ell,m}^{\text{chunk}} = \mathbb{I}[c_m \leq c_\ell]$$
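The affinity, its softmax, and the chunk-causal mask can be traced on a tiny example. This is a sketch rather than the paper's implementation: the projections $\mathbf{W}_V$ and $\mathbf{W}_O$ are left out, the softmax is taken row-wise (an assumption, since the text does not state the axis), and the toy vectors, cluster IDs, and function names are hypothetical.

```python
import math


def affinity(mu_k, xs):
    """[A_k]_{l,m} = (1 / sqrt(h)) * (mu_k^T x_l) . (mu_k^T x_m)."""
    h = len(mu_k[0])
    proj = [[sum(mu_k[i][j] * x[i] for i in range(len(x))) for j in range(h)]
            for x in xs]
    return [[sum(a * b for a, b in zip(p_l, p_m)) / math.sqrt(h) for p_m in proj]
            for p_l in proj]


def softmax_rows(matrix):
    """Row-wise softmax (the normalisation axis is an assumption of this sketch)."""
    out = []
    for row in matrix:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def chunk_causal_mask(cluster_ids):
    """M[l][m] = 1 if c_m <= c_l, else 0."""
    return [[1 if c_m <= c_l else 0 for c_m in cluster_ids] for c_l in cluster_ids]


# Toy setting (assumed): d = 2, h = 1, subspace = first coordinate axis.
mu_k = [[1.0], [0.0]]
xs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

A = affinity(mu_k, xs)   # soft path: real-valued scores
T = softmax_rows(A)      # soft path: mixing weights

# Hard path: hypothetical per-token cluster IDs for a 4-token sequence.
mask = chunk_causal_mask([1, 0, 1, 2])
for row in mask:
    print(row)
```

The second token has no component in the subspace, so its affinity with every other token is zero: the gate opens only when both tokens align with $\mathcal{S}_k$. In the mask, tokens 0 and 2 share a chunk and see each other in both directions, even though they are not adjacent.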

Affinity and Mask Visualization

Adjust K to see how the affinity matrix and chunk-causal mask change.

Interactive panel: a clusters (K) slider controlling the displayed affinity matrix and chunk-causal mask.

Why this matters

Gradients to $\{\mu_k\}$ flow exclusively through the soft path. The hard routing reads off the same alignment scores but carries no gradient itself. This design ensures the cluster structure is shaped by the diffusion objective end-to-end — not by a detached heuristic.


Chapter 4

The DCDM Architecture

A chunking stage emits per-token cluster IDs; a denoising stage uses them to factorize the likelihood autoregressively over semantic chunks.

In plain English

Think of DCDM as a two-stage editorial process for restoring a redacted document. Stage one (chunking): a fast reader scans the partially visible text and groups related passages together — "these fragments are about the same topic, these belong to the same code block." Stage two (denoising): for each group, the editor fills in the blanks using everything already restored from earlier groups, but never peeking at later groups.

The clever part is the dual-stream design during training. One copy of the text is corrupted (the noisy stream); the other stays clean (the teacher-forcing stream). A noise mask ensures the clean copy never leaks answers to the noisy copy through the chunking attention. This prevents the model from cheating during training.

The simulation below shows how the chunk-causal mask controls information flow. Tokens in the same chunk see each other bidirectionally; earlier chunks condition later ones.

The cluster identifiers define semantic chunks:

$$\mathcal{B}_k = \{\ell \in \{1, \ldots, L\} : c_\ell = k\}, \quad k = 1, \ldots, K$$

The likelihood factorizes autoregressively over these chunks:

$$p_\theta(\mathbf{x}) = \prod_{k=1}^{K} p_\theta\!\left(\mathbf{x}^{(k)} \mid \mathbf{x}^{(<k)}\right)$$

This mirrors the BDLM factorization from Chapter 1, but with content-defined chunks in place of positional blocks. The noise mask prevents information leakage during training:

$$M_{\ell,m}^{\text{noise}} = \mathbb{I}[\nu_m = 0] \lor \mathbb{I}[\nu_\ell = 1 \land \ell = m]$$
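A literal reading of the noise mask can be checked on a short sequence. The sketch below assumes that $\nu_\ell = 1$ marks a corrupted token and $\nu_\ell = 0$ a clean one; that interpretation, the function name `noise_mask`, and the example pattern are assumptions, and how this mask is combined with the chunk-causal mask is not modelled here.

```python
def noise_mask(nu):
    """M[l][m] = 1 if token m is clean, or if l is noisy and attends to itself."""
    n = len(nu)
    return [[1 if (nu[m] == 0 or (nu[l] == 1 and l == m)) else 0
             for m in range(n)]
            for l in range(n)]


# Hypothetical corruption pattern: tokens 1 and 3 are noised.
nu = [0, 1, 0, 1]
mask = noise_mask(nu)
for row in mask:
    print(row)
```

Under this reading every token may attend to clean positions, a noised token may additionally attend to itself, and no token ever attends to another token's noised state.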

Chunk-Causal Information Flow

Adjust K to see how information flows between chunks. White cells = attention allowed. Dark cells = blocked.

Interactive panel: a clusters (K) slider controlling the displayed chunk-causal attention grid.

Why this matters

DCDM preserves the full computational structure of block diffusion — KV caching across chunks, parallel sampling within chunks, flexible-length generation — while making the unit of parallel denoising content-adaptive. The chunks may be non-contiguous, variable in size, and sequence-dependent. Positional block diffusion is recovered as a special case when learned chunks happen to coincide with fixed contiguous blocks.
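The special-case claim is easy to verify on a small example. In the sketch below, the block-causal mask is written as "attend to your own block and all earlier blocks", which is the usual block-diffusion convention but is stated here as an assumption. The function names and the 6-token cluster IDs are hypothetical.

```python
def chunk_causal_mask(cluster_ids):
    """M[l][m] = 1 if c_m <= c_l, else 0."""
    return [[1 if c_m <= c_l else 0 for c_m in cluster_ids] for c_l in cluster_ids]


def block_causal_mask(length, block_size):
    """Positional baseline: attend within the same block and to earlier blocks."""
    return [[1 if m // block_size <= l // block_size else 0 for m in range(length)]
            for l in range(length)]


def chunks(cluster_ids, num_clusters):
    """B_k = set of positions whose cluster ID equals k."""
    return [[l for l, c in enumerate(cluster_ids) if c == k]
            for k in range(num_clusters)]


L, B = 6, 2

# If the learned IDs happen to be contiguous, the positional mask is recovered.
positional_ids = [l // B for l in range(L)]
same = chunk_causal_mask(positional_ids) == block_causal_mask(L, B)

# Content-defined IDs (hypothetical) give non-contiguous, variable-size chunks.
semantic_ids = [0, 1, 0, 0, 2, 1]
semantic_chunks = chunks(semantic_ids, 3)

print(same)
print(semantic_chunks)
```

Nothing in the mask construction refers to position, so the same code path covers both the positional and the content-defined partition.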


Chapter 5

Keeping Clusters Alive

Without intervention, a few clusters hog all the tokens and the rest starve. Two mechanisms at two timescales prevent this collapse.

In plain English

Imagine assigning 100 students to 8 study groups by letting each student pick freely. A popular group swells to 60 students, leaving others with 3 or 4. Those tiny groups get a terrible experience and their tutors (gradients) give up on them. Over time, the imbalance worsens — centroid collapse.

DCDM fights this two ways. First, a per-sequence auxiliary loss acts like a minimum-quota rule: on every single document, each group must get at least some students. Second, a global bias correction tracks usage across the entire batch and nudges underused groups upward — like a registrar adding bonus points to unpopular courses so enrollment stays balanced.

Toggle load balancing on and off in the simulation below. Watch how cluster sizes drift into collapse without it, and stay uniform with it enabled.

The per-sequence load balancing loss uses a Gumbel-softmax straight-through estimator to produce a differentiable hard sample:

$$\mathcal{L}_{\text{chunk}} = -\frac{1}{BK} \sum_{b=1}^{B} \sum_{k=1}^{K} \log(f_{b,k} + \varepsilon)$$

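Here $b$ indexes the sequences in a batch (so $B$ is the batch size, not the block length of Chapter 1) and $f_{b,k}$ is the fraction of tokens in sequence $b$ assigned to cluster $k$. The loss attains its minimum when every sequence distributes its tokens uniformly across the $K$ clusters. The global-batch bias correction adds a non-trainable bias $b_k$ to the hard-assignment step and updates it from the share $N_k / N$ of tokens routed to cluster $k$ across the batch, with step size $\eta_b$: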

$$b_k \leftarrow b_k - \eta_b \left(\frac{N_k}{N} - \frac{1}{K}\right)$$
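Both balancing terms can be evaluated by hand on a toy batch. The sketch below is forward-only: it computes the loss from hard assignment fractions and does not model the Gumbel-softmax straight-through gradient. The values of $\varepsilon$ and $\eta_b$, the function names, and the example assignments are assumptions made for illustration.

```python
import math


def chunk_loss(assignments, num_clusters, eps=1e-6):
    """-(1 / (B * K)) * sum_b sum_k log(f_{b,k} + eps), with B = number of sequences."""
    total = 0.0
    for seq in assignments:
        for k in range(num_clusters):
            f_bk = seq.count(k) / len(seq)  # fraction of this sequence in cluster k
            total += math.log(f_bk + eps)
    return -total / (len(assignments) * num_clusters)


def update_bias(bias, assignments, num_clusters, step_size):
    """b_k <- b_k - eta_b * (N_k / N - 1 / K), using counts over the whole batch."""
    flat = [c for seq in assignments for c in seq]
    n = len(flat)
    return [b - step_size * (flat.count(k) / n - 1.0 / num_clusters)
            for k, b in enumerate(bias)]


K = 2
balanced = [[0, 1, 0, 1], [1, 0, 1, 0]]  # every sequence splits evenly
skewed = [[0, 0, 0, 1], [0, 0, 0, 0]]    # cluster 1 is starved

loss_balanced = chunk_loss(balanced, K)
loss_skewed = chunk_loss(skewed, K)

# Assumed step size eta_b = 0.1; biases start at zero.
new_bias = update_bias([0.0, 0.0], skewed, K, step_size=0.1)

print(loss_balanced, loss_skewed)
print(new_bias)
```

The starved sequence dominates the skewed loss through the $\log(\varepsilon)$ term, which is the per-sequence pressure. The bias update works on the batch-level counts instead: the overused cluster's bias goes down and the underused cluster's bias goes up.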

Routing Collapse Simulation

Toggle load balancing and watch cluster usage over simulated training steps.

Interactive panel: a load-balancing toggle and a slider for simulated training steps; readouts for clusters (K), max cluster share, min cluster share, and cluster violation.

Why this matters

Even with load balancing, the point-clustering ablation never pushes violation below 2.0 during the entire run, and past 125k steps it rises to 3.33, the signature of progressive centroid collapse. With subspace clustering plus load balancing, violation drops to near zero within 10k steps and stays flat. The two mechanisms operate at different timescales: the auxiliary loss acts per forward pass, the bias correction acts per update interval.


Chapter 6

How Many Chunks?

Too few clusters under-partition the sequence; too many over-fragment it. The sweet spot is $K = 16$.

In plain English

Think of cutting a cake. Two slices means each person gets an enormous, heterogeneous chunk with frosting, filling, and sponge all mixed together. Thirty-two slices means each person gets a crumb. Somewhere in between, each slice is a coherent piece with a fair share of every layer.

DCDM faces the same tradeoff with its number of clusters $K$. Too few clusters (say $K = 8$) and the bilinear gate can't discriminate well between groups — the model approaches the unstructured MDLM limit. Too many clusters ($K = 64$) and each cluster gets so few tokens that bidirectional denoising within a chunk becomes meaningless, while load balancing strains at its stability margin.

The paper's ablation finds the sweet spot at $K = 16$ for the 0.5B model. Hover over the bars below to see how each benchmark responds.

Cluster Count Ablation (Table 3)

Data from the paper's 0.5B-scale ablation. Hover over bars for exact values.

Best K: 16 · Best average: 40.58 · Worst K: 8

Why this matters

The trend is non-monotonic with a clear interior optimum: average performance peaks at $K = 16$ and degrades at both extremes. This inverted-U shape reflects the two failure modes that bracket the choice of $K$. All configurations tested still improve over the positional-block baseline, which suggests that content-adaptive partitioning helps across the whole tested range of $K$.


Chapter 7

Results at Scale

DCDM beats every diffusion baseline at both 0.5B and 1.5B parameters, on 9 downstream benchmarks spanning reasoning, math, and code.

Benchmark Comparison (Table 1)

Toggle scale and hover over bars. All data from the paper.

DCDM average: 33.68 · BDLM average: 32.92 · DCDM advantage: +0.76

Training Efficiency (Figure 3b)

DCDM reaches BDLM's final accuracy noticeably earlier. Reconstructed from paper data.

Why this matters

The ordering DCDM > BDLM > MDLM emerges early in training and remains stable. DCDM matches MDLM's final accuracy well before MDLM finishes, and reaches BDLM's final accuracy noticeably earlier — genuinely faster optimization, not convergence to the same point along a different path. Routing tokens into content-defined blocks supplies denoising targets that are coherent at the level of the partition, so gradient signal concentrates on within-block dependencies rather than being diluted across unrelated positions.


Chapter 8

What This Means

DCDM generalises block diffusion from a positional hack to a learned, content-adaptive partition — and the numbers confirm it works.

In plain English

The big picture: diffusion language models were stuck with a design choice that everyone treated as a neutral engineering detail — split the sequence into blocks of 8. DCDM shows that this choice was not neutral at all. When you let the model learn where the block boundaries should go, based on what the tokens actually mean, you get consistent gains on every task that matters.

The mechanism is elegant: subspace clustering is more stable than point clustering in high dimensions, the soft-hard dual path trains the geometry end-to-end, and load balancing prevents collapse. None of these ingredients is exotic — but putting them together in exactly this way, inside a diffusion objective, is the contribution.

The open question is whether $K$ — the number of chunks — can itself be learned per sequence rather than fixed as an architecture hyperparameter. That would push DCDM even closer to the ideal of fully adaptive granularity.

Key takeaways

Fixed blocks are a limiting abstraction. Positional partitions separate semantically coherent tokens and group unrelated ones. Content-defined chunks avoid this mismatch.
Subspace clustering beats point clustering qualitatively. Not a matter of degree — the point parameterization actively degenerates even with load balancing (violation rises to 3.33 at 200k steps).
The advantage holds across scales. DCDM > BDLM > MDLM at 0.1B, 0.5B, and 1.5B. The gap between structured and unstructured diffusion widens with scale.
DCDM trains faster. Not just to a better final accuracy — it reaches BDLM's ceiling earlier, indicating genuinely faster optimization.
MoE composes cleanly. Adding sparse conditional computation on top of semantic chunking yields a further +0.5 points on average, with no interference.

Limitation

DCDM treats $K$ as a fixed hyperparameter shared across all sequences and all training stages. Sequences with little structural variety may be over-partitioned, while topically rich ones may be under-partitioned. Learning $K$ per sequence — or maintaining a distribution over $K$ that can be marginalised at inference — is the natural next step.

The bottom line

Block diffusion works because it groups related tokens together for denoising. But it guesses the groups by position. DCDM learns the groups by content, and the results speak for themselves: consistent gains at every scale, on every task category, with no exotic ingredients.
