An Interactive Reading of

ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning

How to make language models reason in parallel without breaking anything

Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
Meta Superintelligence Labs · UC Berkeley · UCSF · November 2025 · arXiv:2512.07843

The paper, in plain English

When you ask a reasoning model a hard math problem, it thinks out loud — one word at a time, left to right. If the answer takes 20,000 tokens, you wait for all 20,000 before you see a result. You could throw more GPUs at the problem, but each GPU would just be sitting idle, waiting for the one before it to finish. It is like having a kitchen full of chefs but only letting one cook at a time.

ThreadWeaver teaches the model to split its own reasoning into independent sub-tasks that can run simultaneously — the way a head chef might assign the sauce, the vegetables, and the plating to three different cooks who all finish at roughly the same time. The key insight is that many math proofs contain genuinely independent sub-problems — compute the horizontal distance and the vertical distance separately, check two solution methods in parallel — and a model can learn to spot those opportunities on its own. The trick is doing this without any modifications to the underlying inference engine: ThreadWeaver uses a lightweight orchestrator that simply sends parallel API requests to a standard LLM server.

The result: ThreadWeaver matches the accuracy of cutting-edge sequential reasoning models (79.9% on AIME24, 71.9% averaged across six benchmarks) while delivering up to 1.53× speedup in token latency and 1.14× speedup in wall-clock time. It establishes a new Pareto frontier: for the first time, you can have your reasoning quality and get your answer faster.

I

Trie-Based Training

Merge parallel training trajectories into a single sequence using a prefix tree. The model learns fork-join reasoning with zero changes to the inference engine.

II

Two-Stage SFT Pipeline

Start with 959 LLM-rewritten trajectories, then scale to 17k through self-training with reward-based filtering — the model teaches itself better parallel structure.

III

P-GRPO Reward

A reinforcement learning reward that jointly optimizes for correctness and acceleration, with thread-wise advantage broadcast and mean-centered normalization for stability.

Chapter 1

The Latency Wall

Why adding more GPUs to a reasoning model does not make it faster — and what to do about it.

In plain English

Imagine a relay race where every runner must wait for the baton from the previous one. You can hire the fastest sprinters in the world, add as many lanes as you want — the total time never drops below the sum of every individual leg. That is the autoregressive bottleneck in language models: each token depends on every token before it.

Now imagine the coach realizes that legs 3 and 4 of the relay cover completely different terrain — they do not need the baton from each other at all. If both runners start at the same time and meet at the finish, the race is only as long as the slower of the two. ThreadWeaver is that coach: it trains the model to identify independent sub-problems and run them in parallel.

The numbers tell the story. On Minerva Math, the average critical path shrinks from 10,600 tokens to 7,300, with a reported speedup of 1.53×.

Autoregressive language models generate tokens sequentially. For a reasoning model tackling hard math problems, a single chain-of-thought can exceed 20,000 tokens. The token latency — the number of tokens on the critical path — directly determines how long the user waits for an answer.

Token latency is defined as the number of tokens on the longest thread of a parallel inference trajectory. For a sequential model, this equals the total number of generated tokens:

$$\text{Token Latency}_{\text{sequential}} = \sum_{t=1}^{T} \mathbf{1}[\text{token}_t \text{ on critical path}] = T$$

When parallelism is introduced, the token latency becomes the length of the critical path through the fork-join structure:

$$\text{Token Latency}_{\text{parallel}} = \sum_{b \in \text{blocks}} \max_{j \in \text{threads}(b)} |T_j| \; + \sum_{s \in \text{sequential}} |S_s|$$

where $|T_j|$ is the length of thread $j$ in a parallel block $b$, and $|S_s|$ is the length of a sequential segment $s$. The speedup is the ratio:

$$\text{Speedup} = \frac{L_{\text{baseline}}^{\text{longest}}}{L_{\text{ours}}^{\text{longest}}}$$

[Interactive chart: critical-path length versus problem structure. Default settings: 15,000 total reasoning tokens, parallelizable fraction 0.40, 3 threads; sequential latency 15,000 tokens.]

The key insight: speedup depends entirely on problem structure. If 40% of a 15k-token reasoning trace is parallelizable into 3 threads, the critical path drops from 15,000 to roughly 11,000 tokens — a 1.36× speedup. But if nothing is parallelizable, the model simply runs sequentially and you lose nothing.
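The arithmetic in that example follows directly from the critical-path formula. The sketch below is a minimal illustration, assuming a trajectory can be modeled as a list of segments where an integer is a sequential segment and a list of integers is one parallel block; the names `token_latency` and `total_tokens` are hypothetical and not from the paper.

```python
from typing import List, Union

# Assumption: an int is a sequential segment length, a list of ints is one
# parallel block holding one length per thread.
Segment = Union[int, List[int]]


def token_latency(segments: List[Segment]) -> int:
    """Critical path: sequential tokens plus the longest thread of each block."""
    total = 0
    for seg in segments:
        total += max(seg) if isinstance(seg, list) else seg
    return total


def total_tokens(segments: List[Segment]) -> int:
    """All generated tokens, which is also the latency of a sequential run."""
    return sum(sum(seg) if isinstance(seg, list) else seg for seg in segments)


# 15,000 tokens, 40% of them split evenly across 3 threads.
trajectory: List[Segment] = [9000, [2000, 2000, 2000]]
sequential = total_tokens(trajectory)
parallel = token_latency(trajectory)
speedup = sequential / parallel
print(sequential, parallel, round(speedup, 2))  # prints: 15000 11000 1.36
```

Uneven threads reduce the gain: with `[9000, [4000, 1000, 1000]]` the critical path is 13,000 tokens, because only the longest thread in a block counts.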

How does the model know when to split? →

Chapter 2

Fork & Join

The markup and state machine that turn a standard LLM into a parallel reasoner — no engine modifications required.

In plain English

Think of a Git branch. A developer creates a feature branch, works on it independently, and then merges it back into main. Other developers can work on their own branches at the same time. The project advances in parallel, but only the main branch carries the full history.

ThreadWeaver uses the same pattern inside a reasoning trace. The model emits special control tokens — <Parallel>, <Outline>, <Thread> — that an external orchestrator reads. When it sees <Parallel>, it sends each thread as a separate API call to the LLM server. When all threads finish, it concatenates the results and hands control back to the model. No changes to position embeddings, KV caches, or attention masks.

The parallel trajectory format extends standard autoregressive generation with lightweight control tokens in a fork-join pattern. A sample trajectory for a distance-formula problem:

<Think>
We will use the distance formula d = sqrt((dx)^2 + (dy)^2).

<Parallel>
<Outlines>
  <Outline>1: Compute the squared horizontal difference (dx)^2.</Outline>
  <Outline>2: Compute the squared vertical difference (dy)^2.</Outline>
</Outlines>

<Thread>1: dx = 2 - (-4) = 6, so (dx)^2 = 36.</Thread>
<Thread>2: dy = -6 - 3 = -9, so (dy)^2 = 81.</Thread>
</Parallel>

Sum the results: 36 + 81 = 117.
Distance d = sqrt(117) = sqrt(9*13) = 3*sqrt(13).
</Think>
    

The inference orchestrator is a minimal state machine with five phases:

1. Sequential
2. Parse Outlines
3. Parallel
4. Join
5. Continue

The design constraint that makes ThreadWeaver deployable: every parallel request is a standard text-completion API call. No custom attention masks, no position-embedding hacks, no modified KV caches. The orchestrator is a thin client-side wrapper. If you have vLLM or SGLang running, you already have everything you need.
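To make the five phases concrete, here is a minimal sketch of such a client-side loop. It is not the paper's orchestrator. The `complete(context, stop)` callable is a hypothetical stand-in for one text-completion request, and the convention that it returns text including the stop string is an assumption. The scripted `fake_complete` exists only so the example runs without a server.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical interface: complete(context, stop) returns the generated text,
# including the stop string when one was hit.
Completer = Callable[[str, List[str]], str]

OUTLINE_ID = re.compile(r"<Outline>\s*(\d+):")


def run_fork_join(prompt: str, complete: Completer, max_blocks: int = 8) -> str:
    context = prompt
    for _ in range(max_blocks):
        # 1. Sequential: decode until the outlines close or the model finishes.
        chunk = complete(context, ["</Outlines>", "</Think>"])
        context += chunk
        if not chunk.endswith("</Outlines>"):
            break
        # 2. Parse outlines: read thread ids from the most recent block.
        ids = OUTLINE_ID.findall(context[context.rfind("<Outlines>"):])
        if not ids:
            break

        # 3. Parallel: one independent completion request per thread.
        def run_thread(tid: str, base: str = context) -> str:
            head = f"\n<Thread>{tid}:"
            return head + complete(base + head, ["</Thread>"])

        with ThreadPoolExecutor(max_workers=len(ids)) as pool:
            threads = list(pool.map(run_thread, ids))
        # 4. Join: append thread outputs in outline order and close the block.
        context += "".join(threads) + "\n</Parallel>\n"
        # 5. Continue: the next loop iteration resumes sequential decoding.
    return context


def fake_complete(context: str, stop: List[str]) -> str:
    """Scripted stand-in for an LLM server, used only to exercise the loop."""
    if context.endswith("<Thread>1:"):
        return " (dx)^2 = 36.</Thread>"
    if context.endswith("<Thread>2:"):
        return " (dy)^2 = 81.</Thread>"
    if "</Parallel>" in context:
        return "36 + 81 = 117.</Think>"
    return ("<Parallel>\n<Outlines>\n<Outline>1: (dx)^2</Outline>\n"
            "<Outline>2: (dy)^2</Outline>\n</Outlines>")


result = run_fork_join("<Think>\n", fake_complete)
print(result)
```

Each thread request sees only the shared context plus its own `<Thread>` header, so sibling threads never appear in one another's prompts. In a real deployment `complete` would wrap a request to the serving endpoint; that wiring is left out here.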

How do you train on fork-join trajectories? →

Chapter 3

Trie-Based Training

Merging parallel branches into a single training sequence — with the right attention mask to prevent cross-thread leakage.

In plain English

Imagine you are writing a choose-your-own-adventure book. Three paths diverge from a common start, each with its own middle section, before reconverging at the ending. If you print all three paths separately, you waste paper repeating the shared beginning and ending three times. Instead, you could print the shared start once, then all three middles, then the shared ending — with a note telling the reader which parts go together.

That is exactly what the trie-based merging does. It packs all the fork-join trajectory segments into a single training sequence, using an ancestor-only attention mask that prevents each thread from "seeing" its siblings. The model trains on one long sequence but learns to generate each branch as if it were isolated.

Training requires every $\langle \text{context}, \text{completion} \rangle$ pair that the orchestrator will encounter during inference. The trie construction has three steps:

1. Extract all $\langle \text{context}, \text{completion} \rangle$ units from the trajectory.
2. Insert them into a token-level prefix tree (trie) whose root is the shared prompt.
3. Flatten the trie into a single training sequence with an ancestor-only attention mask.

The attention mask enforces a critical invariant: token $i$ may attend to token $j$ if and only if $j$ is an ancestor of $i$ in the trie. This prevents cross-thread leakage while preserving shared prefixes.
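The three steps and the mask invariant can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: each unit is given as its context and completion concatenated into one token list, nodes are laid out in insertion order, and a token counts as its own ancestor (as it would under ordinary causal attention). The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_flat_trie(units: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Insert token sequences into a trie.

    Returns the flat token list and, for each token, the index of its parent
    in that list (-1 for a root token).
    """
    tokens: List[str] = []
    parent: List[int] = []
    children: Dict[Tuple[int, str], int] = {}  # (parent index, token) -> index
    for unit in units:
        node = -1
        for tok in unit:
            key = (node, tok)
            if key not in children:  # new branch: append a node
                children[key] = len(tokens)
                tokens.append(tok)
                parent.append(node)
            node = children[key]  # shared prefix: reuse the node
    return tokens, parent


def ancestor_mask(parent: List[int]) -> List[List[bool]]:
    """mask[i][j] is True iff token j is token i itself or one of its ancestors."""
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root
            mask[i][j] = True
            j = parent[j]
    return mask


# Two threads sharing the prefix ["Q", "plan"].
units = [["Q", "plan", "a1", "a2"], ["Q", "plan", "b1"]]
tokens, parent = build_flat_trie(units)
mask = ancestor_mask(parent)
print(tokens)                                    # ['Q', 'plan', 'a1', 'a2', 'b1']
print(sum(len(u) for u in units), len(tokens))   # 7 5
print(mask[4])                                   # [True, True, False, False, True]
```

The last line shows the invariant at work: `b1` attends to the shared prefix and itself, but not to `a1` or `a2` from the sibling thread. The shared prefix is stored once, which is where the length saving over separate sequences comes from.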

[Interactive chart: naive versus trie-merged sequence length. Default settings: 3 threads, shared prefix length 500 tokens, average thread length 800 tokens.]

The trie co-design guarantees something subtle but critical: disabling parallel blocks recovers ordinary autoregressive behavior. The sequential trajectory is always a valid subsequence of the trie-flattened sequence, so the model degrades gracefully — no parallel infrastructure, no problem.

Where does the training data come from? →

Chapter 4

The SFT Data Pipeline

From 959 cold-start examples to 17,491 self-trained trajectories — a two-stage pipeline that scales parallel reasoning data.

In plain English

Suppose you want to teach a student to solve math problems in parallel — tackling sub-problems simultaneously instead of one after another. You could write parallel solutions by hand, but that is slow and expensive. Instead, you start with the student's existing sequential homework, hire a tutor to annotate 1,000 problems with "these steps can run in parallel," and then have the student practice on 53,000 problems using that new skill. Only the problems they get right — and that have valid parallel structure — make it into the final training set.

That is ThreadWeaver's two-stage pipeline. Stage 1: GPT-5 rewrites 959 Qwen3-8B trajectories with parallel annotations. Stage 2: the model itself generates 53k parallel trajectories and keeps only the 17,491 that are both correct and structurally valid. Self-training bridges the gap between what the tutor thinks parallel reasoning looks like and what the student naturally produces.

Stage 1 applies a five-step annotation pipeline to sequential Qwen3-8B traces:

1. Identify parallel blocks: the LLM annotates line-numbered spans.
2. Extract canonical threads: enforce contiguity and ordering constraints.
3. Rewrite for clarity: remove cross-thread dependencies via targeted edits.
4. Add outlines: generate path-specific plans for each thread.
5. Filter: discard structurally invalid or degenerate trajectories.

Stage 2 runs the cold-start model on the full 53k prompt set with parallel inference, keeping only trajectories that are both answer-correct and structurally valid.
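The Stage 2 filter amounts to rejection sampling with two predicates. The sketch below is illustrative only. The text does not spell out the validity rules, so the checks here (balanced, non-nested `<Parallel>` blocks, and thread ids that match outline ids in order) are assumptions, as are the sample field names `text`, `answer`, and `gold`.

```python
import re
from typing import Dict, List

PARALLEL = re.compile(r"<Parallel>(.*?)</Parallel>", re.S)
OUTLINE = re.compile(r"<Outline>\s*(\d+):.*?</Outline>", re.S)
THREAD = re.compile(r"<Thread>\s*(\d+):.*?</Thread>", re.S)


def is_structurally_valid(text: str) -> bool:
    """Assumed rules: tags balance, blocks do not nest, threads match outlines."""
    opens = text.count("<Parallel>")
    if opens != text.count("</Parallel>"):
        return False
    blocks = PARALLEL.findall(text)
    if len(blocks) != opens:  # nested or interleaved blocks
        return False
    for block in blocks:
        outlines = OUTLINE.findall(block)
        threads = THREAD.findall(block)
        if not outlines or outlines != threads:
            return False
    return True  # in this sketch, a trajectory with no block passes trivially


def filter_for_self_training(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep samples that are both answer-correct and structurally valid."""
    return [s for s in samples
            if s["answer"] == s["gold"] and is_structurally_valid(s["text"])]


good = ("<Think>plan\n<Parallel>\n<Outlines>\n<Outline>1: dx</Outline>\n"
        "<Outline>2: dy</Outline>\n</Outlines>\n<Thread>1: 36</Thread>\n"
        "<Thread>2: 81</Thread>\n</Parallel>\n117</Think>")
missing_thread = good.replace("<Thread>2: 81</Thread>\n", "")

samples = [
    {"text": good, "answer": "3*sqrt(13)", "gold": "3*sqrt(13)"},
    {"text": good, "answer": "117", "gold": "3*sqrt(13)"},            # wrong answer
    {"text": missing_thread, "answer": "3*sqrt(13)", "gold": "3*sqrt(13)"},
]
kept = filter_for_self_training(samples)
print(len(kept))  # 1
```

A production filter would use a proper answer checker instead of string equality; exact matching is used here only to keep the example self-contained.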

[Interactive chart: dataset size and estimated AIME24 accuracy versus filtering thresholds. Default settings: Stage 1 rewrite budget 1,000, Stage 2 answer-correct rate 0.65, structural validity rate 0.50; Stage 1 samples 959.]

The ablation in the paper shows that dataset-model compatibility matters more than teacher strength. Training Qwen3-8B on Multiverse's DeepSeek-R1-curated data yields only 62.2% on AIME24, versus 74.5% from ThreadWeaver's Qwen3-native data — despite Multiverse using a stronger teacher model. RL cannot fully compensate for a distributional mismatch in SFT data.

How does reinforcement learning teach better parallelism? →

Chapter 5

P-GRPO: The Reward Signal

A parallelization-aware reward that jointly optimizes for correctness and speed — with a crucial normalization fix.

In plain English

Imagine training a delivery driver. You reward them for getting packages to the right address (correctness) and for taking less time (speed). But if you weight the speed reward too heavily, they start skipping stops. And if you rescale rewards by their spread within a batch, a batch where every driver delivers correctly leaves only tiny speed differences to compare, and dividing by that tiny spread blows them up until speed dominates the signal.

ThreadWeaver's P-GRPO solves this with two ideas. First, the acceleration reward is only given when the answer is correct — no credit for fast wrong answers. Second, it removes standard-deviation normalization and uses simple mean-centering instead. This keeps the relative weight between correctness and acceleration stable. The model never learns to trade accuracy for speed.

The total reward for a trajectory $\tau$ combines correctness and acceleration:

$$r(\tau) = R_{\text{correct}}(\tau) + R_{\text{accel}}(\tau)$$

$$R_{\text{correct}}(\tau) = \mathbf{1}\{\text{Correct}(\tau)\}$$

$$R_{\text{accel}}(\tau) = \mathbf{1}\{\text{Correct}(\tau)\} \cdot \min(\rho \cdot \eta(\tau),\; \rho_{\text{clip}})$$

The acceleration ratio $\eta$ measures how much of the total work was taken off the critical path:

$$\eta(\tau) = 1 - \frac{L_{\text{longest}}}{L_{\text{total}}}$$

where $L_{\text{longest}}$ is the number of tokens on the longest thread of the trajectory (its token latency) and $L_{\text{total}}$ is the total number of generated tokens. In the paper, $\rho = 0.5$ and $\rho_{\text{clip}} = 0.2$.
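A few worked values show how the clip and the correctness gate interact. This is a direct transcription of the formulas above with the paper's $\rho = 0.5$ and $\rho_{\text{clip}} = 0.2$; the function names and the example token counts are made up for illustration.

```python
def acceleration_ratio(longest: int, total: int) -> float:
    """eta = 1 - L_longest / L_total."""
    return 1.0 - longest / total


def reward(correct: bool, longest: int, total: int,
           rho: float = 0.5, rho_clip: float = 0.2) -> float:
    """Correctness reward plus a clipped acceleration bonus, gated on correctness."""
    if not correct:
        return 0.0  # no credit for fast wrong answers
    return 1.0 + min(rho * acceleration_ratio(longest, total), rho_clip)


print(round(reward(True, 15000, 15000), 3))   # 1.0    fully sequential
print(round(reward(True, 11000, 15000), 3))   # 1.133  eta = 0.267, below the clip
print(round(reward(True, 5000, 15000), 3))    # 1.2    eta = 0.667, clipped at 0.2
print(round(reward(False, 5000, 15000), 3))   # 0.0    wrong answer
```

With these settings the bonus saturates once $\eta$ reaches 0.4, so a correct answer is always worth at least five times the largest possible acceleration bonus.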

The P-GRPO advantage uses mean-centering only (no standard-deviation normalization):

$$A^{\text{P-GRPO}}_{p,i} = r_{p,i} - \mu_p$$

Here $\mu_p$ is the mean reward over the $k$ trajectories sampled for prompt $p$. The advantage is broadcast to every token in the trajectory, including all threads. The loss is:

$$\mathcal{L}^{\text{P-GRPO}}(\theta) = -\frac{1}{\sum_{p \in B} \sum_{i=1}^{k} T_{p,i}} \sum_{p \in B} \sum_{i=1}^{k} A^{\text{P-GRPO}}_{p,i} \sum_{m=1}^{M_i} \sum_{t} \log \pi_\theta(\text{comp}^{(i,m)}_t \mid \text{cont}^{(i,m)}_t)$$
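The effect of dropping standard-deviation normalization is easy to see on a group where every trajectory is correct and rewards differ only by a small acceleration bonus. The sketch below handles a single prompt's group and uses plain Python lists; the function names, the example rewards, and the small `eps` in the normalized variant are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def mean_centered(rewards: List[float]) -> List[float]:
    """P-GRPO advantage: reward minus the group mean."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def std_normalized(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Contrast case: mean-centering followed by division by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pgrpo_loss(advantages: List[float], logprobs: List[List[float]]) -> float:
    """logprobs[i] holds the log-probabilities of every completion token of
    trajectory i, all threads included; each gets the same advantage."""
    n_tokens = sum(len(lp) for lp in logprobs)
    weighted = sum(a * sum(lp) for a, lp in zip(advantages, logprobs))
    return -weighted / n_tokens


# All four rollouts are correct; they differ only in acceleration bonus.
rewards = [1.10, 1.12, 1.14, 1.12]
centered = mean_centered(rewards)
normalized = std_normalized(rewards)
print(round(max(centered), 3))    # 0.02
print(round(max(normalized), 2))  # 1.41
```

Under mean-centering the best rollout gets an advantage of about 0.02, in proportion to the small bonus it earned. Dividing by the group's standard deviation inflates the same difference to about 1.41, which is the amplification the ablation below points to.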

[Interactive chart: accuracy versus acceleration across training steps. Default settings: acceleration scale rho 0.50, acceleration clip 0.20, with a toggle for standard-deviation normalization.]

The paper's ablation is stark: with standard-deviation normalization, AIME24 accuracy drops to 74.8% and trajectories bloat to 30.1k tokens. With mean-centering only, accuracy rises to 79.9% and trajectories shrink to 21.1k. The normalization choice is not cosmetic — it is the difference between a model that hacks the reward and one that actually learns to reason in parallel.

How much faster is ThreadWeaver in practice? →

Chapter 6

Speedup Across Benchmarks

The per-problem speedup distributions reveal a clear pattern: acceleration is fundamentally question-dependent.

In plain English

Not every math problem benefits from parallelism. "What is 2 + 2?" has no parallel structure — it is one step. But "compute the distance between two points using the formula" has two independent sub-problems (horizontal and vertical components) that can run simultaneously. ThreadWeaver automatically detects which problems have exploitable structure.

The histograms below show the speedup distribution for each benchmark. On MATH500, some problems get a 3× speedup — the model finds rich parallel structure and exploits it aggressively. On AIME25, the average speedup is only 1.03× — the problems are so hard that the model spends more time on self-reflection, which is harder to parallelize. The model is honest: it only parallelizes when it helps.

The main results across six benchmarks:

Benchmark	Sequential Accuracy	ThreadWeaver Accuracy	Sequential Token Latency	ThreadWeaver Token Latency	Speedup
AIME24	78.3%	79.9%	19.4k	16.9k	1.14×
AIME25	61.6%	60.5%	24.6k	24.0k	1.03×
AMC23	92.6%	92.3%	13.8k	12.0k	1.16×
MATH500	91.8%	91.4%	7.2k	6.4k	1.23×
Minerva Math	43.9%	43.7%	10.6k	7.3k	1.53×
OlympiadBench	65.0%	63.5%	15.2k	12.8k	1.21×
Average	72.2%	71.9%	15.1k	13.2k	1.22×
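As a quick arithmetic check, the Average row is the unweighted mean of the six benchmark rows. The snippet below recomputes it from the values in the table and also computes the ratio of the two averaged latencies. That ratio comes out lower than the averaged Speedup column; the text here does not say how per-benchmark speedups are aggregated, so treat the gap as an observation, not a correction.

```python
# Values copied from the table: (sequential accuracy, ThreadWeaver accuracy,
# sequential latency in k tokens, ThreadWeaver latency in k tokens, speedup).
rows = {
    "AIME24":        (78.3, 79.9, 19.4, 16.9, 1.14),
    "AIME25":        (61.6, 60.5, 24.6, 24.0, 1.03),
    "AMC23":         (92.6, 92.3, 13.8, 12.0, 1.16),
    "MATH500":       (91.8, 91.4, 7.2, 6.4, 1.23),
    "Minerva Math":  (43.9, 43.7, 10.6, 7.3, 1.53),
    "OlympiadBench": (65.0, 63.5, 15.2, 12.8, 1.21),
}

columns = list(zip(*rows.values()))
averages = [sum(col) / len(col) for col in columns]
print([round(a, 2) for a in averages])  # [72.2, 71.88, 15.13, 13.23, 1.22]

# Ratio of the averaged latencies, for comparison with the averaged speedups.
ratio_of_means = averages[2] / averages[3]
print(round(ratio_of_means, 2))  # 1.14
```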

[Interactive chart: simulated per-problem speedup distribution. Default settings: 2.5 average threads per problem, average parallelizable fraction 0.35, variance (problem diversity) 0.15.]

The wall-clock measurement confirms it: on 50 MATH500 problems with 4 GPUs, ThreadWeaver reduces latency from 162s to 142s — a 1.14× wall-clock speedup. The gap between token-latency speedup (1.23×) and wall-clock speedup reflects real scheduling overhead, but the gains are genuine and measurable.

What does the Pareto frontier look like? →

Chapter 7

The Pareto Frontier

ThreadWeaver establishes a new speed-accuracy Pareto frontier, dominating prior adaptive parallel reasoning methods.

In plain English

Think of the Pareto frontier like fuel efficiency in cars. A Prius gets great mileage but is slow. A Ferrari is fast but guzzles gas. A car that is at least as fast as the Ferrari and at least as efficient as the Prius would dominate both and push the frontier outward.

ThreadWeaver achieves exactly this for reasoning models. It is roughly as accurate as the sequential baseline (79.9% vs 78.3% on AIME24) while being 1.14× faster. Compared to the best prior parallel method (Multiverse, 53.8% accuracy, 1.18× self-parallelism), ThreadWeaver is dramatically more accurate at roughly equivalent speedup. It does not trade accuracy for speed — it gets both.

Comparison of adaptive parallel reasoning methods on AIME24:

Model	Size	Self-Parallelism	Activation Ratio	AIME24 Accuracy
Multiverse-zero	32B	1.04×	—	52.1%
Multiverse	32B	1.18×	—	53.8%
Parallel-R1-Seen	4B	—	27.3%	19.4%
ThreadWeaver	8B	1.25×	85.2%	79.9%

[Interactive chart: accuracy versus speedup for the adaptive parallel reasoning methods compared above.]

ThreadWeaver built on an 8B model outperforms Multiverse built on a 32B model — both in accuracy and in measured self-parallelism speedup. Training recipe matters more than model size. The SFT pipeline, self-training, and P-GRPO together extract far more parallel capability from a smaller model than prior methods extract from a much larger one.
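Dominance on two axes can be checked mechanically. The sketch below applies the usual Pareto definition (at least as good on every axis, strictly better on one) to the three rows of the table that report both self-parallelism and AIME24 accuracy; Parallel-R1-Seen is left out because its self-parallelism is not listed. The helper names are illustrative, and the comparison ignores the difference in model size.

```python
from typing import Dict, List, Tuple

# (self-parallelism, AIME24 accuracy in %) from the table above.
methods: Dict[str, Tuple[float, float]] = {
    "Multiverse-zero": (1.04, 52.1),
    "Multiverse": (1.18, 53.8),
    "ThreadWeaver": (1.25, 79.9),
}


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """a dominates b if it is no worse on every axis and better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of the points that no other point dominates."""
    return [name for name, p in points.items()
            if not any(dominates(q, p)
                       for other, q in points.items() if other != name)]


print(pareto_front(methods))  # ['ThreadWeaver']
```

Among these three entries ThreadWeaver is the only point on the frontier, since it is higher on both axes than either Multiverse variant.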

What does this mean for the future of reasoning? →

Chapter 8

What It Means

Three takeaways for anyone building, deploying, or relying on reasoning models.

In plain English

ThreadWeaver shows something many people assumed was out of reach: you can have parallel reasoning without sacrificing accuracy and without rewriting your inference stack. The model learns to find the natural seams in a problem, the places where two sub-tasks are genuinely independent, and exploits them.

The practical implication is straightforward. If you are running a reasoning model at scale, the long, expensive queries are where the savings matter most, provided they contain independent sub-problems: Minerva Math reaches 1.53×, while AIME25, where the model spends more of its effort on self-reflection, gains only 1.03×. More compute buys you less waiting, not just more thinking.

The open question is how far this can go. The current system limits itself to a single level of parallelism. Nested threads — threads within threads — could unlock much larger speedups on problems with hierarchical structure. And if the model can learn to reason about the hardware it runs on — how many GPUs, how fast the network — it could adaptively choose how many threads to spawn. ThreadWeaver is a proof that the approach works. The ceiling is still far above.

Three Takeaways

Parallelism is learnable. The model does not need hand-crafted heuristics or external task decomposition. P-GRPO teaches it to find exploitable structure on its own — and to fall back to sequential reasoning when none exists.
Engine compatibility matters. The trie-based training co-design is what makes deployment possible. Any team running vLLM or SGLang can adopt ThreadWeaver with a client-side wrapper. No infrastructure changes.
SFT quality is the bottleneck. The ablation is unambiguous: RL on top of poor SFT data (62.2% on AIME24) barely closes the gap to good SFT data (74.5%). Invest in your cold-start dataset before you invest in RL.

Future Directions

Nested parallelism — threads within threads for hierarchical problem structure.
Hardware-aware spawning — models that reason about GPU count and network topology to choose thread count.
Beyond math — parallelizing agent interactions in software engineering, scientific research, and multi-step tool use.

ThreadWeaver establishes that the trade-off between reasoning quality and inference speed is not fundamental. With the right training pipeline — trie-based merging, self-training, and P-GRPO — an 8B model can match a sequential baseline's accuracy while being 1.14× faster in wall-clock time and up to 1.53× faster in token latency. The Pareto frontier has moved. Everything built on reasoning models should reconsider its latency assumptions.
