ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning
How to make language models reason in parallel without breaking anything
Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
Meta Superintelligence Labs · UC Berkeley · UCSF · November 2025 · arXiv:2512.07843
The paper, in plain English
When you ask a reasoning model a hard math problem, it thinks out loud, one token at a time, left to right. If the answer takes 20,000 tokens, you wait for all 20,000 before you see a result. You could throw more GPUs at the problem, but the extra GPUs would mostly sit idle, because each token depends on the ones generated before it. It is like having a kitchen full of chefs but only letting one cook at a time.
ThreadWeaver teaches the model to split its own reasoning into independent sub-tasks that can run simultaneously — the way a head chef might assign the sauce, the vegetables, and the plating to three different cooks who all finish at roughly the same time. The key insight is that many math proofs contain genuinely independent sub-problems — compute the horizontal distance and the vertical distance separately, check two solution methods in parallel — and a model can learn to spot those opportunities on its own. The trick is doing this without any modifications to the underlying inference engine: ThreadWeaver uses a lightweight orchestrator that simply sends parallel API requests to a standard LLM server.
The result: ThreadWeaver matches the accuracy of cutting-edge sequential reasoning models (79.9% on AIME24, 71.9% averaged across six benchmarks) while delivering up to 1.53× speedup in token latency and 1.14× speedup in wall-clock time. It establishes a new Pareto frontier: for the first time, you can have your reasoning quality and get your answer faster.
I. Trie-Based Training
Merge parallel training trajectories into a single sequence using a prefix tree. The model learns fork-join reasoning with zero changes to the inference engine.
II. Two-Stage SFT Pipeline
Start with 959 LLM-rewritten trajectories, then scale to 17k through self-training with reward-based filtering, so the model teaches itself better parallel structure.
III. P-GRPO Reward
A reinforcement learning reward that jointly optimizes for correctness and acceleration, with thread-wise advantage broadcast and mean-centered normalization for stability.
Chapter 1
The Latency Wall
Why adding more GPUs to a reasoning model does not make it faster — and what to do about it.
Autoregressive language models generate tokens sequentially. For a reasoning model tackling hard math problems, a single chain-of-thought can exceed 20,000 tokens. The token latency — the number of tokens on the critical path — directly determines how long the user waits for an answer.
Token latency is defined as the number of tokens on the longest thread of a parallel inference trajectory. For a sequential model, this equals the total number of generated tokens.
The key insight: speedup depends entirely on problem structure. If 40% of a 15k-token reasoning trace is parallelizable into 3 equal threads, the critical path is the 9,000 sequential tokens plus one 2,000-token thread, so it drops from 15,000 to roughly 11,000 tokens, a 1.36× speedup. But if nothing is parallelizable, the model simply runs sequentially and you lose nothing.
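The arithmetic behind that example can be written down as a small sketch. It assumes the parallelizable portion splits evenly across threads, which is a simplification; `parallel_latency` and `speedup` are illustrative names, not functions from the paper.

```python
def parallel_latency(total_tokens: int, parallel_fraction: float, num_threads: int) -> float:
    """Critical-path length in tokens, assuming the parallel part splits evenly."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    sequential_part = total_tokens * (1.0 - parallel_fraction)
    parallel_part = total_tokens * parallel_fraction
    # Only one thread's share of the parallel part sits on the critical path.
    return sequential_part + parallel_part / num_threads


def speedup(total_tokens: int, parallel_fraction: float, num_threads: int) -> float:
    """Token-latency speedup relative to fully sequential decoding."""
    return total_tokens / parallel_latency(total_tokens, parallel_fraction, num_threads)


latency = parallel_latency(15_000, 0.4, 3)   # about 11,000 tokens
ratio = speedup(15_000, 0.4, 3)              # about 1.36
```

With `parallel_fraction = 0.0` the function returns the sequential length unchanged, which matches the "you lose nothing" case.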
The markup and state machine that turn a standard LLM into a parallel reasoner — no engine modifications required.
The parallel trajectory format extends standard autoregressive generation with lightweight control tokens in a fork-join pattern. A sample trajectory for a distance-formula problem:
<Think>
We will use the distance formula d = sqrt((dx)^2 + (dy)^2).
<Parallel>
<Outlines>
<Outline>1: Compute the squared horizontal difference (dx)^2.</Outline>
<Outline>2: Compute the squared vertical difference (dy)^2.</Outline>
</Outlines>
<Thread>1: dx = 2 - (-4) = 6, so (dx)^2 = 36.</Thread>
<Thread>2: dy = -6 - 3 = -9, so (dy)^2 = 81.</Thread>
</Parallel>
Sum the results: 36 + 81 = 117.
Distance d = sqrt(117) = sqrt(9*13) = 3*sqrt(13).
</Think>
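To make the format concrete, here is a minimal sketch of how a client might pull outlines and threads out of such a trajectory. It is an illustration only: the regular expressions and the `parse_parallel_blocks` helper are assumptions, and the sketch does not handle nested parallel blocks.

```python
import re

# Tag names follow the sample trajectory; non-greedy matching assumes no nesting.
PARALLEL_RE = re.compile(r"<Parallel>(.*?)</Parallel>", re.DOTALL)
OUTLINE_RE = re.compile(r"<Outline>(.*?)</Outline>", re.DOTALL)
THREAD_RE = re.compile(r"<Thread>(.*?)</Thread>", re.DOTALL)


def parse_parallel_blocks(trajectory: str) -> list[dict]:
    """Return one dict per <Parallel> block with its outlines and threads."""
    blocks = []
    for match in PARALLEL_RE.finditer(trajectory):
        body = match.group(1)
        blocks.append({
            "outlines": [o.strip() for o in OUTLINE_RE.findall(body)],
            "threads": [t.strip() for t in THREAD_RE.findall(body)],
        })
    return blocks


example = (
    "<Think>We will use the distance formula."
    "<Parallel><Outlines>"
    "<Outline>1: Compute (dx)^2.</Outline>"
    "<Outline>2: Compute (dy)^2.</Outline>"
    "</Outlines>"
    "<Thread>1: dx = 6, so (dx)^2 = 36.</Thread>"
    "<Thread>2: dy = -9, so (dy)^2 = 81.</Thread>"
    "</Parallel>Sum the results: 117.</Think>"
)
blocks = parse_parallel_blocks(example)
```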
The inference orchestrator is a minimal state machine with five phases:
1. Sequential
2. Parse Outlines
3. Parallel
4. Join
5. Continue
The design constraint that makes ThreadWeaver deployable: every parallel request is a standard text-completion API call. No custom attention masks, no position-embedding hacks, no modified KV caches. The orchestrator is a thin client-side wrapper. If you have vLLM or SGLang running, you already have everything you need.
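A rough sketch of such a client-side wrapper follows, organized around the five phases listed above. Everything in it is an assumption made for illustration: the `Completer` signature, the choice of stop strings, and the way thread prompts are built are not taken from the paper, and `fake_complete` is a scripted stand-in for a real text-completion endpoint.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical interface: (prompt, stop string) -> generated text, including
# the stop string when generation halted on it.
Completer = Callable[[str, str], str]

OUTLINE_ID_RE = re.compile(r"<Outline>(\d+):")


def run_orchestrator(prompt: str, complete: Completer, max_blocks: int = 8) -> str:
    text = prompt
    for _ in range(max_blocks):
        # Phase 1 (Sequential): decode until the model closes an outline list.
        chunk = complete(text, "</Outlines>")
        text += chunk
        if not chunk.endswith("</Outlines>"):
            return text  # no fork requested, so the trajectory is finished
        # Phase 2 (Parse Outlines): read thread ids from the latest outline list.
        thread_ids = OUTLINE_ID_RE.findall(text[text.rfind("<Outlines>"):])
        # Phase 3 (Parallel): one independent request per thread, same prefix.
        prompts = [f"{text}<Thread>{tid}:" for tid in thread_ids]
        with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
            bodies = list(pool.map(lambda p: complete(p, "</Thread>"), prompts))
        # Phase 4 (Join): append threads in outline order and close the block.
        for tid, body in zip(thread_ids, bodies):
            text += f"<Thread>{tid}:{body}"
        text += "</Parallel>"
        # Phase 5 (Continue): loop back to sequential decoding.
    return text


def fake_complete(prompt: str, stop: str) -> str:
    """Scripted stand-in for an LLM server, used only to exercise the loop."""
    if stop == "</Thread>":
        tid = prompt.rsplit("<Thread>", 1)[1].rstrip(":")
        return f" result of thread {tid}.</Thread>"
    if "</Parallel>" in prompt:
        return " Combine the results.</Think>"
    return ("<Think>Plan.<Parallel><Outlines><Outline>1: part A</Outline>"
            "<Outline>2: part B</Outline></Outlines>")


result = run_orchestrator("Question. ", fake_complete)
```

In a real deployment `complete` would wrap an HTTP call to the serving engine's completion endpoint; the loop itself never needs to touch attention masks or KV caches.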
Merging parallel branches into a single training sequence — with the right attention mask to prevent cross-thread leakage.
Training requires every $\langle \text{context}, \text{completion} \rangle$ pair that the orchestrator will encounter during inference. The trie construction has three steps:
1. Extract all $\langle \text{context}, \text{completion} \rangle$ units from the trajectory.
2. Insert them into a token-level prefix tree (trie) whose root is the shared prompt.
3. Flatten the trie into a single training sequence with an ancestor-only attention mask.
The attention mask enforces a critical invariant: token $i$ may attend to token $j$ if and only if $j$ is an ancestor of $i$ in the trie. This prevents cross-thread leakage while preserving shared prefixes.
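The three steps and the mask invariant can be illustrated with a toy sketch. It uses string tokens instead of token ids, lets each token attend to itself as well as its ancestors (an assumption, in line with ordinary causal masks), and all names (`TrieNode`, `build_trie`, `flatten`, `ancestor_mask`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    token: str
    children: dict = field(default_factory=dict)


def build_trie(sequences):
    """Insert each context+completion token list into a shared prefix tree."""
    root = TrieNode(token="<root>")
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode(tok))
    return root


def flatten(root):
    """Pre-order traversal; returns tokens and each token's parent position."""
    tokens, parents = [], []
    stack = [(child, -1) for child in reversed(list(root.children.values()))]
    while stack:
        node, parent_idx = stack.pop()
        idx = len(tokens)
        tokens.append(node.token)
        parents.append(parent_idx)
        for child in reversed(list(node.children.values())):
            stack.append((child, idx))
    return tokens, parents


def ancestor_mask(parents):
    """mask[i][j] is True when j is i itself or an ancestor of i in the trie."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


prefix = ["Q", "plan"]
sequences = [prefix + ["t1a", "t1b"], prefix + ["t2a"]]
tokens, parents = flatten(build_trie(sequences))
mask = ancestor_mask(parents)
```

Here the two sequences total 7 tokens but the flattened trie holds 5, because the shared prefix is stored once. The row for `t2a` attends to `Q` and `plan` but not to `t1a` or `t1b`, which is the cross-thread isolation described above.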
The trie co-design guarantees something subtle but critical: disabling parallel blocks recovers ordinary autoregressive behavior. The sequential trajectory is always a valid subsequence of the trie-flattened sequence, so the model degrades gracefully — no parallel infrastructure, no problem.
From 959 cold-start examples to 17,491 self-trained trajectories — a two-stage pipeline that scales parallel reasoning data.
Stage 1 applies a five-step annotation pipeline to sequential Qwen3-8B traces:
1. Identify parallel blocks: the LLM annotates line-numbered spans.
2. Extract canonical threads: enforce contiguity and ordering constraints.
3. Rewrite for clarity: remove cross-thread dependencies via targeted edits.
4. Add outlines: generate path-specific plans for each thread.
5. Filter: discard structurally invalid or degenerate trajectories.
Stage 2 runs the cold-start model on the full 53k prompt set with parallel inference, keeping only trajectories that are both answer-correct and structurally valid.
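The Stage 2 filter can be sketched as a simple predicate over sampled trajectories. The paper's exact validity rules are not reproduced here; the checks below (balanced tags, and one thread per outline inside each parallel block) are assumptions chosen for illustration, as are the `Sample` fields and exact-match answer comparison.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    text: str        # full generated trajectory
    predicted: str   # answer extracted from the trajectory
    reference: str   # ground-truth answer


PARALLEL_RE = re.compile(r"<Parallel>(.*?)</Parallel>", re.DOTALL)


def is_structurally_valid(text: str) -> bool:
    """Assumed validity rules: balanced tags, one thread per outline."""
    for tag in ("Parallel", "Outlines", "Outline", "Thread"):
        if text.count(f"<{tag}>") != text.count(f"</{tag}>"):
            return False
    for body in PARALLEL_RE.findall(text):
        n_outlines = body.count("<Outline>")
        n_threads = body.count("<Thread>")
        if n_outlines == 0 or n_outlines != n_threads:
            return False
    return True


def filter_for_self_training(samples):
    """Keep trajectories that are both answer-correct and structurally valid."""
    return [
        s for s in samples
        if s.predicted.strip() == s.reference.strip() and is_structurally_valid(s.text)
    ]


good = ("<Parallel><Outlines><Outline>1: a</Outline><Outline>2: b</Outline></Outlines>"
        "<Thread>1: x</Thread><Thread>2: y</Thread></Parallel>")
broken = good.replace("</Thread></Parallel>", "</Parallel>")  # drops one closing tag
samples = [
    Sample(good, "117", "117"),    # kept
    Sample(good, "116", "117"),    # wrong answer
    Sample(broken, "117", "117"),  # malformed structure
]
kept = filter_for_self_training(samples)
```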
The ablation in the paper shows that dataset-model compatibility matters more than teacher strength. Training Qwen3-8B on Multiverse's DeepSeek-R1-curated data yields only 62.2% on AIME24, versus 74.5% from ThreadWeaver's Qwen3-native data — despite Multiverse using a stronger teacher model. RL cannot fully compensate for a distributional mismatch in SFT data.
The acceleration term of the reward is computed from $L_{\text{longest}}$, the length of the longest thread, and $L_{\text{total}}$, the total number of tokens. In the paper, $\rho = 0.5$ and $\rho_{\text{clip}} = 0.2$.
The P-GRPO advantage uses mean-centering only (no standard-deviation normalization):
$$A^{\text{P-GRPO}}_{p,i} = r_{p,i} - \mu_p$$
This advantage is broadcast to all tokens in the trajectory, including all threads, and enters the policy loss.
The paper's ablation is stark: with standard-deviation normalization, AIME24 accuracy drops to 74.8% and trajectories bloat to 30.1k tokens. With mean-centering only, accuracy rises to 79.9% and trajectories shrink to 21.1k. The normalization choice is not cosmetic — it is the difference between a model that hacks the reward and one that actually learns to reason in parallel.
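The two normalization choices compared in that ablation can be sketched side by side. The reward values below are made up for illustration, and the function names are hypothetical. One plausible reading of the ablation, offered as an assumption rather than the paper's stated explanation, is that dividing by the standard deviation inflates tiny reward differences within a group of otherwise similar rollouts.

```python
from statistics import mean, pstdev


def mean_centered_advantages(rewards):
    """P-GRPO style: A_i = r_i - mu, with no division by the group std."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def std_normalized_advantages(rewards, eps=1e-6):
    """Standard GRPO style: A_i = (r_i - mu) / (sigma + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast(advantage, num_tokens):
    """Assign one trajectory-level advantage to every token, across all threads."""
    return [advantage] * num_tokens


# Illustrative group: all rollouts correct, differing only by a small bonus.
rewards = [1.10, 1.11, 1.10, 1.09]
centered = mean_centered_advantages(rewards)
normalized = std_normalized_advantages(rewards)
token_advantages = broadcast(centered[1], num_tokens=5)
```

In this made-up group the mean-centered advantages stay on the order of 0.01, while the std-normalized ones are stretched to roughly ±1.4, so a small acceleration bonus would dominate the update.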
The per-problem speedup distributions reveal a clear pattern: acceleration is fundamentally question-dependent.
The main results across six benchmarks:
| Benchmark | Seq. Accuracy | TW Accuracy | Seq. Latency (tokens) | TW Latency (tokens) | Speedup |
|---|---|---|---|---|---|
| AIME24 | 78.3% | 79.9% | 19.4k | 16.9k | 1.14× |
| AIME25 | 61.6% | 60.5% | 24.6k | 24.0k | 1.03× |
| AMC23 | 92.6% | 92.3% | 13.8k | 12.0k | 1.16× |
| MATH500 | 91.8% | 91.4% | 7.2k | 6.4k | 1.23× |
| Minerva Math | 43.9% | 43.7% | 10.6k | 7.3k | 1.53× |
| OlympiadBench | 65.0% | 63.5% | 15.2k | 12.8k | 1.21× |
| Average | 72.2% | 71.9% | 15.1k | 13.2k | 1.22× |
The wall-clock measurement confirms it: on 50 MATH500 problems with 4 GPUs, ThreadWeaver reduces latency from 162s to 142s — a 1.14× wall-clock speedup. The gap between token-latency speedup (1.23×) and wall-clock speedup reflects real scheduling overhead, but the gains are genuine and measurable.
ThreadWeaver establishes a new speed-accuracy Pareto frontier, dominating prior adaptive parallel reasoning methods.
Comparison of adaptive parallel reasoning methods on AIME24:
| Model | Size | Self-Parallelism | Activation Ratio | AIME24 Accuracy |
|---|---|---|---|---|
| Multiverse-zero | 32B | 1.04× | n/a | 52.1% |
| Multiverse | 32B | 1.18× | n/a | 53.8% |
| Parallel-R1-Seen | 4B | n/a | 27.3% | 19.4% |
| ThreadWeaver | 8B | 1.25× | 85.2% | 79.9% |
ThreadWeaver built on an 8B model outperforms Multiverse built on a 32B model — both in accuracy and in measured self-parallelism speedup. Training recipe matters more than model size. The SFT pipeline, self-training, and P-GRPO together extract far more parallel capability from a smaller model than prior methods extract from a much larger one.
Three takeaways for anyone building, deploying, or relying on reasoning models.
Three Takeaways
Parallelism is learnable. The model does not need hand-crafted heuristics or external task decomposition. P-GRPO teaches it to find exploitable structure on its own — and to fall back to sequential reasoning when none exists.
Engine compatibility matters. The trie-based training co-design is what makes deployment possible. Any team running vLLM or SGLang can adopt ThreadWeaver with a client-side wrapper. No infrastructure changes.
SFT quality is the bottleneck. The ablation is unambiguous: RL on top of poor SFT data (62.2% on AIME24) barely closes the gap to good SFT data (74.5%). Invest in your cold-start dataset before you invest in RL.
Future Directions
Nested parallelism — threads within threads for hierarchical problem structure.
Hardware-aware spawning — models that reason about GPU count and network topology to choose thread count.
Beyond math — parallelizing agent interactions in software engineering, scientific research, and multi-step tool use.
ThreadWeaver establishes that the trade-off between reasoning quality and inference speed is not fundamental. With the right training pipeline — trie-based merging, self-training, and P-GRPO — an 8B model can match a sequential baseline's accuracy while being 1.14× faster in wall-clock time and up to 1.53× faster in token latency. The Pareto frontier has moved. Everything built on reasoning models should reconsider its latency assumptions.