An Interactive Reading of

ACC: Compiling Agent Trajectories for Long-Context Training

Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao, Kou Shi, Ziao Zhang, Lin Chen, Zehui Chen, Lijun Wu, Feng Zhao
USTC · Shanghai AI Lab · May 2026 · arXiv:2605.21850

The paper, in plain English

When an AI agent solves a problem, it hunts through web pages, reads source code, and queries databases across many turns. The evidence it needs to answer the original question is scattered across all those turns. But here is the problem: the standard way we train these agents actively discards most of that evidence during training. Tool responses are masked out. The model learns which button to press next, but never learns to connect distant clues into a coherent answer.

Agent Context Compilation (ACC) is a disarmingly simple fix. Instead of training turn-by-turn, ACC gathers every tool response and environment observation the agent collected, shuffles them into one long context window alongside the original question, and trains the model to answer directly from that compiled evidence. Think of it as giving a detective all the case files at once, rather than handing them one document at a time and asking what to search for next.

The result: a 30-billion-parameter model trained with ACC matches a 235-billion-parameter model on MRCR and GraphWalks, two demanding long-range reasoning benchmarks. On MRCR, the score rises from 50.2 to 68.3 (+18.1); on GraphWalks, from 69.9 to 77.5 (+7.6). General capabilities such as GPQA and MMLU-Pro hold steady. Under the hood, the model develops task-specific attention patterns: different experts specialise for different reasoning tasks, a flexibility not seen in the baseline.

I

The Supervision Blind Spot

Standard agent SFT masks tool responses, so gradient signals from the final answer decay through intermediate turns. Evidence is collected but never used for learning.

II

Agent Context Compilation

Gather all tool responses into one compiled context. Train the model to answer directly from the assembled evidence — no intermediate turns, no gradient attenuation.

III

Task-Adaptive Restructuring

After ACC training, the model develops different attention and expert routing patterns for different tasks — flexible adaptation rather than a one-size-fits-all fix.

~ 25 minutes · 9 chapters · 8 interactive simulations

CHAPTER 1

The Supervision Blind Spot

A detective collects fingerprints from six rooms, photographs three crime scenes, and interviews ten witnesses. But during debriefing, the sergeant only asks: "Which room did you search next?" Nobody ever asks what the detective found. That is how we currently train AI agents.

In plain English

Imagine studying for an exam by reviewing flashcards. On each card, you see a question, your search strategy, and the answer you found. But your tutor covers the answer with their hand and only quizzes you on your search strategy: "What would you look up next?" You never actually practice synthesising the answers.

That is exactly what standard agent SFT does. The model generates reasoning and tool calls (which are supervised), while the tool's responses — the actual evidence — are masked from the loss. The gradient signal from the final answer has to back-propagate through a chain of intermediate turns to reach the evidence, and it gets weaker at every step.

The consequence: the model learns to press the right buttons but never learns to connect the dots. Drag the slider below to see how many turns it takes for the answer signal to become almost useless to early evidence.

An agent trajectory consists of $k-1$ interaction turns followed by a final answer turn:

$$\tau = (q,\; (r_1, a_1, o_1),\; \ldots,\; (r_{k-1}, a_{k-1}, o_{k-1}),\; (r_k, y))$$

where $r_t$ is the reasoning at turn $t$, $a_t$ is the action, $o_t$ is the tool response (observation), and $(r_k, y)$ is the final reasoning-answer pair. The standard agent SFT objective supervises only the model-generated tokens:

$$\mathcal{L}_{\text{agent}} = -\sum_{t=1}^{k} \sum_{j \in I_t} \log P(\text{token}_j \mid H_{<t}, \text{token}_{<j})$$

where $H_{<t}$ is the interaction history before turn $t$, $I_t = r_t \cup a_t$ for $t < k$, and $I_k = r_k \cup y$. The observations $o_t$ are excluded from the loss. Grouping by turn reveals two components:

$$\mathcal{L}_{\text{agent}} = \underbrace{-\sum_{t=1}^{k-1} \sum_{j \in r_t \cup a_t} \log P(\text{token}_j \mid \cdots)}_{\text{local next-tool selection}} + \underbrace{-\sum_{j \in r_k \cup y} \log P(\text{token}_j \mid H_{<k}, \text{token}_{<j})}_{\text{final answer}}$$

The first $k-1$ terms supervise only local reasoning and tool selection. Any gradient relevant to the final answer $y$ must back-propagate through a long chain of intermediate turns to reach $o_t$, and is heavily weakened. This is the supervision blind spot.
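The masking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the segment labels ("query", "reasoning", "action", "observation", "answer") and the function name `agent_sft_loss_mask` are hypothetical, and tokens are plain strings rather than tokenizer IDs.

```python
from typing import List, Tuple

# Each segment is (kind, tokens). The kind labels are hypothetical names
# chosen for this sketch, not identifiers from the paper.
Segment = Tuple[str, List[str]]

# Model-generated segments are supervised; the query and observations are not.
SUPERVISED_KINDS = {"reasoning", "action", "answer"}


def agent_sft_loss_mask(trajectory: List[Segment]) -> List[int]:
    """Return a 0/1 mask over the flattened trajectory tokens.

    1 marks a token that contributes to the loss (reasoning, action, answer);
    0 marks a token that is only conditioned on (query, observations).
    """
    mask: List[int] = []
    for kind, tokens in trajectory:
        flag = 1 if kind in SUPERVISED_KINDS else 0
        mask.extend([flag] * len(tokens))
    return mask


trajectory = [
    ("query", ["who", "wrote", "X", "?"]),
    ("reasoning", ["search", "first"]),
    ("action", ["search(X)"]),
    ("observation", ["page", "says", "author", "is", "Y"]),
    ("reasoning", ["found", "it"]),
    ("answer", ["Y"]),
]

mask = agent_sft_loss_mask(trajectory)
print(mask)
print(f"{sum(mask)} of {len(mask)} tokens are supervised")
```

In this toy trajectory the five observation tokens, which carry the actual evidence, receive no direct loss term.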

Gradient Attenuation Explorer

See how the gradient signal from the final answer decays as it passes through masked observations. Drag the sliders to explore.

Number of turns (k): 6 · slider range 2 to 20

Gradient decay per turn (γ): 0.60 · slider range 0.10 (slow) to 0.95 (near-lossless)

Outputs: signal reaching turn 1 observation · effective supervision ratio · blind spot severity
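A toy model gives a feel for how quickly the signal fades. The sketch below assumes, purely for illustration, that the signal is multiplied by a constant factor γ for every turn between an observation and the final answer, so the observation at turn $t$ receives $\gamma^{k-t}$. This geometric form and the function name `attenuated_signal` are assumptions of the sketch, not a formula stated in the paper.

```python
def attenuated_signal(k: int, gamma: float, turn: int = 1) -> float:
    """Toy estimate of the answer signal reaching the observation at `turn`.

    Assumes (hypothetically) a constant multiplicative factor `gamma` per
    turn, with k - turn turns separating the observation from the answer.
    """
    if not 1 <= turn <= k:
        raise ValueError("turn must lie between 1 and k")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return gamma ** (k - turn)


# With 6 turns and gamma = 0.60, print the toy signal at each turn.
for t in range(1, 7):
    print(f"turn {t}: {attenuated_signal(6, 0.60, t):.4f}")
```

Under these assumptions, the earliest observation in a six-turn trajectory keeps under a tenth of the signal, and the fraction shrinks geometrically as trajectories get longer.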

Why this matters

The blind spot means that even when an agent collects exactly the right evidence, the model never receives a training signal telling it that evidence was important. It learns to select tools but not to reason over the evidence those tools return. This is the core problem ACC solves.


CHAPTER 2

Three Agents, Three Worlds

Not all evidence is created equal. A web page reads differently from a database table, which reads differently from a Python source file. ACC works across all three — and their diversity turns out to be crucial.

ACC applies to three representative agent classes, each producing trajectories with distinct evidence structures:

Search Agent

Retrieves web pages to answer complex multi-hop questions. Evidence is prose text — articles, summaries, and search snippets. Includes both visited pages and unvisited candidate results as distractors.

Trajectories: 3,369 · Evidence: web pages

SWE Agent

Inspects source files to locate and fix bugs. Evidence is source code — function bodies, class definitions, and import chains. Includes both files involved in the correct patch and unopened files as distractors.

Trajectories: 4,368 · Evidence: source code

SQL Agent

Queries relational tables for structured analytics. Evidence is tabular data — rows and columns encoding multi-hop graph structures. Extracts complete contents of all queried tables.

Trajectories: 3,065 · Evidence: database tables

Trajectory Length Distribution

Each agent type produces trajectories of different lengths. Click the legend to toggle agent types.

Why this matters

The three agent types provide complementary training signal. Search data teaches multi-hop reasoning over prose. SQL data teaches relational graph traversal. SWE data teaches code comprehension. The full mixture outperforms any single type alone.


CHAPTER 3

Agent Context Compilation

If the problem is that evidence is scattered and the gradient cannot reach it, the solution is obvious in hindsight: gather all the evidence into one place, then train directly on it.

In plain English

Imagine a law student preparing for a bar exam. In the old method (standard agent SFT), the student practices legal research: given a case, which database should I search? Which statute should I look up? The student gets good at finding things but never practises writing legal arguments from the materials found.

ACC changes the curriculum. It takes all the materials the research process uncovered — cases, statutes, precedents — bundles them into one packet, and trains the student to write the legal argument directly from that packet. No more "what should I search next?" — just "here is everything, now reason."

The key insight: by removing intermediate turns, the final answer's gradient reaches every piece of evidence directly. No attenuation. No blind spot. The simulation below shows the difference.

ACC converts each trajectory into a training example $\tau_i \mapsto (x_i, y_i, r_i)$, producing a dataset $\mathcal{D} = \{(x_i, y_i, r_i)\}_{i=1}^{M}$. Here $x_i = (q_i, C_i)$ combines the original query with the compiled context, $y_i$ is the final answer, and $r_i$ is the reasoning trace. The new training objective:

$$\mathcal{L}_{\text{ACC}} = -\sum_{j \in r \cup y} \log P(\text{token}_j \mid q, C, \text{token}_{

Unlike $\mathcal{L}_{\text{agent}}$, this objective contains no intermediate action terms. The final answer supervision reaches every evidence token directly without being filtered through turn-level tool selection.
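A compiled training example can be sketched as follows. The function name `build_acc_example` and the dictionary keys are hypothetical, and tokens are plain strings for readability. The only point illustrated is that the query $q$ and compiled context $C$ are conditioning input, while the loss covers $r \cup y$.

```python
from typing import Dict, List


def build_acc_example(
    query: List[str],
    context: List[str],
    reasoning: List[str],
    answer: List[str],
) -> Dict[str, List]:
    """Flatten (q, C, r, y) into one sequence with a 0/1 loss mask.

    The query and compiled context are conditioned on but not supervised;
    the reasoning trace and final answer are supervised.
    """
    tokens = query + context + reasoning + answer
    n_input = len(query) + len(context)
    n_target = len(reasoning) + len(answer)
    loss_mask = [0] * n_input + [1] * n_target
    return {"tokens": tokens, "loss_mask": loss_mask}


example = build_acc_example(
    query=["who", "wrote", "X", "?"],
    context=["page", "says", "author", "is", "Y"],
    reasoning=["the", "page", "names", "Y"],
    answer=["Y"],
)
print(example["loss_mask"])
```

Every supervised token now sits after the full evidence in a single sequence, so no intermediate action turns separate the answer from the context it depends on.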

Loss Comparison: Standard SFT vs ACC

See how gradient signal flows differently through the two training objectives. Drag the slider to change the number of trajectory turns.

Trajectory turns: 5 · slider range 2 to 15

Why this matters

ACC's loss function is strictly simpler than standard agent SFT — fewer terms, no intermediate supervision, direct gradient paths. Simplicity is the point. By removing the turn-level structure that filters gradient signal, ACC ensures that every piece of evidence receives meaningful supervision from the final answer.


CHAPTER 4

Building the Long Context

Gathering evidence is only half the battle. How you arrange it — and what you leave out — determines whether the model learns to find signals in noise or just memorises positions.

In plain English

Imagine studying for a history exam where the textbook always puts the most important paragraph first. You would learn to read beginnings carefully and skim the rest. The exam would be easy — but you would not actually learn to find information, only to trust its position.

ACC avoids this trap by shuffling the evidence pieces randomly before concatenation. Sometimes the key document is first, sometimes last, sometimes buried in the middle. The model must learn to locate relevant information by semantic association, not by sequential position — just like a real researcher scanning a dossier.

ACC also adds distractors: web pages the agent never visited, code files it never opened. These teach the model to filter noise. Try the simulation below — shuffle the evidence and watch how the same answer can be buried at different positions.

For each trajectory, ACC extracts structured evidence pieces $\text{Evi}(\tau) = [e_1, \ldots, e_m]$ and applies a random permutation $\pi$ over $\{1, \ldots, m\}$:

$$C_i = \text{Concat}(e_{\pi(1)}, e_{\pi(2)}, \ldots, e_{\pi(m)}), \quad |C_i| \leq B$$

where $B$ is the token budget (131,072 tokens). Because evidence pieces are self-contained, shuffling forces the model to locate relevant information via semantic association rather than sequential position. Distractors are added to increase difficulty.
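The assembly step can be sketched as below. Several details are assumptions made for illustration only: tokens are counted by whitespace splitting rather than with a real tokenizer, distractors that do not fit the budget are simply skipped, and evidence that exceeds the budget on its own raises an error. The source does not spell out how the budget is enforced, so `compile_context` is a hypothetical helper, not the authors' implementation.

```python
import random
from typing import List, Optional, Sequence


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def compile_context(
    evidence: Sequence[str],
    distractors: Sequence[str],
    budget: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Return shuffled evidence plus as many distractors as fit the budget."""
    used = sum(count_tokens(e) for e in evidence)
    if used > budget:
        raise ValueError("evidence alone exceeds the token budget")

    pieces = list(evidence)
    for d in distractors:
        cost = count_tokens(d)
        # Hypothetical policy: skip any distractor that would overflow B.
        if used + cost <= budget:
            pieces.append(d)
            used += cost

    # Random permutation over all kept pieces, so position carries no signal.
    random.Random(seed).shuffle(pieces)
    return pieces


evidence = ["page A mentions the author", "page B gives the year"]
distractors = [
    "unvisited result about cooking",
    "unvisited result about sports and weather today",
]
pieces = compile_context(evidence, distractors, budget=16, seed=0)
context = "\n\n".join(pieces)
print(context)
```

Here the two evidence pieces use 10 of the 16 budgeted tokens, the first distractor fits, and the second is skipped. Passing a different `seed` changes only the order of the kept pieces.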

Context Assembly Simulator

Watch how ACC assembles a compiled context from a search trajectory. Evidence pieces are shuffled; distractors are mixed in. Click "Shuffle" to randomise the order.


Why this matters

The ablation study shows that removing distractors from Search and SWE lowers MRCR by 3.3 and 3.8 points respectively. The model needs noise in training to learn evidence localisation. Shuffling prevents positional shortcuts. Together, these force genuine long-range reasoning.


CHAPTER 5

Punching Above Weight

A 30-billion-parameter model should not be able to compete with a 235-billion-parameter one. On long-range dependency benchmarks, ACC makes it happen.

In plain English

In weightlifting, a 70 kg athlete who clean-and-jerks 180 kg is more impressive than a 110 kg athlete who lifts the same weight. The lighter lifter has learned to use technique and leverage to compensate for raw mass. ACC does something similar for language models.

The benchmarks are MRCR (Multi-Round Coreference Resolution) — which tests whether the model can resolve references across multiple rounds of dialogue in a long context — and GraphWalks — which tests whether the model can traverse a graph described across many tokens. Both require long-range dependency tracking, not just local pattern matching.

Qwen3-30B-A3B trained with ACC scores 68.3 on MRCR (+18.1 over baseline) and 77.5 on GraphWalks (+7.6). These numbers match or beat Qwen3-235B-A22B, which has nearly 8× more total parameters (235B vs 30B) and roughly 7× more active parameters per token (22B vs 3B, per the model names). Explore the results in the charts below.

+18.1 · MRCR improvement (50.2 → 68.3)

+7.6 · GraphWalks improvement (69.9 → 77.5)

×8 · Parameter advantage matched (30B vs 235B total parameters)
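The headline figures are easy to check by hand. The short sketch below recomputes the gains from the reported scores and the size ratios from the model names, reading Qwen3-30B-A3B as 30B total and 3B active parameters and Qwen3-235B-A22B as 235B total and 22B active. The variable names are illustrative.

```python
# Reported scores: (baseline, ACC-trained) for Qwen3-30B-A3B.
scores = {"MRCR": (50.2, 68.3), "GraphWalks": (69.9, 77.5)}

# Round to one decimal to avoid floating-point noise in the subtraction.
gains = {name: round(acc - base, 1) for name, (base, acc) in scores.items()}

# Size ratios between Qwen3-235B-A22B and Qwen3-30B-A3B.
total_ratio = round(235 / 30, 2)   # total parameters
active_ratio = round(22 / 3, 2)    # active parameters per token

print(gains)
print(f"total: {total_ratio}x, active: {active_ratio}x")
```

The "nearly 8×" figure therefore corresponds to total parameter count; in active parameters the gap is closer to 7×.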

Benchmark Comparison

Compare models across MRCR and GraphWalks benchmarks. Click legend entries to toggle models.

Why this matters

These results suggest that long-range reasoning is not primarily a function of model size — it is a function of training data quality. ACC provides the right training signal at a fraction of the compute cost. A 30B model with the right data can match a 235B model with generic training.


CHAPTER 6

What Generalises, What Doesn't

Long-context training often raises a fear: does specialising for long-range reasoning come at the cost of general intelligence? The short answer: no.

In plain English

Imagine a medical student who spends months specialising in cardiology. You would worry they have forgotten their general training — how to set a bone, treat an infection, read a chart. ACC is the specialisation; GPQA, MMLU-Pro, and AIME are the general checkups.

The ACC-trained model actually improves on GPQA-Diamond (+2.49), MMLU-Pro (+1.50), and AIME'25 (+3.33), while AIME'24 and IFEval stay essentially flat. The gains are small but consistent, and none of the benchmarks show degradation.

The authors check for data leakage by comparing the semantic distribution of training queries against benchmark questions. The average nearest-neighbour cosine similarity stays below 0.36, and a classifier separates training queries from benchmark questions with AUC 0.9986, meaning the two distributions are almost perfectly distinguishable. This indicates that the gains reflect transferable reasoning rather than memorisation.
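A nearest-neighbour similarity check of this kind can be sketched in pure Python. The sketch assumes query embeddings are already available as lists of floats, and it measures from each benchmark question to its closest training query. Both the embedding source and the direction of the comparison are assumptions, since the text above does not specify them.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_nearest_neighbour_similarity(
    train: Sequence[Sequence[float]],
    bench: Sequence[Sequence[float]],
) -> float:
    """Average, over benchmark vectors, of the best cosine match in train."""
    best = [max(cosine(b, t) for t in train) for b in bench]
    return sum(best) / len(best)


# Toy 2-D "embeddings", purely illustrative.
train_vecs = [[1.0, 0.0], [0.0, 1.0]]
bench_vecs = [[1.0, 1.0], [1.0, 0.0]]
score = mean_nearest_neighbour_similarity(train_vecs, bench_vecs)
print(round(score, 4))
```

A low average means that even the closest training query is semantically far from a typical benchmark question, which is the pattern the leakage check looks for.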

General Capability Preservation

Compare baseline vs ACC-trained model across general capability benchmarks. No degradation — slight improvements.

Why this matters

Long-context training does not have to be a zero-sum game. ACC's training signal, reading scattered evidence and producing a reasoning trace, is general enough that it strengthens rather than competes with other capabilities. The model learns to reason better, and that appears to transfer to the general benchmarks tested.


CHAPTER 7

Inside the Model

ACC does not just change what the model knows — it changes how the model thinks. Attention patterns restructure. Experts specialise. And the restructuring is task-specific, not one-size-fits-all.

In plain English

Think of a mixed-martial-arts fighter. In training camp, they do not just get stronger — their body restructures. Different muscle groups develop depending on the opponent. For a grappler, core and grip strengthen. For a striker, shoulders and calves. The adaptation is task-specific.

ACC does something analogous inside the transformer. After training, the model's attention heads restructure differently for GraphWalks vs MRCR. On GraphWalks, attention mass increases at both nearby and far distances (local neighbourhood checks + distant node jumps). On MRCR, it concentrates at nearby distances (scanning and verifying candidate segments).

Meanwhile, the MoE experts specialise. On GraphWalks, several experts share the load for distant tokens. On MRCR, one expert dominates. The three layers showing the biggest changes are completely different between the two tasks. Explore the heatmaps below.

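For layer $l$ and distance bin $b$, the mean attention mass averaged over the layer's $H$ heads is given below, where $m_{l,h,b}$ denotes the attention mass that head $h$ in layer $l$ places on bin $b$: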

$$\mu_{l,b} = \frac{1}{H} \sum_{h=1}^{H} m_{l,h,b}$$

The reported heatmap shows $\Delta_{l,b} = \mu^{\text{SFT}}_{l,b} - \mu^{\text{Base}}_{l,b}$. Positive values indicate increased attention mass at that distance after ACC training.
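The two formulas translate directly into code. The sketch below stores $m_{l,h,b}$ as a nested list indexed `[layer][head][bin]`. The function names and the toy numbers are illustrative only and are not measurements from the paper.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as m[layer][head][bin]


def per_bin_mean(m: Tensor3) -> List[List[float]]:
    """mu[l][b]: attention mass in bin b, averaged over the heads of layer l."""
    result = []
    for layer in m:
        n_heads = len(layer)
        n_bins = len(layer[0])
        result.append(
            [sum(head[b] for head in layer) / n_heads for b in range(n_bins)]
        )
    return result


def attention_delta(m_sft: Tensor3, m_base: Tensor3) -> List[List[float]]:
    """Delta[l][b] = mu_SFT[l][b] - mu_Base[l][b]."""
    mu_sft = per_bin_mean(m_sft)
    mu_base = per_bin_mean(m_base)
    return [
        [s - b for s, b in zip(row_sft, row_base)]
        for row_sft, row_base in zip(mu_sft, mu_base)
    ]


# Toy example: 1 layer, 2 heads, 2 distance bins (near, far).
m_base = [[[0.8, 0.2], [0.6, 0.4]]]
m_sft = [[[0.5, 0.5], [0.7, 0.3]]]
delta = attention_delta(m_sft, m_base)
print([[round(x, 3) for x in row] for row in delta])
```

In the toy example the far bin gains mass after training (positive delta) and the near bin loses the same amount, which is how a shift toward longer-range attention would show up in the heatmap.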

Attention Distance & Expert Routing

Select a task to see how attention and expert routing change after ACC training.

GraphWalks

Graph traversal over extended contexts

Increased attention at both nearby and far distances. Multiple experts activate for distant tokens.

MRCR

Multi-round coreference resolution

Attention concentrates at nearby bins for verification. One dominant expert specialises for scanning.

Why this matters

The model does not develop a single "long-context mode." It develops multiple task-specific strategies and selects the right one at inference time. This flexibility — not just raw capability — is what allows a 30B model to match one nearly 8× its size.


CHAPTER 8

Ablations & Conclusion

Which ingredients actually matter? If you remove one agent type, or strip the distractors, how much do you lose? The ablations tell a clean story: every component contributes, and the whole is greater than the sum.

Ablation Heatmap

See how each training configuration affects MRCR and GraphWalks scores relative to the baseline.

Training Configuration

The full ACC setup compiles 10,802 trajectories (Search: 3,369; SWE: 4,368; SQL: 3,065) with context lengths from 2K to 128K tokens. Training uses sequence length 131,072 tokens, global batch size 16, learning rate $1 \times 10^{-5}$ with cosine schedule, AdamW optimizer ($\beta_1=0.9$, $\beta_2=0.999$, weight decay 0.1), and 4 epochs.
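The reported setup can be collected into a small configuration sketch with a few sanity checks. The dictionary keys are hypothetical and do not belong to any particular training framework. The step count additionally assumes one trajectory per sequence with no packing and a partial final batch kept, neither of which is stated in the source.

```python
import math

# Trajectory counts per agent type, as reported.
trajectories = {"search": 3369, "swe": 4368, "sql": 3065}

# Hyperparameters as reported; key names are illustrative only.
config = {
    "max_seq_len": 131_072,
    "global_batch_size": 16,
    "learning_rate": 1e-5,
    "lr_schedule": "cosine",
    "optimizer": "adamw",
    "betas": (0.9, 0.999),
    "weight_decay": 0.1,
    "epochs": 4,
}

total = sum(trajectories.values())

# Assumption: one example per sequence, last partial batch kept.
steps_per_epoch = math.ceil(total / config["global_batch_size"])
total_steps = steps_per_epoch * config["epochs"]

print(f"{total} trajectories, about {total_steps} optimizer steps")
print(config["max_seq_len"] == 128 * 1024)
```

The three counts add up to the stated 10,802 trajectories, and the sequence length of 131,072 tokens is exactly 128 × 1024, matching the 128K upper bound on context length.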

Why this matters

ACC is deliberately simple — standard SFT on compiled data, no architectural changes, no RL pipeline, no special inference tricks. It complements rather than competes with existing methods. Any team with access to agent trajectories can apply it tomorrow.

The quiet insight of this paper is not that agent trajectories contain useful training signal — it is that the standard way of using them actively discards most of that signal. ACC does not invent new data; it stops throwing away what we already have. The result is a 30B model that reasons over long contexts as well as one nearly eight times its size.

Read the original

arXiv:2605.21850 · arxiv.org/abs/2605.21850
