An Interactive Reading of

ACC: Compiling Agent Trajectories for Long-Context Training

The paper, in plain English

When an AI agent solves a problem, it hunts through web pages, reads source code, and queries databases across many turns. The evidence it needs to answer the original question is scattered across all those turns. Yet the standard training recipe discards most of that evidence: tool responses are masked out of the loss. The model learns which button to press next, but never learns to connect distant clues into a coherent answer.

Agent Context Compilation (ACC) is a disarmingly simple fix. Instead of training turn-by-turn, ACC gathers every tool response and environment observation the agent collected, shuffles them into one long context window alongside the original question, and trains the model to answer directly from that compiled evidence. Think of it as giving a detective all the case files at once, rather than handing them one document at a time and asking what to search for next.

The result: a 30-billion-parameter model trained with ACC matches a 235-billion-parameter model on MRCR and GraphWalks, two demanding long-range reasoning benchmarks. On MRCR, the score jumps from 50.2 to 68.3 (+18.1). On GraphWalks, from 69.9 to 77.5 (+7.6). General capabilities like GPQA and MMLU-Pro hold steady. Under the hood, the model develops task-specific attention patterns: different experts specialise for different reasoning tasks, a flexibility not seen in the baseline.

I
The Supervision Blind Spot
Standard agent SFT masks tool responses, so gradient signals from the final answer decay through intermediate turns. Evidence is collected but never used for learning.
II
Agent Context Compilation
Gather all tool responses into one compiled context. Train the model to answer directly from the assembled evidence — no intermediate turns, no gradient attenuation.
III
Task-Adaptive Restructuring
After ACC training, the model develops different attention and expert routing patterns for different tasks — flexible adaptation rather than a one-size-fits-all fix.
~ 25 minutes · 9 chapters · 8 interactive simulations
CHAPTER 1

The Supervision Blind Spot

A detective collects fingerprints from six rooms, photographs three crime scenes, and interviews ten witnesses. But during debriefing, the sergeant only asks: "Which room did you search next?" Nobody ever asks what the detective found. That is how we currently train AI agents.

An agent trajectory consists of $k-1$ interaction turns followed by a final answer turn:

$$\tau = (q,\; (r_1, a_1, o_1),\; \ldots,\; (r_{k-1}, a_{k-1}, o_{k-1}),\; (r_k, y))$$

where $r_t$ is the reasoning at turn $t$, $a_t$ is the action, $o_t$ is the tool response (observation), and $(r_k, y)$ is the final reasoning-answer pair. The standard agent SFT objective supervises only the model-generated tokens:

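$$\mathcal{L}_{\text{agent}} = -\sum_{t=1}^{k} \sum_{j \in I_t} \log P(\text{token}_j \mid H_{<t},\, \text{token}_{<j})$$

where $H_{<t}$ is the interaction history before turn $t$ (the query plus all earlier reasoning, actions, and observations), $\text{token}_{<j}$ are the tokens already generated within the current turn, $I_t = r_t \cup a_t$ for $t < k$, and $I_k = r_k \cup y$. The observations $o_t$ are excluded from the loss. Grouping by turn reveals two components: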

$$\mathcal{L}_{\text{agent}} = \underbrace{-\sum_{t=1}^{k-1} \sum_{j \in r_t \cup a_t} \log P(\text{token}_j \mid \cdots)}_{\text{local next-tool selection}} + \underbrace{-\sum_{j \in r_k \cup y} \log P(\text{token}_j \mid H_{<k},\, \text{token}_{<j})}_{\text{final answer}}$$

The first $k-1$ terms supervise only local reasoning and tool selection. The observations $o_t$ appear only as conditioning context, so any gradient relevant to the final answer $y$ must back-propagate through a long chain of intermediate turns before it reaches $o_t$, and it arrives heavily weakened. This is the supervision blind spot.
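
To make the masking concrete, here is a minimal Python sketch of how a loss mask might be built for standard agent SFT. The `Turn` class and the `agent_sft_mask` function are hypothetical names introduced for illustration, and the integer token ids are toy values; this is not the paper's data pipeline, only one plausible way to express "supervise $r_t$ and $a_t$, skip $o_t$".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    reasoning: List[int]    # token ids for r_t
    action: List[int]       # token ids for a_t (holds the answer y on the final turn)
    observation: List[int]  # token ids for o_t (empty on the final turn)


def agent_sft_mask(query: List[int], turns: List[Turn]) -> Tuple[List[int], List[int]]:
    """Return (tokens, loss_mask); the mask is 1 only on model-generated tokens."""
    tokens = list(query)
    mask = [0] * len(query)  # the query is context, not a training target
    for turn in turns:
        generated = turn.reasoning + turn.action
        tokens += generated
        mask += [1] * len(generated)          # r_t and a_t (or r_k and y) are supervised
        tokens += turn.observation
        mask += [0] * len(turn.observation)   # o_t is excluded from the loss
    return tokens, mask


# Toy trajectory: two tool turns followed by a final answer turn.
query = [1, 2]
turns = [
    Turn(reasoning=[10, 11], action=[12], observation=[100, 101, 102, 103]),
    Turn(reasoning=[20], action=[21], observation=[200, 201, 202]),
    Turn(reasoning=[30, 31], action=[32], observation=[]),  # action holds y here
]
tokens, mask = agent_sft_mask(query, turns)
supervised = sum(mask)
observed = sum(len(t.observation) for t in turns)
print(f"{supervised} of {len(tokens)} tokens supervised; {observed} observation tokens masked")
```

In this toy trajectory, 8 of the 17 tokens carry a loss term and none of the 7 observation tokens do, which is the blind spot in miniature.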

Gradient Attenuation Explorer

See how the gradient signal from the final answer decays as it passes through masked observations. Drag the sliders to explore.

[Interactive simulation: two sliders, one running from 0.10 (slow) to 0.95 (near-lossless). Readouts: signal reaching the turn 1 observation, effective supervision ratio, blind spot severity.]
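
For readers without the interactive version, the intuition can be approximated with a toy calculation. The sketch below assumes, purely for illustration, that each intermediate turn multiplies the answer's gradient signal by a constant `retention` factor; this geometric model and the function name `signal_at_turn` are assumptions of this sketch, not a formula taken from the paper.

```python
def signal_at_turn(t: int, k: int, retention: float) -> float:
    """Toy estimate of answer-gradient strength reaching the observation at turn t.

    Assumes (hypothetically) that every turn between t and the final turn k
    multiplies the signal by a constant factor `retention` in (0, 1].
    """
    if not 1 <= t <= k:
        raise ValueError("t must satisfy 1 <= t <= k")
    return retention ** (k - t)


k = 10  # total number of turns, including the final answer turn
for retention in (0.5, 0.95):
    first = signal_at_turn(1, k, retention)
    print(f"retention={retention:.2f}: signal at turn 1 = {first:.4f}")
```

Under this assumption, a ten-turn trajectory with a retention of 0.5 leaves about 0.2% of the signal at the first observation, while a retention of 0.95 leaves about 63%. The point is qualitative: even mild per-turn loss compounds quickly as trajectories get longer.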
Why this matters
The blind spot means that even when an agent collects exactly the right evidence, the model never receives a training signal telling it that evidence was important. It learns to select tools but not to reason over the evidence those tools return. This is the core problem ACC solves.
Next: Three agent worlds, three kinds of evidence
CHAPTER 2

Three Agents, Three Worlds

Not all evidence is created equal. A web page reads differently from a database table, which reads differently from a Python source file. ACC works across all three — and their diversity turns out to be crucial.

ACC applies to three representative agent classes, each producing trajectories with distinct evidence structures:

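Search Agent
Searches and reads web pages to answer multi-hop questions. Evidence is prose drawn from the retrieved pages.
Trajectories 3,369   Evidence Web pages
SWE Agent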
Inspects source files to locate and fix bugs. Evidence is source code: function bodies, class definitions, and import chains. The compiled context includes both the files involved in the correct patch and unopened files that act as distractors.
Trajectories 4,368   Evidence Source code
SQL Agent
Queries relational tables for structured analytics. Evidence is tabular data: rows and columns encoding multi-hop graph structures. The compiled context contains the complete contents of all queried tables.
Trajectories 3,065   Evidence Database tables

Trajectory Length Distribution

Each agent type produces trajectories of different lengths. Click the legend to toggle agent types.

Why this matters
The three agent types provide complementary training signal. Search data teaches multi-hop reasoning over prose. SQL data teaches relational graph traversal. SWE data teaches code comprehension. The full mixture outperforms any single type alone.
Next: The ACC method — closing the blind spot
CHAPTER 3

Agent Context Compilation

If the problem is that evidence is scattered and the gradient cannot reach it, the solution is obvious in hindsight: gather all the evidence into one place, then train directly on it.

ACC converts each trajectory into a training example $\tau_i \mapsto (x_i, y_i, r_i)$, producing a dataset $\mathcal{D} = \{(x_i, y_i, r_i)\}_{i=1}^{M}$. Here $x_i = (q_i, C_i)$ combines the original query with the compiled context, $y_i$ is the final answer, and $r_i$ is the reasoning trace. The new training objective:

$$\mathcal{L}_{\text{ACC}} = -\sum_{j \in r \cup y} \log P(\text{token}_j \mid q, C, \text{token}_{<j})$$

Unlike $\mathcal{L}_{\text{agent}}$, this objective contains no intermediate action terms. The final answer supervision reaches every evidence token directly without being filtered through turn-level tool selection.
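
As a rough illustration of what this objective means at the token level, the sketch below lays out one ACC example as a flat sequence with a loss mask. The function name `acc_example` and the integer token ids are hypothetical; the only thing taken from the text is the layout, with the query $q$ and compiled context $C$ as unsupervised prompt and the reasoning $r$ and answer $y$ as targets.

```python
from typing import List, Sequence, Tuple


def acc_example(
    query: Sequence[int],
    context: Sequence[int],
    reasoning: Sequence[int],
    answer: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Return (tokens, loss_mask) for a single compiled training example."""
    prompt = list(query) + list(context)      # q and C: conditioning only
    targets = list(reasoning) + list(answer)  # r and y: every token is supervised
    tokens = prompt + targets
    loss_mask = [0] * len(prompt) + [1] * len(targets)
    return tokens, loss_mask


tokens, mask = acc_example(
    query=[1, 2],
    context=[100, 101, 102, 103, 200, 201, 202],  # all evidence, in one block
    reasoning=[30, 31],
    answer=[32],
)
first_supervised = mask.index(1)
print(f"evidence ends at position {first_supervised - 1}, supervision starts at {first_supervised}")
```

Every evidence token sits in the prompt ahead of the first supervised token, so the answer's loss conditions on all of the evidence in a single forward pass, with no intermediate action tokens in between.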

Loss Comparison: Standard SFT vs ACC

See how gradient signal flows differently through the two training objectives. Drag the slider to change the number of trajectory turns.

Why this matters
ACC's loss function is strictly simpler than standard agent SFT — fewer terms, no intermediate supervision, direct gradient paths. Simplicity is the point. By removing the turn-level structure that filters gradient signal, ACC ensures that every piece of evidence receives meaningful supervision from the final answer.
Next: How the long context is actually built
CHAPTER 4

Building the Long Context

Gathering evidence is only half the battle. How you arrange it — and what you leave out — determines whether the model learns to find signals in noise or just memorises positions.

For each trajectory, ACC extracts structured evidence pieces $\text{Evi}(\tau) = [e_1, \ldots, e_m]$ and applies a random permutation $\pi$ over $\{1, \ldots, m\}$:

$$C_i = \text{Concat}(e_{\pi(1)}, e_{\pi(2)}, \ldots, e_{\pi(m)}), \quad |C_i| \leq B$$

where $B$ is the token budget (131,072 tokens). Because evidence pieces are self-contained, shuffling forces the model to locate relevant information via semantic association rather than sequential position. Distractors are added to increase difficulty.
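
The assembly step can be sketched in a few lines of Python. Several details here are assumptions rather than facts from the paper: the names `compile_context` and `count_tokens` are made up, whitespace word counts stand in for a real tokenizer, distractors are shuffled together with the gold evidence, and the budget is enforced by skipping any distractor that would not fit.

```python
import random
from typing import List, Sequence


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def compile_context(
    evidence: Sequence[str],
    distractors: Sequence[str],
    budget: int,
    seed: int = 0,
) -> List[str]:
    """Shuffle evidence and as many distractors as fit within `budget` tokens."""
    used = sum(count_tokens(e) for e in evidence)
    if used > budget:
        raise ValueError("evidence alone exceeds the token budget")
    pieces = list(evidence)
    for d in distractors:
        cost = count_tokens(d)
        if used + cost <= budget:  # assumed policy: skip distractors that overflow
            pieces.append(d)
            used += cost
    rng = random.Random(seed)
    rng.shuffle(pieces)  # random permutation, so position carries no signal
    return pieces


evidence = ["alpha beta gamma", "delta epsilon"]
distractors = ["noise one two three", "x y"]
context = compile_context(evidence, distractors, budget=8, seed=42)
total = sum(count_tokens(p) for p in context)
print(context, total)
```

With a budget of 8 toy tokens, the four-word distractor is skipped and the two-word one is kept, giving three pieces and 7 tokens. In the real setting the budget $B$ would be 131,072 tokens and the count would come from the model's tokenizer.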

Context Assembly Simulator

Watch how ACC assembles a compiled context from a search trajectory. Evidence pieces are shuffled; distractors are mixed in. Click "Shuffle" to randomise the order.

Why this matters
The ablation study shows that removing distractors from Search and SWE lowers MRCR by 3.3 and 3.8 points respectively. The model needs noise in training to learn evidence localisation. Shuffling prevents positional shortcuts. Together, these force genuine long-range reasoning.
Next: The headline results — 30B matches 235B
CHAPTER 5

Punching Above Weight

A 30-billion-parameter model should not be able to compete with a 235-billion-parameter one. On long-range dependency benchmarks, ACC makes it happen.

+18.1
MRCR improvement
(50.2 → 68.3)
+7.6
GraphWalks improvement
(69.9 → 77.5)
≈8×
Parameter gap closed
(30B vs 235B parameters)

Benchmark Comparison

Compare models across MRCR and GraphWalks benchmarks. Click legend entries to toggle models.

Why this matters
These results suggest that long-range reasoning is not primarily a function of model size — it is a function of training data quality. ACC provides the right training signal at a fraction of the compute cost. A 30B model with the right data can match a 235B model with generic training.
Next: Does long-context training hurt general abilities?
CHAPTER 6

What Generalises, What Doesn't

Long-context training often raises a fear: does specialising for long-range reasoning come at the cost of general intelligence? The short answer: no.

General Capability Preservation

Compare baseline vs ACC-trained model across general capability benchmarks. No degradation — slight improvements.

Why this matters
Long-context training does not have to be a zero-sum game. ACC's training signal (reading scattered evidence and producing a reasoning trace) is general enough that it strengthens rather than competes with other capabilities. The model learns to reason over evidence more carefully, and on the general benchmarks reported here that gain does not come at the expense of other skills.
Next: What happens inside the model?
CHAPTER 7

Inside the Model

ACC does not just change what the model knows — it changes how the model thinks. Attention patterns restructure. Experts specialise. And the restructuring is task-specific, not one-size-fits-all.

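Let $m_{l,h,b}$ denote the attention mass that head $h$ in layer $l$ places on distance bin $b$, and let $H$ be the number of heads per layer. The per-layer, per-bin mean attention is:

$$\mu_{l,b} = \frac{1}{H} \sum_{h=1}^{H} m_{l,h,b}$$

The reported heatmap shows $\Delta_{l,b} = \mu^{\text{SFT}}_{l,b} - \mu^{\text{Base}}_{l,b}$, where SFT denotes the ACC-trained model and Base the model before ACC training. Positive values indicate increased attention mass at that distance after ACC training.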
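
A small worked example may help when reading the heatmap. The sketch below computes the head-averaged attention per bin and the difference between two models in plain Python. The function names and the numbers are invented for illustration (one layer, two heads, three distance bins labelled near, mid, far); they are not measurements from the paper.

```python
from typing import List

Mass = List[List[List[float]]]  # indexed as m[layer][head][bin]


def per_bin_mean(m: Mass) -> List[List[float]]:
    """Average over heads: mu[l][b] = (1/H) * sum over h of m[l][h][b]."""
    mu = []
    for layer in m:
        n_heads = len(layer)
        n_bins = len(layer[0])
        mu.append([sum(head[b] for head in layer) / n_heads for b in range(n_bins)])
    return mu


def attention_delta(m_sft: Mass, m_base: Mass) -> List[List[float]]:
    """Delta[l][b] = mu_sft[l][b] - mu_base[l][b]."""
    mu_sft, mu_base = per_bin_mean(m_sft), per_bin_mean(m_base)
    return [[s - b for s, b in zip(row_s, row_b)] for row_s, row_b in zip(mu_sft, mu_base)]


# Made-up attention mass: one layer, two heads, bins = (near, mid, far).
m_base = [[[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]]
m_sft = [[[0.5, 0.1, 0.4], [0.5, 0.2, 0.3]]]
delta = attention_delta(m_sft, m_base)
print([round(x, 3) for x in delta[0]])
```

Here the far bin gains 0.25 of attention mass while the near bin loses 0.2, so the corresponding heatmap cell for the far bin would be positive. Because each head's mass sums to 1 in this toy setup, the deltas across bins sum to zero.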

Attention Distance & Expert Routing

Select a task to see how attention and expert routing change after ACC training.

GraphWalks
Graph traversal over extended contexts
Increased attention at both nearby and far distances. Multiple experts activate for distant tokens.
MRCR
Multi-round coreference resolution
Attention concentrates at nearby bins for verification. One dominant expert specialises for scanning.
Why this matters
The model does not develop a single "long-context mode." It develops multiple task-specific strategies and selects the right one at inference time. This flexibility — not just raw capability — is what allows a 30B model to match one nearly 8× its size.
Next: Ablations, contributions, and where to go from here
CHAPTER 8

Ablations & Conclusion

Which ingredients actually matter? If you remove one agent type, or strip the distractors, how much do you lose? The ablations tell a clean story: every component contributes, and the whole is greater than the sum.

Ablation Heatmap

See how each training configuration affects MRCR and GraphWalks scores relative to the baseline.

Training Configuration

The full ACC setup compiles 10,802 trajectories (Search: 3,369; SWE: 4,368; SQL: 3,065) with context lengths from 2K to 128K tokens. Training uses sequence length 131,072 tokens, global batch size 16, learning rate $1 \times 10^{-5}$ with cosine schedule, AdamW optimizer ($\beta_1=0.9$, $\beta_2=0.999$, weight decay 0.1), and 4 epochs.
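
The hyperparameters above can be collected into a small configuration object, which also makes it easy to estimate the length of the run. The sketch below is an illustration, not the authors' training script: `TrainConfig`, `total_steps`, and `cosine_lr` are hypothetical names, and it assumes one compiled example per sequence (no packing), a kept final partial batch, no warmup, and cosine decay to zero, none of which is stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    num_examples: int = 10_802      # 3,369 Search + 4,368 SWE + 3,065 SQL
    seq_len: int = 131_072
    global_batch_size: int = 16
    peak_lr: float = 1e-5
    beta1: float = 0.9
    beta2: float = 0.999
    weight_decay: float = 0.1
    epochs: int = 4


def total_steps(cfg: TrainConfig) -> int:
    # Assumes one example per sequence and that the last partial batch is kept.
    return math.ceil(cfg.num_examples / cfg.global_batch_size) * cfg.epochs


def cosine_lr(step: int, total: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr (no warmup, an assumption of this sketch)."""
    progress = min(step, total) / total
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


cfg = TrainConfig()
steps = total_steps(cfg)
print(steps, cosine_lr(0, steps, cfg.peak_lr), cosine_lr(steps // 2, steps, cfg.peak_lr))
```

Under these assumptions the run comes to 676 optimiser steps per epoch and 2,704 steps in total, with the learning rate falling to half its peak at the midpoint.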

Why this matters
ACC is deliberately simple — standard SFT on compiled data, no architectural changes, no RL pipeline, no special inference tricks. It complements rather than competes with existing methods. Any team with access to agent trajectories can apply it tomorrow.
The quiet insight of this paper is not that agent trajectories contain useful training signal — it is that the standard way of using them actively discards most of that signal. ACC does not invent new data; it stops throwing away what we already have. The result is a 30B model that reasons over long contexts as well as one nearly eight times its size.
Read the original
arXiv:2605.21850 · arxiv.org/abs/2605.21850