An Interactive Reading of

CoCoDA: Co-evolving Compositional DAG
for Tool-Augmented Agents

Ziyang Yu, Qiyue Li & Liang Zhao
Emory University · Beijing Normal-Hong Kong Baptist University · May 2026

The paper, in plain English

Give a small language model a toolbox of Python functions and it can solve problems its parametric memory could never handle alone. But there is a catch: every new tool you add makes it harder for the model to find the right tool, because searching a flat list costs more and more tokens. The library needs to grow with the model's experience, yet the search cost must stay inside a fixed context window. Two goals pulling in opposite directions.

CoCoDA resolves the tension with a single data structure: a compositional code DAG. Tools are nodes. Edges encode which tool calls which other tool. Each node stores four layers of metadata — typed signature, description, behavioral contract, and worked examples — from cheapest to most expensive to check. At inference time, retrieval cascades through these layers, pruning aggressively at each stage so the model only pays for tools that survive. At training time, successful solution traces are folded into new composite tools, and the planner is rewarded for using them. The same DAG that keeps retrieval cheap also trains the planner to exploit abstractions.

The headline number: an 8-billion-parameter student matches or exceeds its 32-billion-parameter teacher on GSM8K (93.67 vs 93.40) and MATH (63.18 vs 61.62). Retrieval cost grows sublinearly in library size: at 1,600 nodes, flat retrieval costs 11.4× more tokens than CoCoDA's typed DAG retrieval. Every component matters — removing the DAG structure, the cascade retrieval, or the shaped reward each costs 1–3.5 points of accuracy.

I

The Compositional DAG

Tools organized as a directed acyclic graph where edges are invocation dependencies, not text similarity. Each node carries a four-layer record that retrieval peels from cheap to expensive.

II

Typed DAG Retrieval

A four-stage cascade — signature pruning, semantic ranking, specification filtering, example reranking — that makes retrieval cost sublinear in library size.

III

Co-Evolution

The planner and the library grow together: successful trajectories become composite tools, and a saved-call reward trains the planner to prefer them over primitive sequences.

Chapter 1

The Scaling Trap

A small language model with a tool library can punch above its weight class. But every new tool makes the next query harder to answer. Two forces, pulling in opposite directions.

In plain English

Imagine you run a hardware store with a wall of tools: hammers, saws, drills, each with an instruction card. When you stock 20 tools, any employee can scan the wall and grab the right one. When you stock 2,000, the employee spends more time reading cards than doing the job. The store gets better tools, but finding them gets slower.

That is the tension inside every tool-augmented language model. The model has a fixed-size “attention span” — a context window measured in tokens. Every tool description you show it eats into that budget. Flat retrieval means the model must read every card before it picks.

Drag the slider below to watch how retrieval cost explodes with library size under flat retrieval — and how a structured approach changes the curve.

CoCoDA frames the joint planner–library optimization problem explicitly. Given a planner $\pi_\theta$, a tool library $\mathcal{L} = (V, E, I)$, and a query $q$, the planner samples a trajectory $\tau \sim \pi_\theta(\cdot \mid q, C_q(\mathcal{L}))$ invoking tools $(v_1, \ldots, v_T)$ via an executor. The optimization objective is:

$$ (\pi_\theta^\star, \mathcal{L}^\star) = \arg\max_{\pi_\theta, \mathcal{L}} \; \mathbb{E}_{(q,y) \sim \mathcal{D}_{\text{ev}}} \left[ R_{\text{res}}\bigl(\mathcal{O}_{\pi_\theta, \mathcal{L}}(q)\bigr) - \mathbb{E}_\tau \left[ T(\tau) \cdot \bar{C}_{\text{retr}}(q, \mathcal{L}) \right] \right] $$

where $R_{\text{res}} \in [0,1]$ is the task verifier reward, $T(\tau)$ counts tool calls, and $\bar{C}_{\text{retr}}$ is the average per-call retrieval token cost.

Three structural axes drive cost: (A) per-call retrieval cost $\bar{C}_{\text{retr}}$, which is linear in $|V|$ for flat retrieval; (B) the number of tool calls $T(\tau)$, which composites can shrink; and (C) the per-call context limit $W$, which flat retrieval breaches on any non-trivial library.

Flat vs. Structured Retrieval Cost

Drag the library size slider to see how cost scales. Chart updates live.

Library size (tools) 200

201600

Avg tokens per tool record 120

40300

Flat retrieval cost

24,000

DAG retrieval cost (est.)

2,880

Savings ratio

8.3x

Why this matters

A flat library makes prompt cost linear in library size. CoCoDA's Typed DAG Retrieval makes it sublinear — the more tools you add, the bigger the gap. The system scales into its library rather than drowning in it.

The compositional DAG structure →

Chapter 2

The Compositional DAG

Tools live in a directed acyclic graph, not a flat list. Edges are invocation dependencies; depth measures abstraction. Click a node to see its four-layer record.

In plain English

Think of a restaurant kitchen. At the bottom are raw ingredients — flour, eggs, butter. Above them are prepared components: pie crust, custard, meringue. At the top are complete dishes: lemon meringue pie. Each level builds on the one below. The executive chef doesn't need to know how to separate eggs — they just call “lemon pie” and the subsystem handles it.

CoCoDA's DAG works the same way. Primitive nodes are atomic functions (add two numbers, solve an equation). Composite nodes are macro-actions that invoke other tools. The graph is a dependency map: edges point from a composite to each tool it calls.

Click on the nodes in the interactive diagram below to see each tool's four-layer record — signature, description, specification, and examples.

The library $\mathcal{L}$ is a DAG $G = (V, E)$ where $V = V_p \cup V_c$ partitions tools into primitive nodes (atomic functions, $d(v) = 0$) and composite nodes (macro-actions, $d(v) = 1 + \max_{u \in \text{Ch}(v)} d(u)$). Each node carries a record $I(v) = (L_1, L_2, L_3, L_4)$:

$$ \text{flat}(v) = \begin{cases} 1 & v \in V_p \\ \sum_{u \in \text{Ch}(v)} \text{flat}(u) & v \in V_c \end{cases} \qquad \Phi(v) := \text{flat}(v) - 1 $$

$\text{flat}(v)$ counts primitive leaves in the recursive expansion of $v$; $\Phi(v)$ is the saved-call count — how many primitive calls a composite saves.

Interactive Tool DAG

Click any node to inspect its four-layer record. Composite nodes (navy) invoke children; primitive nodes (teal) are atomic.

Click a node above

Select any tool node to see its typed signature, description, pre/post specifications, and worked examples.

Why this matters

The DAG is not organized by text similarity — it follows real invocation structure. A composite tool for “solve quadratic equation” sits above its constituent primitives because it calls them, not because the descriptions look alike. This is the difference between a topic index and a call graph.

How retrieval traverses the DAG →

Chapter 3

Typed DAG Retrieval

A four-stage cascade that prunes candidates from thousands to one, paying the most expensive checks only on the survivors.

In plain English

Imagine hiring for a job with 1,600 applicants. You don't interview all 1,600. First, you filter by required qualifications (degree, years of experience) — fast, cheap, knocks out most candidates. Then you read cover letters for the survivors. Then you check references. Finally, you do in-person interviews for the last two or three.

CoCoDA's Typed DAG Retrieval does exactly this for tools. L1 checks type signatures (free — no LLM call needed). L2 ranks surviving descriptions semantically. L3 checks behavioral contracts. L4 uses worked examples for final disambiguation. Each stage is more expensive and operates on fewer candidates.

Watch the candidate set shrink through each stage in the simulation below by adjusting the library size and selectivity parameters.

Given a query-induced subgoal, retrieval constructs a nested sequence $V = S_0 \supseteq S_1 \supseteq S_2 \supseteq S_3 \supseteq S_4$:

$$ \begin{aligned} S_1 &= \{v \in V : L_1(v) \text{ unifies with subgoal signature}\} & &\text{(symbolic, free)} \\ S_2 &= \text{Top-}k_2\text{ by } \pi_\theta(\text{intent}, L_2(v)) : v \in S_1 & &\text{(semantic rank)} \\ S_3 &= \{v \in S_2 : L_3(v) \text{ satisfies pre/post of intent}\} & &\text{(contract filter)} \\ S_4 &= \arg\max_{v \in S_3} \pi_\theta(\text{intent}, L_4(v)) & &\text{(example rerank)} \end{aligned} $$

Retrieval Cascade Simulator

Adjust library size and filter selectivity to watch candidates cascade through the four stages.

Library size (nodes) 500

501600

L1 type selectivity 0.30

5%80%

L2 top-k₂ shortlist 10

330

L3 pass rate 0.60

20%95%

Candidates after L1

150

After L2

10

After L3

6

Final (L4)

1

The theoretical guarantee

Theorem 1: $C_{\text{hier}} \leq \alpha_1 \alpha_2 \, C_{\text{flat}} + o(C_{\text{flat}})$ as $n \to \infty$. Retrieval cost is sublinear in library size. Corollary 2: Retrieval time is $O(\tau_1 \log n + (\tau_2 + \tau_3 + \tau_4) k_2)$ vs. $\Theta((\tau_2 + \tau_3 + \tau_4) n)$ for flat retrieval.

How the planner and library grow together →

Chapter 4

Co-Evolution

The planner learns to use new tools. The library learns from the planner's successes. One coupled update, one DAG, one reward signal.

In plain English

A jazz band doesn't rehearse alone. The saxophonist hears the pianist's new voicing and weaves it into their next solo. The drummer catches the bass player's groove and adjusts the ride pattern. Every round of playing makes the group tighter, not just the individual.

CoCoDA works on the same principle. When the planner solves a problem successfully, a fixed teacher model abstracts the solution into candidate composite tools. Each candidate is validated: does it preserve acyclicity? Do its dependencies check out? Only then is it inserted into the DAG. Meanwhile, the planner gets a GRPO update that rewards it for using deeper composites. The planner learns to use what the library just learned to provide.

The simulation below shows how library size and compositional depth evolve together during training.

At iteration $k$, the coupled update rule is:

$$ \pi_{\theta}^{k+1} = \text{GRPO}\!\left(\pi_{\theta}^{k};\, \{(\tau^{(i)}, R(\tau^{(i)}))\}\right), \qquad \mathcal{L}^{k+1} = \text{FOLD}\!\left(\text{INSERT\_TOOL},\, M_T(\mathcal{T}_q^+),\, \mathcal{L}^{k}\right) $$

The planner update is a clipped GRPO step on $R = R_{\text{res}} + \lambda R_{\text{comp}}$. FOLD applies INSERT\_TOOL to each candidate from the teacher, performing ADD/MERGE/REJECT after spec and acyclicity validation.

Co-Evolution Training Dynamics

Adjust the learning rate and reward weight to see how library growth and accuracy respond.

Reward coefficient λ 0.30

0.051.00

Composite insertion rate 0.40

10%80%

Final library size

187

Mean compositional depth

3.2

GSM8K accuracy

92.6%

Why this matters

Theorem 4: Under conservative updates, $J(\pi_{\theta}^{k}, \mathcal{L}^{k})$ is monotonically non-decreasing in $k$. Each co-evolution step makes the system strictly better (or no worse). The library growth saturates — CoCoDA commits to a compact set of reusable abstractions rather than growing without bound.

The reward that makes it work →

Chapter 5

The Saved-Call Reward

Not all trajectories are equal, even when they get the same answer. The DAG knows which one used fewer primitive calls — and rewards it.

In plain English

Two chefs both produce a perfect lemon meringue pie. Chef A walks to the pantry for flour, then sugar, then lemons, then butter — four trips. Chef B grabs the “pie kit” that the sous-chef pre-staged yesterday — one trip. Same pie, but Chef B is faster and more scalable.

CoCoDA's reward captures exactly this. The saved-call reward $R_{\text{comp}}$ counts how many primitive calls a trajectory avoids by using composite tools. A primitive-only trajectory earns $R_{\text{comp}} = 0$. A trajectory that uses a composite subsuming 5 primitives earns $R_{\text{comp}} = 4$. The shaped reward is $R = R_{\text{res}} + \lambda \cdot R_{\text{comp}}$.

Try different trajectory compositions in the simulator below to see how the reward changes.

The cost block $\mathbb{E}_\tau[T(\tau) \cdot \bar{C}_{\text{retr}}]$ reduces to maximizing $-T(\tau)$ over the policy. Since every invocation satisfies $\text{flat}(t_i) - \Phi(t_i) = 1$:

$$ -T(\tau) = \sum_{i=1}^{T(\tau)} \Phi(t_i) - \sum_{i=1}^{T(\tau)} \text{flat}(t_i) = \underbrace{\sum_{i=1}^{T(\tau)} \Phi(t_i)}_{\text{saved-call credit}} - \underbrace{T_{\text{prim}}(\tau)}_{\text{primitive workload}} $$

$T_{\text{prim}}(\tau)$ is invariant under lossless substitution (same computation, different phrasing). Within a GRPO group on the same query, group-mean subtraction absorbs it, leaving only the saved-call credit as the gradient signal: $R_{\text{comp}}(\tau) = \sum_i \Phi(t_i) = \sum_i (\text{flat}(t_i) - 1)$.

Reward Simulator

Add tools to a trajectory. Compare a primitive-only path vs. one that uses composites.

Number of primitive calls 6

115

Composites used 1

05

Avg primitives per composite 4

28

Reward weight λ 0.30

0.051.00

R_comp (primitive path)

0.00

R_comp (composite path)

3.00

Advantage gap

+0.90

The structural guarantee

Theorem 3: If $\tau_c$ replaces a contiguous $m$-step sub-trajectory of $\tau_p$ with a composite $t^\star$ where $\text{flat}(t^\star) \geq m$, then $R(\tau_c) - R(\tau_p) = \lambda \Phi(t^\star) \geq \lambda(m-1) > 0$. Length-cheating is structurally impossible: padding a primitive contributes $\Phi = 0$.

Retrieval cost at scale →

Chapter 6

Retrieval at Scale

At 1,600 tools, flat retrieval costs 11.4× more tokens than CoCoDA. The gap widens with every new tool added to the library.

In plain English

A library with 1,600 books can be organized three ways. Toss them all on a shelf (flat retrieval) — you scan every spine. Cluster them by topic into a summary tree (text-hierarchical RAG) — better, but “topic” is a fuzzy match for code. Or organize them by a call graph: “this function calls that function” (typed DAG retrieval). Now you can prune by type signature first, cheaply, before reading any descriptions.

The paper tests all three on the same library. At 1,600 nodes, flat retrieval burns 11.4× the tokens of typed DAG retrieval. A RAPTOR-style text hierarchy sits in the middle at 4.7×. The gap exists because code has properties that natural text does not: typed signatures, invocation edges, and behavioral contracts.

Drag the slider below to sweep library size and see the three curves diverge.

Retrieval Cost: Three Methods Compared

Sweep the library size to see how flat retrieval, text-hierarchical RAG, and typed DAG retrieval scale differently.

Max library size 800

1001600

Tokens per tool (flat) 150

50300

Flat cost at max size

120,000

Text-RAG cost

49,500

TDR cost (ours)

10,530

Flat / TDR ratio

11.4x

Why this matters

The three code-specificity claims — typed prune (L1 is free), edge-guided expansion (climb/descend the DAG), behavior-preserving substitution (one composite token replaces a subtree) — are what separate typed DAG retrieval from any text-based hierarchy. You cannot get this gap from embedding similarity alone.

Benchmark results and ablation →

Chapter 7

Results & Ablation

An 8B student matches a 32B teacher on math. Every component of CoCoDA earns its place. The biggest ablation drop comes from removing the DAG itself.

In plain English

Imagine a 12-year-old with a well-organized reference binder beating a college graduate with no notes on a calculus exam. That is what CoCoDA does: an 8-billion-parameter model with a co-evolved tool library matches a 32-billion-parameter model relying on parametric memory alone.

On GSM8K (grade-school math), the 8B CoCoDA student scores 93.67% vs. the 32B teacher's 93.40%. On MATH (competition-level), it scores 63.18% vs. 61.62%. The gains are largest at the smallest model sizes: the 0.6B student jumps +10.7 points on GSM8K over plain chain-of-thought.

The ablation study dissects why. Removing the DAG structure (flattening to a text list) costs the most: −3.49 points on average. Removing typed retrieval costs −1.79. Removing the shaped reward costs −1.40. Removing all three drops to barely above plain CoT. Use the interactive chart below to compare.

Benchmark Comparison

Select a model size and a benchmark category to compare methods.

Student model size

Ablation: What Happens When You Remove Each Component?

Chart shows accuracy drop from removing each of CoCoDA's three core components (4B student, averaged across benchmarks).

w/o DAG (flat library)

−3.49

w/o TDR (flat retrieval)

−1.79

w/o GAR (exec-only reward)

−1.40

w/o all three

−5.18

Why this matters

The DAG structure is the single biggest contributor. Without it, the system collapses to a CREATOR/TroVE-style flat collection and the planner must re-derive every intermediate. The retrieval cascade and shaped reward are complementary: removing both costs more than the sum of removing each individually. All three components are necessary; none is sufficient.

CoCoDA: Co-evolving Compositional DAGfor Tool-Augmented Agents

The Scaling Trap

Flat vs. Structured Retrieval Cost

The Compositional DAG

Interactive Tool DAG

Click a node above

Typed DAG Retrieval

Retrieval Cascade Simulator

Co-Evolution

Co-Evolution Training Dynamics

The Saved-Call Reward

Reward Simulator

Retrieval at Scale

Retrieval Cost: Three Methods Compared

Results & Ablation

Benchmark Comparison

Ablation: What Happens When You Remove Each Component?

CoCoDA: Co-evolving Compositional DAG
for Tool-Augmented Agents