An Interactive Reading of

CoCoDA: Co-evolving Compositional DAG
for Tool-Augmented Agents

The paper, in plain English

Give a small language model a toolbox of Python functions and it can solve problems its parametric memory could never handle alone. But there is a catch: every new tool you add makes it harder for the model to find the right tool, because searching a flat list costs more and more tokens. The library needs to grow with the model's experience, yet the search cost must stay inside a fixed context window. Two goals pulling in opposite directions.

CoCoDA resolves the tension with a single data structure: a compositional code DAG. Tools are nodes. Edges encode which tool calls which other tool. Each node stores four layers of metadata — typed signature, description, behavioral contract, and worked examples — from cheapest to most expensive to check. At inference time, retrieval cascades through these layers, pruning aggressively at each stage so the model only pays for tools that survive. At training time, successful solution traces are folded into new composite tools, and the planner is rewarded for using them. The same DAG that keeps retrieval cheap also trains the planner to exploit abstractions.

The headline number: an 8-billion-parameter student matches or exceeds its 32-billion-parameter teacher on GSM8K (93.67 vs 93.40) and MATH (63.18 vs 61.62). Retrieval cost grows sublinearly in library size: at 1,600 nodes, flat retrieval costs 11.4× more tokens than CoCoDA's typed DAG retrieval. Every component matters — removing the DAG structure, the cascade retrieval, or the shaped reward each costs 1–3.5 points of accuracy.

I
The Compositional DAG
Tools organized as a directed acyclic graph where edges are invocation dependencies, not text similarity. Each node carries a four-layer record that retrieval peels from cheap to expensive.
II
Typed DAG Retrieval
A four-stage cascade — signature pruning, semantic ranking, specification filtering, example reranking — that makes retrieval cost sublinear in library size.
III
Co-Evolution
The planner and the library grow together: successful trajectories become composite tools, and a saved-call reward trains the planner to prefer them over primitive sequences.
Chapter 1

The Scaling Trap

A small language model with a tool library can punch above its weight class. But every new tool makes the next query harder to answer. Two forces, pulling in opposite directions.

CoCoDA frames the joint planner–library optimization problem explicitly. Given a planner $\pi_\theta$, a tool library $\mathcal{L} = (V, E, I)$, and a query $q$, the planner samples a trajectory $\tau \sim \pi_\theta(\cdot \mid q, C_q(\mathcal{L}))$ invoking tools $(v_1, \ldots, v_T)$ via an executor. The optimization objective is:

$$ (\pi_\theta^\star, \mathcal{L}^\star) = \arg\max_{\pi_\theta, \mathcal{L}} \; \mathbb{E}_{(q,y) \sim \mathcal{D}_{\text{ev}}} \left[ R_{\text{res}}\bigl(\mathcal{O}_{\pi_\theta, \mathcal{L}}(q)\bigr) - \mathbb{E}_\tau \left[ T(\tau) \cdot \bar{C}_{\text{retr}}(q, \mathcal{L}) \right] \right] $$

where $R_{\text{res}} \in [0,1]$ is the task verifier reward, $T(\tau)$ counts tool calls, and $\bar{C}_{\text{retr}}$ is the average per-call retrieval token cost.

Three structural axes drive cost: (A) per-call retrieval cost $\bar{C}_{\text{retr}}$, which is linear in $|V|$ for flat retrieval; (B) the number of tool calls $T(\tau)$, which composites can shrink; and (C) the per-call context limit $W$, which flat retrieval breaches on any non-trivial library.

Flat vs. Structured Retrieval Cost

Drag the library size slider to see how cost scales. Chart updates live.

201600
40300
Flat retrieval cost
24,000
DAG retrieval cost (est.)
2,880
Savings ratio
8.3x
Why this matters
A flat library makes prompt cost linear in library size. CoCoDA's Typed DAG Retrieval makes it sublinear — the more tools you add, the bigger the gap. The system scales into its library rather than drowning in it.
The compositional DAG structure
Chapter 2

The Compositional DAG

Tools live in a directed acyclic graph, not a flat list. Edges are invocation dependencies; depth measures abstraction. Click a node to see its four-layer record.

The library $\mathcal{L}$ is a DAG $G = (V, E)$ where $V = V_p \cup V_c$ partitions tools into primitive nodes (atomic functions, $d(v) = 0$) and composite nodes (macro-actions, $d(v) = 1 + \max_{u \in \text{Ch}(v)} d(u)$). Each node carries a record $I(v) = (L_1, L_2, L_3, L_4)$:

$$ \text{flat}(v) = \begin{cases} 1 & v \in V_p \\ \sum_{u \in \text{Ch}(v)} \text{flat}(u) & v \in V_c \end{cases} \qquad \Phi(v) := \text{flat}(v) - 1 $$

$\text{flat}(v)$ counts primitive leaves in the recursive expansion of $v$; $\Phi(v)$ is the saved-call count — how many primitive calls a composite saves.

Interactive Tool DAG

Click any node to inspect its four-layer record. Composite nodes (navy) invoke children; primitive nodes (teal) are atomic.

Click a node above

Select any tool node to see its typed signature, description, pre/post specifications, and worked examples.
Why this matters
The DAG is not organized by text similarity — it follows real invocation structure. A composite tool for “solve quadratic equation” sits above its constituent primitives because it calls them, not because the descriptions look alike. This is the difference between a topic index and a call graph.
How retrieval traverses the DAG
Chapter 3

Typed DAG Retrieval

A four-stage cascade that prunes candidates from thousands to one, paying the most expensive checks only on the survivors.

Given a query-induced subgoal, retrieval constructs a nested sequence $V = S_0 \supseteq S_1 \supseteq S_2 \supseteq S_3 \supseteq S_4$:

$$ \begin{aligned} S_1 &= \{v \in V : L_1(v) \text{ unifies with subgoal signature}\} & &\text{(symbolic, free)} \\ S_2 &= \text{Top-}k_2\text{ by } \pi_\theta(\text{intent}, L_2(v)) : v \in S_1 & &\text{(semantic rank)} \\ S_3 &= \{v \in S_2 : L_3(v) \text{ satisfies pre/post of intent}\} & &\text{(contract filter)} \\ S_4 &= \arg\max_{v \in S_3} \pi_\theta(\text{intent}, L_4(v)) & &\text{(example rerank)} \end{aligned} $$

Retrieval Cascade Simulator

Adjust library size and filter selectivity to watch candidates cascade through the four stages.

501600
5%80%
330
20%95%
Candidates after L1
150
After L2
10
After L3
6
Final (L4)
1
The theoretical guarantee
Theorem 1: $C_{\text{hier}} \leq \alpha_1 \alpha_2 \, C_{\text{flat}} + o(C_{\text{flat}})$ as $n \to \infty$. Retrieval cost is sublinear in library size. Corollary 2: Retrieval time is $O(\tau_1 \log n + (\tau_2 + \tau_3 + \tau_4) k_2)$ vs. $\Theta((\tau_2 + \tau_3 + \tau_4) n)$ for flat retrieval.
How the planner and library grow together
Chapter 4

Co-Evolution

The planner learns to use new tools. The library learns from the planner's successes. One coupled update, one DAG, one reward signal.

At iteration $k$, the coupled update rule is:

$$ \pi_{\theta}^{k+1} = \text{GRPO}\!\left(\pi_{\theta}^{k};\, \{(\tau^{(i)}, R(\tau^{(i)}))\}\right), \qquad \mathcal{L}^{k+1} = \text{FOLD}\!\left(\text{INSERT\_TOOL},\, M_T(\mathcal{T}_q^+),\, \mathcal{L}^{k}\right) $$

The planner update is a clipped GRPO step on $R = R_{\text{res}} + \lambda R_{\text{comp}}$. FOLD applies INSERT\_TOOL to each candidate from the teacher, performing ADD/MERGE/REJECT after spec and acyclicity validation.

Co-Evolution Training Dynamics

Adjust the learning rate and reward weight to see how library growth and accuracy respond.

0.051.00
10%80%
Final library size
187
Mean compositional depth
3.2
GSM8K accuracy
92.6%
Why this matters
Theorem 4: Under conservative updates, $J(\pi_{\theta}^{k}, \mathcal{L}^{k})$ is monotonically non-decreasing in $k$. Each co-evolution step makes the system strictly better (or no worse). The library growth saturates — CoCoDA commits to a compact set of reusable abstractions rather than growing without bound.
The reward that makes it work
Chapter 5

The Saved-Call Reward

Not all trajectories are equal, even when they get the same answer. The DAG knows which one used fewer primitive calls — and rewards it.

The cost block $\mathbb{E}_\tau[T(\tau) \cdot \bar{C}_{\text{retr}}]$ reduces to maximizing $-T(\tau)$ over the policy. Since every invocation satisfies $\text{flat}(t_i) - \Phi(t_i) = 1$:

$$ -T(\tau) = \sum_{i=1}^{T(\tau)} \Phi(t_i) - \sum_{i=1}^{T(\tau)} \text{flat}(t_i) = \underbrace{\sum_{i=1}^{T(\tau)} \Phi(t_i)}_{\text{saved-call credit}} - \underbrace{T_{\text{prim}}(\tau)}_{\text{primitive workload}} $$

$T_{\text{prim}}(\tau)$ is invariant under lossless substitution (same computation, different phrasing). Within a GRPO group on the same query, group-mean subtraction absorbs it, leaving only the saved-call credit as the gradient signal: $R_{\text{comp}}(\tau) = \sum_i \Phi(t_i) = \sum_i (\text{flat}(t_i) - 1)$.

Reward Simulator

Add tools to a trajectory. Compare a primitive-only path vs. one that uses composites.

115
05
28
0.051.00
Rcomp (primitive path)
0.00
Rcomp (composite path)
3.00
Advantage gap
+0.90
The structural guarantee
Theorem 3: If $\tau_c$ replaces a contiguous $m$-step sub-trajectory of $\tau_p$ with a composite $t^\star$ where $\text{flat}(t^\star) \geq m$, then $R(\tau_c) - R(\tau_p) = \lambda \Phi(t^\star) \geq \lambda(m-1) > 0$. Length-cheating is structurally impossible: padding a primitive contributes $\Phi = 0$.
Retrieval cost at scale
Chapter 6

Retrieval at Scale

At 1,600 tools, flat retrieval costs 11.4× more tokens than CoCoDA. The gap widens with every new tool added to the library.

Retrieval Cost: Three Methods Compared

Sweep the library size to see how flat retrieval, text-hierarchical RAG, and typed DAG retrieval scale differently.

1001600
50300
Flat cost at max size
120,000
Text-RAG cost
49,500
TDR cost (ours)
10,530
Flat / TDR ratio
11.4x
Why this matters
The three code-specificity claims — typed prune (L1 is free), edge-guided expansion (climb/descend the DAG), behavior-preserving substitution (one composite token replaces a subtree) — are what separate typed DAG retrieval from any text-based hierarchy. You cannot get this gap from embedding similarity alone.
Benchmark results and ablation
Chapter 7

Results & Ablation

An 8B student matches a 32B teacher on math. Every component of CoCoDA earns its place. The biggest ablation drop comes from removing the DAG itself.

Benchmark Comparison

Select a model size and a benchmark category to compare methods.

Ablation: What Happens When You Remove Each Component?

Chart shows accuracy drop from removing each of CoCoDA's three core components (4B student, averaged across benchmarks).

w/o DAG (flat library)
−3.49
w/o TDR (flat retrieval)
−1.79
w/o GAR (exec-only reward)
−1.40
w/o all three
−5.18
Why this matters
The DAG structure is the single biggest contributor. Without it, the system collapses to a CREATOR/TroVE-style flat collection and the planner must re-derive every intermediate. The retrieval cascade and shaped reward are complementary: removing both costs more than the sum of removing each individually. All three components are necessary; none is sufficient.