Give a small language model a toolbox of Python functions and it can solve problems its parametric memory could never handle alone. But there is a catch: every new tool you add makes it harder for the model to find the right tool, because searching a flat list costs more and more tokens. The library needs to grow with the model's experience, yet the search cost must stay inside a fixed context window. Two goals pulling in opposite directions.
CoCoDA resolves the tension with a single data structure: a compositional code DAG. Tools are nodes. Edges encode which tool calls which other tool. Each node stores four layers of metadata — typed signature, description, behavioral contract, and worked examples — from cheapest to most expensive to check. At inference time, retrieval cascades through these layers, pruning aggressively at each stage so the model only pays for tools that survive. At training time, successful solution traces are folded into new composite tools, and the planner is rewarded for using them. The same DAG that keeps retrieval cheap also trains the planner to exploit abstractions.
The headline number: an 8-billion-parameter student matches or exceeds its 32-billion-parameter teacher on GSM8K (93.67 vs 93.40) and MATH (63.18 vs 61.62). Retrieval cost grows sublinearly in library size: at 1,600 nodes, flat retrieval costs 11.4× more tokens than CoCoDA's typed DAG retrieval. Every component matters — removing the DAG structure, the cascade retrieval, or the shaped reward each costs 1–3.5 points of accuracy.
A small language model with a tool library can punch above its weight class. But every new tool makes the next query harder to answer. Two forces, pulling in opposite directions.
CoCoDA frames the joint planner–library optimization problem explicitly. Given a planner $\pi_\theta$, a tool library $\mathcal{L} = (V, E, I)$, and a query $q$, the planner samples a trajectory $\tau \sim \pi_\theta(\cdot \mid q, C_q(\mathcal{L}))$ invoking tools $(v_1, \ldots, v_T)$ via an executor. The optimization objective is:
where $R_{\text{res}} \in [0,1]$ is the task verifier reward, $T(\tau)$ counts tool calls, and $\bar{C}_{\text{retr}}$ is the average per-call retrieval token cost.
Three structural axes drive cost: (A) per-call retrieval cost $\bar{C}_{\text{retr}}$, which is linear in $|V|$ for flat retrieval; (B) the number of tool calls $T(\tau)$, which composites can shrink; and (C) the per-call context limit $W$, which flat retrieval breaches on any non-trivial library.
Drag the library size slider to see how cost scales. Chart updates live.
Tools live in a directed acyclic graph, not a flat list. Edges are invocation dependencies; depth measures abstraction. Click a node to see its four-layer record.
The library $\mathcal{L}$ is a DAG $G = (V, E)$ where $V = V_p \cup V_c$ partitions tools into primitive nodes (atomic functions, $d(v) = 0$) and composite nodes (macro-actions, $d(v) = 1 + \max_{u \in \text{Ch}(v)} d(u)$). Each node carries a record $I(v) = (L_1, L_2, L_3, L_4)$:
$\text{flat}(v)$ counts primitive leaves in the recursive expansion of $v$; $\Phi(v)$ is the saved-call count — how many primitive calls a composite saves.
Click any node to inspect its four-layer record. Composite nodes (navy) invoke children; primitive nodes (teal) are atomic.
A four-stage cascade that prunes candidates from thousands to one, paying the most expensive checks only on the survivors.
Given a query-induced subgoal, retrieval constructs a nested sequence $V = S_0 \supseteq S_1 \supseteq S_2 \supseteq S_3 \supseteq S_4$:
Adjust library size and filter selectivity to watch candidates cascade through the four stages.
The planner learns to use new tools. The library learns from the planner's successes. One coupled update, one DAG, one reward signal.
At iteration $k$, the coupled update rule is:
The planner update is a clipped GRPO step on $R = R_{\text{res}} + \lambda R_{\text{comp}}$. FOLD applies INSERT\_TOOL to each candidate from the teacher, performing ADD/MERGE/REJECT after spec and acyclicity validation.
Adjust the learning rate and reward weight to see how library growth and accuracy respond.
Not all trajectories are equal, even when they get the same answer. The DAG knows which one used fewer primitive calls — and rewards it.
The cost block $\mathbb{E}_\tau[T(\tau) \cdot \bar{C}_{\text{retr}}]$ reduces to maximizing $-T(\tau)$ over the policy. Since every invocation satisfies $\text{flat}(t_i) - \Phi(t_i) = 1$:
$T_{\text{prim}}(\tau)$ is invariant under lossless substitution (same computation, different phrasing). Within a GRPO group on the same query, group-mean subtraction absorbs it, leaving only the saved-call credit as the gradient signal: $R_{\text{comp}}(\tau) = \sum_i \Phi(t_i) = \sum_i (\text{flat}(t_i) - 1)$.
Add tools to a trajectory. Compare a primitive-only path vs. one that uses composites.
At 1,600 tools, flat retrieval costs 11.4× more tokens than CoCoDA. The gap widens with every new tool added to the library.
Sweep the library size to see how flat retrieval, text-hierarchical RAG, and typed DAG retrieval scale differently.
An 8B student matches a 32B teacher on math. Every component of CoCoDA earns its place. The biggest ablation drop comes from removing the DAG itself.
Select a model size and a benchmark category to compare methods.
Chart shows accuracy drop from removing each of CoCoDA's three core components (4B student, averaged across benchmarks).