An Interactive Reading of

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

Yeahia Sarker, Md Rahmat Ullah, Musa Molla & Shafiq Joty · MTSU · InfinitiBit GmbH · Salesforce Research · March 2026 · arXiv:2605.13848

Most multi-agent AI frameworks ask the LLM to decide which agent runs next. That sounds smart until you realise the model sometimes invents agents that don't exist, gets trapped in infinite loops, or produces different results from the same input — and there's no way to audit why.

GraphBit takes a radically different approach. Think of a factory assembly line: the conveyor belt doesn't ask each robot what to do next — the route is fixed. GraphBit treats every workflow as a directed acyclic graph, then hands all routing decisions to a deterministic Rust-based engine. The LLM focuses purely on reasoning inside its assigned node, while the engine handles everything else: parallel dispatch, state management, tool invocation, error recovery.

On the GAIA benchmark of 68 real-world tasks, GraphBit achieves 67.6% accuracy with zero framework-induced hallucinations, 14.7 points ahead of the next-best framework (Pydantic AI at 52.9%). It adds 11.9 ms of processing overhead (5.9× lower than AutoGen's 70.0 ms) while using less than half of AutoGen's memory (126.1 MB vs 274.8 MB). Compared with a single-tier memory baseline, the three-tier memory architecture accounts for a 14.7-point accuracy gain.

I

Deterministic DAG Execution

Workflows are directed acyclic graphs. The engine, not the LLM, makes every routing decision, which eliminates hallucinated routing by construction.

II

Three-Tier Memory

Ephemeral scratch, structured state, and external connectors are isolated, preventing cascading context bloat that degrades LLM reasoning.

III

Zero Hallucination, Peak Accuracy

The only framework to combine 0% hallucination with the highest accuracy (67.6%), beating six established baselines.

Chapter 1

When the LLM Steers Itself

Every major multi-agent framework — LangChain, CrewAI, AutoGen — lets the LLM decide which agent runs next. It works, until it doesn't. What happens when the model hallucinates a route?

Imagine a hospital where the receptionist decides which specialist you see — by guessing. Most of the time she gets it right. But sometimes she sends a cardiac patient to the podiatrist, or forgets you're in the waiting room and you sit there forever. That's prompted orchestration: the LLM is the receptionist, and it guesses the next step based on vibes rather than a fixed schedule. GraphBit replaces the guessing receptionist with a railway switch: the tracks are laid before the train leaves the station.

Prompt-Orchestrated Routing

At each timestep $t$, the LLM selects an action:

$$a_t = \text{LLM}(s_t,\; C_t)$$

where $s_t$ is the current state and $C_t$ is the accumulated context. Crucially, $C_t$ grows unboundedly as the conversation progresses, and $a_t$ is stochastic — identical inputs can yield different routes on repeated runs. The framework has no mechanism to guarantee $a_t \in \mathcal{A}$, where $\mathcal{A}$ is the set of valid agent actions, opening the door to hallucinated agents and impossible transitions.
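The gap between a stochastic router and a fixed route can be sketched in a few lines of plain Python. This is an illustration only, not any framework's API: `VALID_AGENTS`, `prompted_router`, `FIXED_ROUTE`, and `is_valid` are hypothetical names, and a seeded random choice stands in for the LLM call.

```python
import random

# Hypothetical set of registered agents (the set A in the text).
VALID_AGENTS = {"planner", "researcher", "writer"}

def prompted_router(state: str, context: list[str], rng: random.Random) -> str:
    """Stand-in for a_t = LLM(s_t, C_t): stochastic, and free to name anything."""
    # "fact_checker" is deliberately NOT a registered agent.
    candidates = sorted(VALID_AGENTS) + ["fact_checker"]
    return rng.choice(candidates)

# Deterministic alternative: the route is a lookup table fixed before execution.
FIXED_ROUTE = {"planner": "researcher", "researcher": "writer"}

def is_valid(action: str) -> bool:
    """The membership check a_t in A that the prompted loop never performs."""
    return action in VALID_AGENTS

rng = random.Random(0)
sampled = [prompted_router("planner", [], rng) for _ in range(200)]
invalid = [a for a in sampled if not is_valid(a)]
print(f"prompted: {len(set(sampled))} distinct routes, {len(invalid)} invalid of 200")
print("fixed route valid:", all(is_valid(nxt) for nxt in FIXED_ROUTE.values()))
```

The same input produces several different routes under the prompted router, some of them pointing at an agent that does not exist, while every entry in the fixed table can be checked once before anything runs.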

Failure-Mode Comparison

[Interactive figure: failure patterns compared across orchestration modes.]

The three failure modes are not edge cases. On web-enabled tasks, LangGraph hallucinates on 69% of executions — more than two-thirds of runs fail not because the LLM reasoned poorly, but because the framework routed it wrong.


Chapter 2

Drawing the Blueprint

Before any agent runs, GraphBit requires the developer to draw a map: a directed acyclic graph specifying every node, every edge, every decision point. No ambiguity, no surprises at runtime.

Think of a DAG like a recipe with strict ordering: you must chop the onions before you sauté them, but the salad and the sauce can be prepared in parallel. A cycle would mean the onions have to be chopped after the soup is done — which is impossible if the soup needs onions. GraphBit rejects cycles at graph-construction time, so infinite loops are architecturally impossible.
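Construction-time cycle rejection can be sketched with Python's standard `graphlib` module. This is a minimal illustration of the idea using the recipe analogy, not GraphBit's actual validation code; `build_order` and the step names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

def build_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError if the graph has a cycle.

    `dependencies` maps each step to the set of steps that must finish first.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"workflow rejected, cycle found: {exc.args[1]}") from exc

recipe = {
    "chop_onions": set(),
    "saute_onions": {"chop_onions"},
    "make_salad": set(),          # independent of the onion steps
    "make_soup": {"saute_onions"},
}
print(build_order(recipe))

# "Chop the onions after the soup is done" closes a loop and is rejected.
recipe_with_cycle = dict(recipe, chop_onions={"make_soup"})
try:
    build_order(recipe_with_cycle)
except ValueError as err:
    print(err)
```

Because the check runs before any step executes, a cyclic workflow never reaches the point where it could loop.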

DAG Formalism

A GraphBit workflow is defined as a directed acyclic graph:

$$G = (V,\; E)$$

where the node set partitions into three types:

$$V = V_A \cup V_T \cup V_C$$

$V_A$: Agent nodes — LLM-backed reasoning units that consume inputs and produce outputs.
$V_T$: Tool nodes — external service interfaces (search, file I/O, API calls) with typed request/response schemas.
$V_C$: Control nodes — routing primitives (merge, fork, condition, map-reduce) that structure data flow without invoking the LLM.

Edges $E \subseteq V \times V$ carry typed data dependencies. A node is ready to execute when all its predecessors have completed:

$$\text{ready}(v) \;\iff\; \forall\, u \in \text{pred}(v):\; u \text{ completed}$$

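The acyclicity constraint guarantees that orchestration terminates: a finite DAG has no infinite execution path, so provided each node itself returns (for example, because LLM and tool calls are bounded by timeouts), every workflow halts in finite time.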
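The formalism above maps onto a small data model. The sketch below is an assumption-laden illustration in plain Python (`NodeKind`, `Node`, and `ready` are hypothetical names, not GraphBit types) showing the three node types and the readiness predicate.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeKind(Enum):
    AGENT = "agent"      # V_A: LLM-backed reasoning
    TOOL = "tool"        # V_T: external service call
    CONTROL = "control"  # V_C: merge, fork, condition, map-reduce

@dataclass
class Node:
    name: str
    kind: NodeKind
    preds: set[str] = field(default_factory=set)  # pred(v)

def ready(node: Node, completed: set[str]) -> bool:
    """ready(v) holds iff every predecessor of v has completed."""
    return node.preds <= completed

search = Node("search", NodeKind.TOOL)
summarise = Node("summarise", NodeKind.AGENT, preds={"search"})
merge = Node("merge", NodeKind.CONTROL, preds={"search", "summarise"})

print(ready(search, set()))                   # True: no predecessors
print(ready(merge, {"search"}))               # False: "summarise" still pending
print(ready(merge, {"search", "summarise"}))  # True
```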

Explore a Sample DAG

[Interactive figure: a sample DAG showing each node's type, inputs, and outputs.]

Because edges carry typed data with schema validation at node boundaries, a misconfigured workflow fails immediately at construction time — not silently at runtime after burning through API credits.
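A minimal sketch of fail-fast edge validation, assuming each node declares a single input and output type. `NodeSpec` and `connect` are hypothetical names chosen for illustration; the source does not describe GraphBit's real schema mechanism at this level of detail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeSpec:
    name: str
    input_type: type
    output_type: type

def connect(edges: list[tuple[NodeSpec, NodeSpec]]) -> list[tuple[str, str]]:
    """Validate every edge up front; raise TypeError before anything runs."""
    for src, dst in edges:
        if not issubclass(src.output_type, dst.input_type):
            raise TypeError(
                f"{src.name} emits {src.output_type.__name__}, "
                f"but {dst.name} expects {dst.input_type.__name__}"
            )
    return [(src.name, dst.name) for src, dst in edges]

fetch = NodeSpec("fetch", input_type=str, output_type=bytes)
parse = NodeSpec("parse", input_type=bytes, output_type=dict)
report = NodeSpec("report", input_type=str, output_type=str)

print(connect([(fetch, parse)]))  # accepted: bytes flows into bytes
try:
    connect([(fetch, parse), (parse, report)])  # dict does not satisfy str
except TypeError as err:
    print("rejected at construction:", err)
```

The mismatched edge is caught while the graph is being assembled, so no LLM or tool call is made for a workflow that could never have completed.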


Chapter 3

The Rust Engine

GraphBit's execution engine is written in Rust, not Python. It maintains a ready queue of nodes whose inputs are satisfied, dispatches independent nodes in parallel, and enforces correctness invariants that eliminate common failure modes.

Picture a train dispatcher at a busy junction. She doesn't drive the trains — she looks at her board, sees which platforms are clear, and signals the next departure. The trains (agents) run on fixed tracks (DAG edges). She can dispatch two trains simultaneously to different destinations (parallel execution), but a connecting train waits until its predecessor arrives (sequential dependency). The dispatcher is deterministic: given the same board state, she always makes the same call. That's the GraphBit engine.

Dataflow Execution Model

The engine maintains a ready queue of executable nodes:

$$Q = \{\, v \in V : \text{ready}(v) \;\land\; v \notin \text{visited} \,\}$$

Nodes in $Q$ are dispatched to a thread pool. The per-node processing latency decomposes as:

$$t_{\text{proc}} = t_{\text{enqueue}} + t_{\text{dispatch}} + t_{\text{execute}} + t_{\text{commit}}$$

Measured overhead per node: $t_{\text{proc}} = 11.9$ ms (excluding LLM inference time). The engine achieves a throughput of 5,025 operations per minute with 21.1% CPU utilisation and 126.1 MB memory footprint.
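The ready-queue loop can be sketched in Python with a standard thread pool. This is a simplified, wave-synchronised approximation written for illustration (the real engine is in Rust and its internals are not shown in the source); `run_dag` and the node names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(preds: dict[str, set[str]], work) -> list[list[str]]:
    """Execute nodes in waves; each wave is the current ready queue Q.

    `preds` maps node -> predecessors; `work(name)` is the node body.
    Returns the waves in dispatch order (names sorted for reproducibility).
    """
    visited: set[str] = set()  # nodes that have been dispatched and completed
    waves: list[list[str]] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(visited) < len(preds):
            # Q = { v : ready(v) and v not in visited }
            queue = sorted(v for v in preds if v not in visited and preds[v] <= visited)
            if not queue:
                raise ValueError("no ready node: the graph contains a cycle")
            list(pool.map(work, queue))  # independent nodes run in parallel
            visited.update(queue)        # commit before computing the next wave
            waves.append(queue)
    return waves

dag = {
    "plan": set(),
    "search_web": {"plan"},
    "read_pdf": {"plan"},
    "answer": {"search_web", "read_pdf"},
}
print(run_dag(dag, work=lambda name: name.upper()))
# [['plan'], ['read_pdf', 'search_web'], ['answer']]
```

Given the same graph, the loop always produces the same waves, which is the determinism property the dispatcher analogy describes. A production scheduler would typically release each node as soon as its own predecessors finish rather than waiting for the whole wave.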

Throughput vs Latency Simulator

[Interactive simulator: throughput and latency for 4 parallel branches with 5 nodes per branch.]
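For the settings shown (4 parallel branches, 5 nodes per branch), a back-of-envelope estimate of orchestration overhead is sketched below. It assumes every node costs the reported 11.9 ms, that branches are independent, and that LLM inference time is excluded; `overhead_ms` is a hypothetical helper, not part of the original simulator.

```python
import math

T_PROC_MS = 11.9  # per-node overhead reported in the text, excluding LLM inference

def overhead_ms(branches: int, nodes_per_branch: int, workers: int) -> float:
    """Orchestration overhead when independent branches share `workers` threads.

    Assumes equal-cost nodes and branches scheduled in rounds of `workers`.
    """
    rounds = math.ceil(branches / workers)
    return rounds * nodes_per_branch * T_PROC_MS

sequential = overhead_ms(branches=4, nodes_per_branch=5, workers=1)
parallel = overhead_ms(branches=4, nodes_per_branch=5, workers=4)
print(f"sequential: {sequential:.1f} ms, parallel: {parallel:.1f} ms")
# sequential: 238.0 ms, parallel: 59.5 ms
```

Under these assumptions, parallel dispatch cuts overhead to the length of the longest branch rather than the sum of all nodes.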

The Rust engine eliminates Python interpreter overhead during orchestration. The result: 5,025 operations per minute throughput, a 3× improvement over the fastest comparable baseline, with memory consumption 24% lower.
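As a rough consistency check, the throughput and per-node overhead figures line up if operations are processed back to back on a single worker. That is an assumption, since the text does not say how throughput was measured:

$$\frac{60{,}000\ \text{ms/min}}{5{,}025\ \text{ops/min}} \approx 11.94\ \text{ms per operation} \approx t_{\text{proc}}$$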


Chapter 4

Isolating Context

A single bloated context window is the silent killer of multi-step LLM reasoning. GraphBit's memory system splits into three isolated tiers, each with a specific purpose, preventing the cascading context growth that degrades other frameworks.

Imagine three desks in an office. The scratch desk is for working papers — you use it during a meeting, then clear it entirely before the next one. The filing cabinet stores the official project state: what's been decided, what's pending, what's done. The library has reference materials you can consult but can't take home. Each desk only has what it needs — no clutter from other projects. That's GraphBit's three-tier memory: ephemeral scratch, structured state, external connectors.

Memory Architecture

GraphBit partitions memory into three isolated tiers:

$$M = \{\, M_e,\; M_s,\; M_x \,\}$$

$M_e$: Ephemeral scratch — allocated when a node begins execution and deallocated on completion. Each node sees only its own scratch space.
$M_s$: Structured state — a typed key-value store with atomic updates. Persists across the workflow lifetime.
$M_x$: External connectors — managed interfaces to external resources (databases, APIs, file systems) with connection pooling and caching.

Access is scoped per node. A node $v$ can read only the state keys it declares:

$$\text{reads}(v) \subseteq \text{declared}(v) \subseteq M_s$$

This scoping prevents cross-contamination: a node cannot accidentally observe or overwrite another node's working data.
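A minimal sketch of per-node scoping, assuming keys are declared as a set of strings. The classes below (`StructuredState`, `ScopedView`, `ephemeral_scratch`) are hypothetical and written for illustration; they model $M_s$ and $M_e$ only, ignore atomicity, and apply the declared-key check to writes as well as reads.

```python
from contextlib import contextmanager

class StructuredState:
    """M_s: a key-value store that persists for the whole workflow."""
    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    def view(self, declared: set[str]) -> "ScopedView":
        return ScopedView(self._data, frozenset(declared))

class ScopedView:
    """What one node sees: only the keys it declared."""
    def __init__(self, data: dict[str, object], declared: frozenset[str]) -> None:
        self._data, self._declared = data, declared

    def _check(self, key: str) -> None:
        if key not in self._declared:
            raise PermissionError(f"key {key!r} was not declared by this node")

    def read(self, key: str) -> object:
        self._check(key)
        return self._data[key]

    def write(self, key: str, value: object) -> None:
        self._check(key)
        self._data[key] = value

@contextmanager
def ephemeral_scratch():
    """M_e: created when a node starts, cleared when it finishes."""
    scratch: dict[str, object] = {}
    try:
        yield scratch
    finally:
        scratch.clear()

state = StructuredState()
writer = state.view({"draft"})
with ephemeral_scratch() as scratch:
    scratch["notes"] = "working paper"
    writer.write("draft", "v1")
print(scratch)               # {} : scratch did not outlive the node
print(writer.read("draft"))  # v1
try:
    writer.read("budget")    # not declared, so not visible
except PermissionError as err:
    print(err)
```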

Context Growth Simulator

[Interactive simulator: context growth over a 10-step workflow.]
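Why a shared context grows so quickly can be shown with a toy model. The sketch assumes every step contributes the same number of tokens (500 is an arbitrary illustrative figure, not a number from the paper) and that a shared context re-sends everything accumulated so far; `tokens_sent` is a hypothetical helper.

```python
def tokens_sent(steps: int, tokens_per_step: int, shared_context: bool) -> int:
    """Total prompt tokens across a workflow.

    shared_context=True: step k re-sends the material of all k-1 earlier steps
    plus its own input. shared_context=False: each step sends only its own input.
    """
    if shared_context:
        return sum(k * tokens_per_step for k in range(1, steps + 1))
    return steps * tokens_per_step

shared = tokens_sent(steps=10, tokens_per_step=500, shared_context=True)
scoped = tokens_sent(steps=10, tokens_per_step=500, shared_context=False)
print(shared, scoped, shared / scoped)  # 27500 5000 5.5
```

In this model the shared context costs $n(n+1)/2$ units against $n$ for scoped access, so the gap widens with every additional step.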

The ablation study is stark: removing structured state drops accuracy by 10.2 percentage points. The single-tier baseline, which combines all memory into one space, degrades to 52.9% accuracy while using roughly twice the memory (247.8 MB vs 126.1 MB). Memory segregation isn't a nice-to-have; it's fundamental.


Chapter 5

68 Tasks, 7 Frameworks

The GAIA benchmark tests general AI assistants on real-world tasks requiring multi-step reasoning, tool use, and web navigation. The authors curated 68 tasks from three workflow types: no-tool, document-augmented, and web-enabled.

Think of GAIA like an obstacle course with three lanes. The first lane is a written exam — pure brainpower, no tools. The second lane lets you consult reference books (attached PDFs, spreadsheets). The third lane gives you a web browser. Every contestant gets the same questions and the same LLM brain (GPT-5.2). The only difference is the coach — the framework that decides when to consult which tool, how to route information between steps, and how to recover from mistakes. That's where GraphBit's deterministic coach shines.

Benchmark Setup

68 tasks distributed across three categories:

No-tool tasks: 7 tasks — pure LLM reasoning, no external tools.
Document-augmented tasks: 19 tasks — provided PDFs, spreadsheets, or images as context.
Web-enabled tasks: 42 tasks — require web search, browsing, or API access.

Six evaluation metrics: accuracy, hallucination rate, processing latency (ms), CPU utilisation (%), memory consumption (MB), throughput (ops/min). All seven frameworks use GPT-5.2 with identical temperature and token-limit settings. The evaluation is controlled: accuracy differences arise solely from orchestration architecture.
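The category counts can be turned into shares of the evaluation with a few lines of arithmetic. The counts are taken from the list above; the variable names are illustrative.

```python
TASKS = {"no-tool": 7, "document-augmented": 19, "web-enabled": 42}

total = sum(TASKS.values())
shares = {name: round(100 * count / total, 1) for name, count in TASKS.items()}
print(total)   # 68
print(shares)  # {'no-tool': 10.3, 'document-augmented': 27.9, 'web-enabled': 61.8}
```

Web-enabled tasks dominate the benchmark, so a framework's behaviour on that category weighs most heavily on its overall score.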

[Figure: Task Breakdown by Framework]

The evaluation is deliberately fair: all frameworks get the same LLM, same temperature, same token limits. Differences in accuracy come entirely from orchestration architecture — not from model quality.


Chapter 6

Accuracy vs Hallucination

GraphBit achieves 67.6% accuracy with 0% hallucination — a combination no other framework matches. Let's explore where the gaps come from.

This is like comparing airlines by two metrics: do you arrive at the right airport (accuracy), and do you ever get stranded mid-flight (hallucination)? Most frameworks sacrifice one for the other. Pydantic AI and LlamaIndex never strand you — but they don't always get you to the right place. LangChain and LangGraph sometimes arrive correctly, but 41–47% of the time they crash mid-flight. GraphBit is the airline that both arrives correctly and never crashes. That shouldn't be remarkable — but in the LLM framework world, it is.

Key Results from Table 1

Framework	Accuracy (%)	Hallucination (%)	Latency (ms)	CPU (%)	Memory (MB)
GraphBit	67.6	0.0	11.9	21.1	126.1
Pydantic AI	52.9	0.0	18.3	24.2	166.5
LlamaIndex	50.0	0.0	15.0	25.2	165.4
CrewAI	44.9	14.3	31.0	30.7	202.2
LangChain	38.2	41.2	36.1	24.7	234.4
LangGraph	36.8	47.1	31.5	26.1	208.0
AutoGen	35.3	33.8	70.0	27.0	274.8

Hallucination scales sharply with tool complexity. For LangGraph, the hallucination rate on web-enabled tasks alone:

$$h_{\text{web}} = 69.0\% \quad\text{vs}\quad h_{\text{notool}} = 0\%$$

The more tools available, the more routing decisions the LLM must make — and the more it hallucinates. GraphBit's deterministic engine is immune to this scaling effect.
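The headline comparisons can be recomputed directly from the Table 1 rows. The snippet below only copies the published figures into a dictionary and derives the gaps; the variable names are illustrative.

```python
# Rows copied from Table 1: (accuracy %, hallucination %, latency ms, CPU %, memory MB)
TABLE_1 = {
    "GraphBit":    (67.6, 0.0, 11.9, 21.1, 126.1),
    "Pydantic AI": (52.9, 0.0, 18.3, 24.2, 166.5),
    "LlamaIndex":  (50.0, 0.0, 15.0, 25.2, 165.4),
    "CrewAI":      (44.9, 14.3, 31.0, 30.7, 202.2),
    "LangChain":   (38.2, 41.2, 36.1, 24.7, 234.4),
    "LangGraph":   (36.8, 47.1, 31.5, 26.1, 208.0),
    "AutoGen":     (35.3, 33.8, 70.0, 27.0, 274.8),
}

graphbit = TABLE_1["GraphBit"]
baselines = {name: row for name, row in TABLE_1.items() if name != "GraphBit"}

runner_up = max(baselines, key=lambda name: baselines[name][0])
lead = round(graphbit[0] - baselines[runner_up][0], 1)
speedup = round(TABLE_1["AutoGen"][2] / graphbit[2], 1)
memory_ratio = round(graphbit[4] / TABLE_1["AutoGen"][4], 2)
zero_hallucination = sorted(name for name, row in TABLE_1.items() if row[1] == 0.0)

print(runner_up, lead)     # Pydantic AI 14.7
print(speedup)             # 5.9
print(memory_ratio)        # 0.46
print(zero_hallucination)  # ['GraphBit', 'LlamaIndex', 'Pydantic AI']
```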

[Figures: Accuracy–Hallucination Trade-off; Accuracy by Task Type]

On web-enabled tasks — 61.8% of the evaluation — GraphBit's 69.0% accuracy vs Pydantic AI's 54.8% isn't close. And LangGraph's 69.0% hallucination rate means over two-thirds of web-task runs fail due to the framework itself, not the LLM.


Chapter 7

What Actually Matters

The ablation study isolates the contribution of each memory tier. The results show that structured state provides the greatest gain, but every tier contributes measurably.

Imagine removing ingredients from a recipe one at a time. Remove the salt and the dish is bland. Remove the heat and it's raw. Remove the pan and you can't cook at all. GraphBit's ablation works the same way: remove ephemeral scratch and accuracy drops 2.9 points; remove structured state and it plummets 10.2 points; remove external connectors and you lose 7.3 points. Collapse all three tiers into one (the single-tier baseline) and you're at 52.9%, exactly where Pydantic AI sits with its simpler architecture. Every layer matters, but structured state matters most.

Ablation Results (Table 3)

Configuration	Accuracy (%)	Δ Accuracy	Memory (MB)
Full GraphBit	67.6	—	126.1
w/o Ephemeral Scratch ($M_e$)	64.7	−2.9	189.2
w/o Structured State ($M_s$)	57.4	−10.2	138.7
w/o External Connectors ($M_x$)	60.3	−7.3	130.4
Single-tier Baseline	52.9	−14.7	247.8

Token efficiency further illustrates the benefit of memory isolation. GraphBit uses 1,916 tokens per task on average, compared to 6,276 for Pydantic AI (3.3× as many) and 13,638 for CrewAI (7.1× as many). Fewer wasted tokens mean lower cost, lower latency, and more relevant context at each reasoning step.
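The ablation deltas and token ratios follow from the reported figures by simple arithmetic, as the sketch below shows. The numbers are copied from Table 3 and the paragraph above; the dictionary names are illustrative.

```python
FULL_ACCURACY = 67.6
ABLATIONS = {  # configuration -> (accuracy %, memory MB), copied from Table 3
    "w/o ephemeral scratch": (64.7, 189.2),
    "w/o structured state": (57.4, 138.7),
    "w/o external connectors": (60.3, 130.4),
    "single-tier baseline": (52.9, 247.8),
}
TOKENS_PER_TASK = {"GraphBit": 1916, "Pydantic AI": 6276, "CrewAI": 13638}

deltas = {name: round(acc - FULL_ACCURACY, 1) for name, (acc, _) in ABLATIONS.items()}
token_ratios = {
    name: round(count / TOKENS_PER_TASK["GraphBit"], 1)
    for name, count in TOKENS_PER_TASK.items()
    if name != "GraphBit"
}
largest_drop = min(deltas, key=deltas.get)

print(deltas)
print(token_ratios)  # {'Pydantic AI': 3.3, 'CrewAI': 7.1}
print(largest_drop)  # single-tier baseline
```

Among the single-tier removals, dropping structured state costs the most accuracy (10.2 points), while dropping ephemeral scratch costs the most memory (189.2 MB against 126.1 MB for the full system).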

[Figures: Ablation Impact; Memory Consumption by Configuration]

Of the 22 tasks GraphBit failed, zero were caused by orchestration errors. All failures were LLM reasoning mistakes or ambiguous task specifications. Meanwhile, LangGraph hallucinates on 69% of its web-task runs, a class of routing failure that deterministic orchestration eliminates entirely.

GraphBit demonstrates that you don't need the LLM to orchestrate itself. A deterministic engine with structured memory produces better results, faster, with fewer errors. The insight generalises: whenever reliability matters — regulated industries, enterprise automation, scientific pipelines — separating the planner from the worker pays dividends.