Chip architects draw block diagrams to plan how processors, memory controllers, and peripherals talk to each other. These system-level diagrams are the blueprints of every chip, but unlike circuit schematics they use non-standardized symbols and inconsistent wiring conventions, and they vary widely across companies. No existing AI model can reliably read them.
The authors build DiagramNet: the first dataset of 1,000 annotated system-level diagrams (10,977 connection pairs, 15,515 QA pairs) and a three-agent AI pipeline that decomposes the problem. A Perception Agent detects components with YOLO. A Reasoning Agent (a 3B-parameter vision-language model) predicts how they connect. A Knowledge Agent answers circuit questions via LoRA adapters. The whole system is trained through supervised fine-tuning, reinforcement learning, and low-rank adaptation.
The headline result: their 3B-parameter model achieves an overall score of 0.671, surpassing the 2025 EDA Elite Challenge winner and outperforming GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2× in end-to-end evaluation. The multi-agent workflow alone boosts Gemini-2.5-Pro's Task 1 score by 128.7×. With only 60 adaptation images, the system transfers to a completely different circuit benchmark.
Transistor-level schematics have standard symbols. Gate-level netlists have standard formats. But system-level diagrams — the architectural blueprints that sit at the top of the chip design hierarchy — are the Wild West. No two companies draw them the same way.
You can't ask a model to "read the diagram" in one shot — the output space is too large and the spatial constraints are too complex. The paper decomposes the problem into four well-defined subtasks across three semantic levels.
$$\begin{aligned} \text{Listing:} \quad & f_{\text{list}} : I \to C = \{c_1, \ldots, c_n\} \\[4pt] \text{Localization:} \quad & f_{\text{loc}} : (I, c_i) \to b_i \in [0,1]^4 \\[4pt] \text{Connection:} \quad & f_{\text{conn}} : (I, c_i, C) \to T_i \subseteq C \\[4pt] \text{Circuit QA:} \quad & f_{\text{qa}} : (I, q) \to (r, a) \end{aligned}$$
Here $I$ is the input image, $C$ is the component set ordered by row-major position index, $b_i = (x, y, w, h)$ is a normalized bounding box, $T_i \subseteq C \setminus \{c_i\}$ is the set of output targets from $c_i$, and $(r, a)$ denotes stepwise reasoning followed by the final answer.
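As a rough illustration, the four mappings can be written as typed signatures together with checks for the two constraints stated above: a box must lie in $[0,1]^4$ and a target set must lie in $C \setminus \{c_i\}$. All names here (`DiagramReader`, `is_valid_box`, `is_valid_targets`) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Protocol, Set, Tuple

BBox = Tuple[float, float, float, float]  # normalized (x, y, w, h)


@dataclass(frozen=True)
class QAResult:
    reasoning: str  # r: stepwise reasoning
    answer: str     # a: final answer


class DiagramReader(Protocol):
    """Hypothetical interface mirroring f_list, f_loc, f_conn and f_qa."""

    def list_components(self, image: bytes) -> List[str]: ...
    def localize(self, image: bytes, component: str) -> BBox: ...
    def connections(self, image: bytes, component: str,
                    components: List[str]) -> Set[str]: ...
    def answer(self, image: bytes, question: str) -> QAResult: ...


def is_valid_box(box: BBox) -> bool:
    """A box is valid when all four values lie in [0, 1]."""
    return len(box) == 4 and all(0.0 <= v <= 1.0 for v in box)


def is_valid_targets(source: str, targets: Set[str],
                     components: List[str]) -> bool:
    """Targets must be known components and must exclude the source."""
    return source not in targets and targets <= set(components)
```

Checks of this kind are cheap to run on every model output, which helps when each subtask is scored separately.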
Three semantic levels organize these tasks:
No dataset existed for system-level diagram understanding. So the authors built one: 1,000 diagrams from major chip design and computer architecture venues, with 10,977 connection annotations and 15,515 chain-of-thought QA pairs.
End-to-end MLLMs suffer from visual grounding bottlenecks and spatial hallucinations on dense diagrams. The solution: decompose recognition into three specialized agents, each handling one aspect of the problem.
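One way to picture the decomposition is as three pluggable callables that run in sequence. The sketch below is not the paper's algorithm: the agent interfaces, the `read_diagram` name, and the filtering of predicted targets against the detected components are all assumptions made for illustration, and the stub agents stand in for the real detector and vision-language models.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

BBox = Tuple[float, float, float, float]  # normalized (x, y, w, h)


@dataclass
class Component:
    name: str
    box: BBox


@dataclass
class DiagramReading:
    components: List[Component]
    connections: Dict[str, Set[str]]
    answer: Optional[str] = None


def read_diagram(image, perceive, connect, answer=None, question=None):
    # Perception: detect and localize components.
    components = perceive(image)
    names = {c.name for c in components}
    # Reasoning: one connection query per detected component.
    connections = {}
    for comp in components:
        targets = set(connect(image, comp, components))
        # Assumed post-processing: drop unknown names and self-loops.
        connections[comp.name] = (targets & names) - {comp.name}
    # Knowledge: answer a question only when one is supplied.
    reply = answer(image, question) if (answer and question) else None
    return DiagramReading(components, connections, reply)


# Stub agents with made-up outputs, used only to exercise the data flow.
def fake_perceive(image):
    return [Component("CPU", (0.1, 0.1, 0.2, 0.2)),
            Component("DMA", (0.4, 0.1, 0.2, 0.2)),
            Component("SRAM", (0.7, 0.1, 0.2, 0.2))]


def fake_connect(image, comp, components):
    table = {"CPU": ["DMA", "SRAM", "GPU"], "DMA": ["SRAM", "DMA"], "SRAM": []}
    return table[comp.name]


reading = read_diagram(b"", fake_perceive, fake_connect)
```

Because each stage only sees the previous stage's structured output, any of the three callables can be swapped for a different model without touching the others.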
Algorithm 1 from the paper summarises the inference procedure:
The Reasoning Agent is trained through a progressive three-phase pipeline: supervised fine-tuning builds base competence, reinforcement learning on hard samples improves robustness, and LoRA adaptation specialises the model for specific downstream tasks.
Phase 1 — Supervised Fine-Tuning Loss:
$$\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{L} \log P(y_i \mid X_v, X_t, y_{<i})$$
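The supervised objective is a standard autoregressive negative log-likelihood: for each of the $L$ target tokens, take the log-probability the model assigns to the correct token given the image, the prompt, and the preceding target tokens, then sum and negate. The sketch below assumes the conditioning has already happened inside the model and starts from per-step logits. Reading $X_v$ as the visual input and $X_t$ as the text prompt is an interpretation of the notation, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_loss(step_logits: Sequence[Sequence[float]],
             target_ids: Sequence[int]) -> float:
    """Sum of -log P(y_i | ...) over the target sequence.

    step_logits[i] holds the logits produced at step i, i.e. already
    conditioned on the image, the prompt, and the targets before i.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("one logit vector is needed per target token")
    return -sum(log_softmax(logits)[t]
                for logits, t in zip(step_logits, target_ids))


# With uniform logits over a 4-token vocabulary, every token costs log(4).
uniform_loss = sft_loss([[0.0, 0.0, 0.0, 0.0]] * 3, [0, 1, 2])
```

A model that puts all its probability mass on the correct token at every step drives this loss toward zero.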
Phase 3 — LoRA Forward Pass:
$$h = W_0 x + \Delta W \, x = W_0 x + \frac{\alpha}{r} BAx$$
Where $W_0 \in \mathbb{R}^{d \times k}$ are the frozen base weights, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices with $r \ll \min(d, k)$, and $\alpha$ is a scaling hyperparameter, so that $\Delta W = \frac{\alpha}{r} BA$.
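A small numeric sketch of the forward pass $h = W_0 x + \frac{\alpha}{r} BAx$, written in plain Python so the shapes are easy to follow. The matrices and the value of `alpha` are toy values chosen for illustration, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for row-major nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix,
                 x: Vector, alpha: float) -> Vector:
    """h = W0 x + (alpha / r) * B (A x), with r taken from A's row count."""
    r = len(a)
    base = matvec(w0, x)             # frozen path, shape d
    low = matvec(b, matvec(a, x))    # low-rank path: k -> r -> d
    scale = alpha / r
    return [h0 + scale * dh for h0, dh in zip(base, low)]


# Toy shapes: d = 2, k = 3, r = 1.
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]      # r x k
B = [[0.5], [-0.5]]        # d x r
x = [1.0, 2.0, 3.0]

h = lora_forward(W0, A, B, x, alpha=2.0)   # [7.0, -4.0]

# With B set to zero the update vanishes and the base output is recovered.
h_base = lora_forward(W0, A, [[0.0], [0.0]], x, alpha=2.0)
```

Only $A$ and $B$ are trained, so the number of trainable parameters per adapted matrix is $r(d + k)$ instead of $dk$.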
Phase 2 of training uses reinforcement learning on hard samples. But each subtask needs its own reward signal — bounding box accuracy, connection precision, ordering preservation, and answer correctness all require different metrics.
Total Reward:
$$R_{\text{total}} = \sum_{t \in \{\text{loc, conn, qa, list}\}} \left(\lambda_{f,t} \, R_{\text{fmt}}^{(t)} + \lambda_{a,t} \, R_{\text{acc}}^{(t)}\right)$$
$R_{\text{acc}}^{\text{(loc)}} = \text{IoU}(\text{bbox}_{\text{pred}}, \text{bbox}_{\text{gt}})$
$R_{\text{acc}}^{\text{(conn)}} = \alpha \cdot F_1(P, G) + (1 - \alpha) \cdot R_{\text{len}}$
$R_{\text{acc}}^{\text{(list)}} = \beta_1 F_1^{\text{multi}} + \beta_2 R_{\text{len}} + \beta_3 \dfrac{\text{LCS}(A,B)}{\max(|A|,|B|)}$
$R_{\text{acc}}^{\text{(qa)}} = \mathbb{1}(\text{answer}_{\text{pred}} = \text{answer}_{\text{gt}})$
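The accuracy rewards above can be sketched in a few short functions. Several details are not given in this text and are filled with labeled assumptions: boxes are taken as `(x, y, w, h)` with `(x, y)` the top-left corner, $R_{\text{len}}$ is modeled as the ratio of the shorter to the longer length, $F_1^{\text{multi}}$ is approximated by a set-based $F_1$, and the default weights `alpha` and `betas` are placeholders rather than the paper's values.

```python
def iou(a, b):
    """IoU of two (x, y, w, h) boxes; (x, y) assumed to be the top-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def f1(pred, gold):
    """Set-based F1 between predicted and ground-truth names."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def length_reward(n_pred, n_gold):
    """ASSUMED form of R_len: shorter length divided by longer length."""
    longest = max(n_pred, n_gold)
    return 1.0 if longest == 0 else min(n_pred, n_gold) / longest


def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def conn_reward(pred, gold, alpha=0.8):  # alpha is a placeholder weight
    return alpha * f1(pred, gold) + (1 - alpha) * length_reward(len(set(pred)), len(set(gold)))


def list_reward(pred, gold, betas=(0.5, 0.2, 0.3)):  # placeholder weights
    b1, b2, b3 = betas
    longest = max(len(pred), len(gold))
    order = 1.0 if longest == 0 else lcs_len(pred, gold) / longest
    return b1 * f1(pred, gold) + b2 * length_reward(len(pred), len(gold)) + b3 * order


def qa_reward(pred, gold):
    """Exact-match indicator on the final answer."""
    return 1.0 if pred.strip() == gold.strip() else 0.0


def total_reward(fmt, acc, lam_fmt, lam_acc):
    """Weighted sum of format and accuracy rewards over the tasks in `acc`."""
    return sum(lam_fmt[t] * fmt[t] + lam_acc[t] * acc[t] for t in acc)
```

The LCS term is what makes the listing reward sensitive to order: swapping two components leaves the set-based $F_1$ unchanged but shortens the common subsequence. Note that the mixing weight $\alpha$ here is unrelated to the LoRA scaling factor of the same name.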
On the DiagramNet evaluation benchmark (100 difficult diagrams from the 2025 EDA Elite Challenge), DiagramNet-3B achieves an overall score of 0.671 — surpassing the competition winner and outperforming GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2× end-to-end.
Evaluation Score Formulas:
$$\text{Score}_{\text{Task1}} = 0.4 \, S_1 + 0.2 \, S_2 + 0.4 \, S_3, \qquad \text{Score}_{\text{overall}} = 0.6 \, \text{Score}_{\text{Task1}} + 0.4 \, \text{Score}_{\text{Task2}}$$
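Substituting the first formula into the second gives the effective weight of each term in the overall score: $0.24$ for $S_1$, $0.12$ for $S_2$, $0.24$ for $S_3$, and $0.40$ for Task 2. The short sketch below evaluates both formulas. The sub-scores passed in are made-up inputs for illustration, not results from the paper.

```python
def task1_score(s1: float, s2: float, s3: float) -> float:
    """Score_Task1 = 0.4*S1 + 0.2*S2 + 0.4*S3."""
    return 0.4 * s1 + 0.2 * s2 + 0.4 * s3


def overall_score(task1: float, task2: float) -> float:
    """Score_overall = 0.6*Score_Task1 + 0.4*Score_Task2."""
    return 0.6 * task1 + 0.4 * task2


# Hypothetical sub-scores, chosen only to show the arithmetic.
t1 = task1_score(s1=0.8, s2=0.5, s3=0.6)   # 0.32 + 0.10 + 0.24
total = overall_score(t1, task2=0.7)       # 0.6 * t1 + 0.28
```

Since $S_2$ carries half the weight of $S_1$ or $S_3$, a gain of a given size in $S_1$ or $S_3$ moves the overall score twice as much as the same gain in $S_2$.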
The multi-agent workflow is model-agnostic: it boosts any VLM that plugs into it. And the trained model transfers to entirely different circuit benchmarks with only 60 adaptation images — matching GPT-5 and Claude-Sonnet-4 on zero-shot connectivity reasoning.