An Interactive Reading of

DiagramNet: An End-to-End Recognition Framework for System-Level Diagrams

The paper, in plain English

Chip architects draw block diagrams to plan how processors, memory controllers, and peripherals talk to each other. These system-level diagrams are the blueprints of every chip, but unlike circuit schematics they use non-standardized symbols and inconsistent wiring conventions, and they vary widely from company to company. No existing AI model can reliably read them.

The authors build DiagramNet: the first dataset of 1,000 annotated system-level diagrams (10,977 connection pairs, 15,515 QA pairs) and a three-agent AI pipeline that decomposes the problem. A Perception Agent detects components with YOLO. A Reasoning Agent (a 3B-parameter vision-language model) predicts how they connect. A Knowledge Agent answers circuit questions via LoRA adapters. The whole system is trained through supervised fine-tuning, reinforcement learning, and low-rank adaptation.

The headline result: their 3B-parameter model achieves an overall score of 0.671, surpassing the 2025 EDA Elite Challenge winner and outperforming GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2× in end-to-end evaluation. The multi-agent workflow alone boosts Gemini-2.5-Pro's Task 1 score by 128.7×. With only 60 adaptation images, the system transfers to a completely different circuit benchmark.

I
Task Decomposition
Four subtasks across three semantic levels: Listing, Localization, Connection, and Circuit QA — turning a messy visual problem into structured, trainable steps.
II
Multi-Agent Workflow
Perception + Reasoning + Knowledge agents decouple detection from topological reasoning, eliminating the visual grounding bottleneck that cripples end-to-end MLLMs.
III
Compound RL Rewards
Task-specific rewards — IoU for localization, F1 + length penalty for connections, LCS for ordering — trained on hard samples via reinforcement learning to improve robustness.
~ 20 minutes · 8 chapters · 7 interactive visualizations
CHAPTER 1

Why system-level diagrams are hard

Transistor-level schematics have standard symbols. Gate-level netlists have standard formats. But system-level diagrams — the architectural blueprints that sit at the top of the chip design hierarchy — are the Wild West. No two companies draw them the same way.

Non-standardized Symbols
No upper limit on symbol categories. Even functionally identical components can appear in wildly different visual forms across sources. Template-based detection is infeasible.
Impact: A "memory controller" block looks different in every diagram.
Implicit Connectivity
Wires may be directed or undirected. Jump labels, connection markers, and crossing conventions are all non-standardized across organizations.
Impact: A crossing might mean "these wires connect" or "these wires don't" — and there's no universal convention.
Semantic Gap
Diagrams represent abstract architectures, not real circuits. Understanding them requires deep circuit knowledge beyond visual pattern matching.
Impact: You can't just "see" the connections — you need to reason about what the blocks mean.
Why this matters
The fourth challenge is data scarcity. No annotated system-level diagram dataset existed before this work. Existing circuit datasets are locked behind NDAs or lack structured annotations. Manual annotation requires rare domain expertise and is extremely time-consuming — precisely the gap DiagramNet fills.
Next: how the problem is decomposed into trainable subtasks
CHAPTER 2

Four subtasks, three levels

You can't ask a model to "read the diagram" in one shot — the output space is too large and the spatial constraints are too complex. The paper decomposes the problem into four well-defined subtasks across three semantic levels.

$$\begin{aligned} \text{Listing:} \quad & f_{\text{list}} : I \to C = \{c_1, \ldots, c_n\} \\[4pt] \text{Localization:} \quad & f_{\text{loc}} : (I, c_i) \to b_i \in [0,1]^4 \\[4pt] \text{Connection:} \quad & f_{\text{conn}} : (I, c_i, C) \to T_i \subseteq C \\[4pt] \text{Circuit QA:} \quad & f_{\text{qa}} : (I, q) \to (r, a) \end{aligned}$$

Here $I$ is the input image, $C$ is the component set ordered by row-major position index, $b_i = (x, y, w, h)$ is a normalized bounding box, $T_i \subseteq C \setminus \{c_i\}$ is the set of output targets from $c_i$, and $(r, a)$ denotes stepwise reasoning followed by the final answer.
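To make the four signatures concrete, here is a minimal sketch of a container for their outputs. The class name `DiagramParse` and its fields are hypothetical, chosen for illustration and not taken from the paper. Components are indexed by row-major position rather than by name, on the assumption that two blocks in one diagram may share a label.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Normalized (x, y, w, h), each value in [0, 1], as in the Localization task.
BBox = Tuple[float, float, float, float]


@dataclass
class DiagramParse:
    """Hypothetical container for Listing, Localization and Connection outputs."""
    components: List[str]                     # C, in row-major order
    boxes: List[BBox]                         # b_i, aligned with components
    targets: Dict[int, Set[int]] = field(default_factory=dict)  # i -> T_i

    def validate(self) -> None:
        n = len(self.components)
        if len(self.boxes) != n:
            raise ValueError("need exactly one box per component")
        for box in self.boxes:
            if not all(0.0 <= v <= 1.0 for v in box):
                raise ValueError(f"box {box} is not normalized to [0, 1]")
        for src, dsts in self.targets.items():
            if not 0 <= src < n:
                raise ValueError(f"unknown source index {src}")
            # T_i must be a subset of C without c_i: no self-loops, no unknown ids.
            if src in dsts or any(not 0 <= d < n for d in dsts):
                raise ValueError(f"invalid targets for component {src}")

    def edges(self) -> List[Tuple[str, str]]:
        """Flatten the per-component targets into directed (source, target) pairs."""
        return [
            (self.components[s], self.components[d])
            for s in sorted(self.targets)
            for d in sorted(self.targets[s])
        ]
```

Circuit QA is left out of the container because its output $(r, a)$ is free text rather than structure.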

Three semantic levels organize these tasks:

Interactive: The task hierarchy

Click each level to explore its subtasks and see how data flows through the pipeline.

Why this matters
Directly predicting the full circuit connection topology in one shot is difficult for current MLLMs. By splitting the problem into four subtasks, each with its own training objective and evaluation metric, the paper makes progress measurable at every level. You can see exactly where the pipeline fails — and fix that specific stage.
Next: the dataset that makes it all possible
CHAPTER 3

The DiagramNet dataset

No dataset existed for system-level diagram understanding. So the authors built one: 1,000 diagrams from major chip design and computer architecture venues, with 10,977 connection annotations and 15,515 chain-of-thought QA pairs.

1,000
Annotated system-level diagrams from chip design and architecture venues
10,977
Connection pair annotations across all diagrams
15,515
Chain-of-thought QA pairs spanning seven circuit domains

Interactive: Dataset comparison

Compare DiagramNet with existing AMS circuit datasets across key metrics. Charts update as you toggle categories.

Categories: Connections · QA Pairs · Scope (tasks)
Why this matters
Previous datasets like AMSBench and Netlistify focus on standardized AMS schematics — circuits drawn from fixed component libraries. DiagramNet targets a completely different abstraction level: the architectural diagrams that sit at the top of the design hierarchy. The annotations cover all components per diagram (not just one), with accurate spatial indices and multimodal QA with chain-of-thought reasoning.
Next: the three agents that read the diagram
CHAPTER 4

Three agents, one diagram

End-to-end MLLMs suffer from visual grounding bottlenecks and spatial hallucinations on dense diagrams. The solution: decompose recognition into three specialized agents, each handling one aspect of the problem.

Interactive: Multi-agent architecture

Click each agent to see its role, inputs, and outputs. Arrows show data flow.

Algorithm 1 from the paper summarises the inference procedure:

  1. Perception Agent: Detect bounding boxes $B \leftarrow \text{DETECT}(I)$, sort row-major $B \leftarrow \text{SORT\_ROW\_MAJOR}(B)$, extract component names $C \leftarrow \text{EXTRACT\_NAMES}(I, B)$.
  2. Reasoning Agent: For each component $c_i \in C$, predict output connections $T_i \leftarrow f_{\text{conn}}(I, c_i, C)$. Accumulate edges to build directed topology graph $G = (C, E)$.
  3. Knowledge Agent: Answer queries $A \leftarrow f_{\text{qa}}(I, Q; \theta_{\text{LoRA}})$ using task-specific LoRA weights.
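The three steps above can be sketched as a short orchestration function. This is an illustration under stated assumptions, not the authors' implementation: the callables `detect`, `extract_names`, `predict_targets`, and `answer` are hypothetical stand-ins for the YOLO detector, the name extractor, the Reasoning Agent, and the LoRA-adapted Knowledge Agent, and grouping boxes into rows with a `row_tol` tolerance is one plausible reading of "row-major" that the page does not specify.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # normalized (x, y, w, h)


def sort_row_major(boxes: Sequence[BBox], row_tol: float = 0.05) -> List[BBox]:
    """Order boxes top-to-bottom, then left-to-right within a row.

    Assumption: boxes whose y differs from the row's first box by at most
    row_tol belong to the same row.
    """
    rows: List[List[BBox]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


def run_pipeline(
    image: Any,
    questions: Sequence[str],
    detect: Callable[[Any], Sequence[BBox]],
    extract_names: Callable[[Any, Sequence[BBox]], List[str]],
    predict_targets: Callable[[Any, int, List[str]], Sequence[int]],
    answer: Callable[[Any, str], str],
) -> Dict[str, Any]:
    # 1. Perception Agent: detect, sort row-major, name the components.
    boxes = sort_row_major(detect(image))
    names = extract_names(image, boxes)

    # 2. Reasoning Agent: one query per component, edges accumulated into E.
    edges: List[Tuple[int, int]] = []
    for i in range(len(names)):
        for j in predict_targets(image, i, names):
            if j != i and (i, j) not in edges:  # drop self-loops and duplicates
                edges.append((i, j))

    # 3. Knowledge Agent: answer each query independently.
    answers = [answer(image, q) for q in questions]
    return {"components": names, "edges": edges, "answers": answers}
```

Passing the agents in as callables mirrors the decoupling the chapter describes: any one of them can be replaced without touching the other two.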
Why this matters
The multi-agent workflow provides a 128.7× improvement for Gemini-2.5-Pro, 12.4× for GPT-5, and 1.7× for Claude-Sonnet-4 on Task 1, all without retraining these models. The workflow is model-agnostic: it decomposes the problem so that any VLM benefits from structured perception, regardless of its underlying architecture.
Next: the three-phase training recipe
CHAPTER 5

Supervised, reinforced, adapted

The Reasoning Agent is trained through a progressive three-phase pipeline: supervised fine-tuning builds base competence, reinforcement learning on hard samples improves robustness, and LoRA adaptation specialises the model for specific downstream tasks.

Phase 1 — Supervised Fine-Tuning Loss:

$$\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{L} \log P(y_i \mid X_v, X_t, y_{<i})$$
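As a rough illustration of the Phase 1 objective, the sketch below computes a next-token negative log-likelihood from per-step probability distributions. It assumes the loss is the standard autoregressive one, summed over the $L$ output tokens. The function name `sft_loss` and the list-of-distributions input are illustrative; a real trainer would work on model logits.

```python
import math
from typing import Sequence


def sft_loss(step_probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Negative log-likelihood of the target tokens.

    step_probs[i] is the model's distribution over the vocabulary at step i,
    already conditioned on the inputs and the previous target tokens.
    targets[i] is the index of the correct token y_i.
    """
    if len(step_probs) != len(targets):
        raise ValueError("one distribution per target token is required")
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))


# Two-step toy example over a two-token vocabulary.
loss = sft_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

The loss is zero only when every correct token receives probability 1, and it grows as probability mass moves to wrong tokens.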

Phase 3 — LoRA Forward Pass:

$$h = W_0 x + \Delta W \, x = W_0 x + \frac{\alpha}{r} BAx$$

Where $W_0 \in \mathbb{R}^{d \times k}$ are frozen base weights, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices with $r \ll \min(d, k)$.
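The forward pass is small enough to write out in plain Python. This sketch uses nested lists instead of a tensor library to stay self-contained; the helper names are illustrative, and the rank $r$ is read from the number of rows of $A$.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """h = W0 x + (alpha / r) * B (A x), with W0 frozen and A, B trainable."""
    r = len(a)                       # A is r x k, so its row count is the rank
    base = matvec(w0, x)             # frozen path, shape d
    delta = matvec(b, matvec(a, x))  # low-rank path; the d x k product BA is never formed
    scale = alpha / r
    return [h0 + scale * dh for h0, dh in zip(base, delta)]


def trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


# d = 2, k = 3, r = 1
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]
b = [[1.0], [2.0]]
h = lora_forward(w0, a, b, [1.0, 2.0, 3.0], alpha=2.0)
```

For a square layer with $d = k = 4096$ and $r = 8$ (illustrative sizes, not from the paper), `trainable_params` gives 65,536 values against 16,777,216 in the frozen matrix, a 256× reduction.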

Interactive: Training pipeline ablation

Toggle each training phase on or off to see its contribution. Charts update instantly.

S2 (Output Count)
0.828
S3 (Connection)
0.735
Task 1
0.855
Overall
0.671
Why this matters
Table 4 from the paper tells the story: the base Qwen2.5-VL-3B scores 0.447 overall. Each phase adds measurable improvement — SFT brings it to 0.610, RL to 0.650, LoRA to 0.671. The final model improves S2 by 2.2× and S3 by 3.6× over the base. S1 stays fixed at 0.988 because it's handled by a separate YOLO detector throughout.
Next: the reward functions that drive reinforcement learning
CHAPTER 6

Compound reward design

Phase 2 of training uses reinforcement learning on hard samples. But each subtask needs its own reward signal — bounding box accuracy, connection precision, ordering preservation, and answer correctness all require different metrics.

Total Reward:

$$R_{\text{total}} = \sum_{t \in \{\text{loc, conn, qa, list}\}} \left(\lambda_{f,t} \, R_{\text{fmt}}^{(t)} + \lambda_{a,t} \, R_{\text{acc}}^{(t)}\right)$$

Localization Reward

$R_{\text{acc}}^{\text{(loc)}} = \text{IoU}(\text{bbox}_{\text{pred}}, \text{bbox}_{\text{gt}})$

Connection Reward

$R_{\text{acc}}^{\text{(conn)}} = \alpha \cdot F_1(P, G) + (1 - \alpha) \cdot R_{\text{len}}$

Listing Reward

$R_{\text{acc}}^{\text{(list)}} = \beta_1 F_1^{\text{multi}} + \beta_2 R_{\text{len}} + \beta_3 \dfrac{\text{LCS}(A,B)}{\max(|A|,|B|)}$

QA Reward

$R_{\text{acc}}^{\text{(qa)}} = \mathbb{1}(\text{answer}_{\text{pred}} = \text{answer}_{\text{gt}})$
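The localization, connection, and QA rewards above can be sketched in a few lines. Several details are assumptions because the page does not define them: $(x, y)$ is treated as the top-left corner of the box, $R_{\text{len}}$ is taken to be the ratio of the shorter to the longer list, and the default `alpha=0.8` is arbitrary.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, w, h); (x, y) ASSUMED top-left


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def f1_sets(pred: Sequence[str], gold: Sequence[str]) -> float:
    """F1 between predicted and ground-truth target sets."""
    p, g = set(pred), set(gold)
    if not p and not g:
        return 1.0  # nothing to find and nothing predicted
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


def length_reward(n_pred: int, n_gold: int) -> float:
    """ASSUMED form of R_len: shorter length divided by longer length."""
    if n_pred == n_gold:
        return 1.0
    return min(n_pred, n_gold) / max(n_pred, n_gold)


def connection_reward(pred: Sequence[str], gold: Sequence[str], alpha: float = 0.8) -> float:
    """alpha * F1 + (1 - alpha) * R_len, as in the connection reward formula."""
    return alpha * f1_sets(pred, gold) + (1 - alpha) * length_reward(len(pred), len(gold))


def qa_reward(pred: str, gold: str) -> float:
    """Indicator reward: 1 for an exact match, 0 otherwise."""
    return float(pred == gold)
```

A prediction that finds one of two true targets, `connection_reward(["DMA"], ["DMA", "SRAM"], alpha=0.5)`, scores about 0.58 under these assumptions: an F1 of 2/3 and a length ratio of 1/2, weighted equally.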

Interactive: Reward function explorer

Adjust the reward parameters to see how they shape the penalty landscape for each subtask. Charts update as you drag.

$\alpha$ slider: Length only (0) · F1 only (1)
Ordering slider: No ordering (0) · Full ordering (1)
Connection Reward @ F1=0.8
0.80
Listing Reward @ F1=0.8
0.72
Why this matters
The LCS component in the listing reward is critical: it ensures the model learns row-major ordering, not just which components are present. Without ordering consistency, the downstream Reasoning Agent would receive ambiguous input — two identically named components would be indistinguishable. The reward design directly encodes the paper's insight that spatial structure is a necessary precondition for topological reasoning.
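A small example shows why a set-based score alone cannot enforce ordering. The sketch below implements the normalized LCS term from the listing reward with standard dynamic programming; the component names are made up for illustration.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ordering_score(pred: Sequence[str], gold: Sequence[str]) -> float:
    """LCS(A, B) / max(|A|, |B|), the ordering term of the listing reward."""
    if not pred and not gold:
        return 1.0
    return lcs_length(pred, gold) / max(len(pred), len(gold))


gold = ["CPU", "Cache", "DMA", "SRAM"]
swapped = ["Cache", "CPU", "DMA", "SRAM"]    # same components, first two swapped
reversed_list = ["SRAM", "DMA", "Cache", "CPU"]

same_set = set(swapped) == set(gold)          # a set-based F1 would be perfect
score_swapped = ordering_score(swapped, gold)
score_reversed = ordering_score(reversed_list, gold)
```

Both shuffled lists contain exactly the right components, yet the swapped list scores 0.75 and the fully reversed one 0.25, so only the LCS term tells them apart from the correct row-major order.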
Next: how it all performs on the benchmark
CHAPTER 7

Benchmark results

On the DiagramNet evaluation benchmark (100 difficult diagrams from the 2025 EDA Elite Challenge), DiagramNet-3B achieves an overall score of 0.671 — surpassing the competition winner and outperforming GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2× end-to-end.

Evaluation Score Formulas:

$$\text{Score}_{\text{Task1}} = 0.4 \, S_1 + 0.2 \, S_2 + 0.4 \, S_3, \qquad \text{Score}_{\text{overall}} = 0.6 \, \text{Score}_{\text{Task1}} + 0.4 \, \text{Score}_{\text{Task2}}$$
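The two formulas can be checked against the numbers reported on this page for DiagramNet-3B: S1 = 0.988, S2 = 0.828, S3 = 0.735, and a Task 2 score of 0.395. The function names below are illustrative.

```python
def task1_score(s1: float, s2: float, s3: float) -> float:
    """Score_Task1 = 0.4 * S1 + 0.2 * S2 + 0.4 * S3."""
    return 0.4 * s1 + 0.2 * s2 + 0.4 * s3


def overall_score(task1: float, task2: float) -> float:
    """Score_overall = 0.6 * Score_Task1 + 0.4 * Score_Task2."""
    return 0.6 * task1 + 0.4 * task2


t1 = task1_score(0.988, 0.828, 0.735)
overall = overall_score(t1, 0.395)
print(round(t1, 3), round(overall, 3))
```

Rounded to three decimals, the results reproduce the reported 0.855 for Task 1 and 0.671 overall. Because S1 and S3 each carry 40% of Task 1, and Task 1 carries 60% of the total, connection identification alone accounts for 24% of the overall score.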

Interactive: Benchmark comparison

Select models to compare. Adjust S1, S2, S3 sliders to explore how the score formula weights different subtasks. Charts update as you drag.

DiagramNet-3B
3B params · multi-agent
The full pipeline: YOLO perception + 3B VLM reasoning + LoRA knowledge. Best overall score.
EDA Elite Winner
Competition winner 2025
The 2025 EDA Elite Challenge Problem Two winner. Best prior art.
GPT-5 (E2E)
Commercial MLLM
End-to-end inference with detailed prompts. Scores 0.327 overall.
Task 1 Score
0.855
Task 2 Score
0.395
Overall Score
0.671
Why this matters
The most striking result isn't that DiagramNet-3B wins overall — it's where it wins. On S3 (connection identification), the model scores 0.735 versus 0.029 for GPT-5 end-to-end — a 25× gap. Commercial MLLMs fail not because they can't reason about circuits, but because they can't reliably detect components in non-standardized diagrams. The multi-agent workflow fixes exactly this bottleneck.
Next: does it generalize beyond system-level diagrams?
CHAPTER 8

Generalization beyond DiagramNet

The multi-agent workflow is model-agnostic: it boosts any VLM that plugs into it. And the trained model transfers to entirely different circuit benchmarks with only 60 adaptation images — matching GPT-5 and Claude-Sonnet-4 on zero-shot connectivity reasoning.

Interactive: Workflow effect across models

Compare end-to-end vs. multi-agent workflow performance for each model. The gain multiplier shows how much the workflow improves Task 1 score. Charts update as you drag.

128.7×
Task 1 improvement for Gemini-2.5-Pro with multi-agent workflow
12.4×
Task 1 improvement for GPT-5 with multi-agent workflow
60
Adaptation images needed to transfer to AMSBench, matching GPT-5 on connectivity
Why this matters
The workflow is not a one-trick pony tied to DiagramNet-3B. It's a model-agnostic paradigm. Gemini-2.5-Pro with the workflow achieves a Task 1 score of 0.901 — actually surpassing DiagramNet-3B itself (0.855). The three agents are abstract functional roles; the underlying model can be swapped independently. This means the pipeline will likely benefit from future, more capable VLMs without any architectural changes.
Read the paper
arXiv:2605.01338 · arxiv.org/abs/2605.01338