An Interactive Reading of

DiagramNet: An End-to-End
Recognition Framework
for System-Level Diagrams

Jincheng Lou, Ruohan Xu, Jiapeng Li, Junyin Pi, Runzhe Tao, Weijian Fan, Xiao Tan, Guojie Luo & Yibo Lin
Peking University · Xi'an Jiaotong · Tsinghua · May 2026 · arXiv:2605.01338

The paper, in plain English

Chip architects draw block diagrams to plan how processors, memory controllers, and peripherals talk to each other. These system-level diagrams are the blueprints of every chip — but unlike circuit schematics, they use non-standardized symbols, inconsistent wiring conventions, and vary wildly across companies. No existing AI model can reliably read them.

The authors build DiagramNet: the first dataset of 1,000 annotated system-level diagrams (10,977 connection pairs, 15,515 QA pairs) and a three-agent AI pipeline that decomposes the problem. A Perception Agent detects components with YOLO. A Reasoning Agent (a 3B-parameter vision-language model) predicts how they connect. A Knowledge Agent answers circuit questions via LoRA adapters. The whole system is trained through supervised fine-tuning, reinforcement learning, and low-rank adaptation.

The headline result: their 3B-parameter model achieves an overall score of 0.671, surpassing the 2025 EDA Elite Challenge winner and outperforming GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2× in end-to-end evaluation. The multi-agent workflow alone boosts Gemini-2.5-Pro's Task 1 score by 128.7×. With only 60 adaptation images, the system transfers to a completely different circuit benchmark.

I

Task Decomposition

Four subtasks across three semantic levels: Listing, Localization, Connection, and Circuit QA — turning a messy visual problem into structured, trainable steps.

II

Multi-Agent Workflow

Perception + Reasoning + Knowledge agents decouple detection from topological reasoning, eliminating the visual grounding bottleneck that cripples end-to-end MLLMs.

III

Compound RL Rewards

Task-specific rewards — IoU for localization, F1 + length penalty for connections, LCS for ordering — trained on hard samples via reinforcement learning to improve robustness.

~ 20 minutes · 8 chapters · 7 interactive visualizations

CHAPTER 1

Why system-level diagrams are hard

Transistor-level schematics have standard symbols. Gate-level netlists have standard formats. But system-level diagrams — the architectural blueprints that sit at the top of the chip design hierarchy — are the Wild West. No two companies draw them the same way.

In plain English

Think of subway maps. London's Tube map uses straight lines and 45-degree angles. Tokyo's metro map uses curved, organic shapes. New York's subway map sits on top of a geographic street grid. All three show the same kind of thing — a transit network — but a tourist who memorised the London map would struggle to read Tokyo's at a glance.

System-level diagrams are like subway maps for chips. They show how processors, memory controllers, PLLs, and ADCs connect. But unlike subway maps, there is no standard visual language. Every chip company, every textbook, every conference paper draws them differently. Same component, ten different-looking icons. Same wire, ten different conventions for where it goes.

Now imagine asking GPT-5 to read all these subway maps and tell you which stations connect. That's roughly what this paper tackles — and why even frontier models fail at it.

❖

Non-standardized Symbols

No upper limit on symbol categories. Even functionally identical components can appear in wildly different visual forms across sources. Template-based detection is infeasible.

Impact: A "memory controller" block looks different in every diagram.

⇄

Implicit Connectivity

Wires may be directed or undirected. Jump labels, connection markers, and crossing conventions are all non-standardized across organizations.

Impact: A crossing might mean "these wires connect" or "these wires don't" — and there's no universal convention.

∆

Semantic Gap

Diagrams represent abstract architectures, not real circuits. Understanding them requires deep circuit knowledge beyond visual pattern matching.

Impact: You can't just "see" the connections — you need to reason about what the blocks mean.

Why this matters

The fourth challenge is data scarcity. No annotated system-level diagram dataset existed before this work. Existing circuit datasets are locked behind NDAs or lack structured annotations. Manual annotation requires rare domain expertise and is extremely time-consuming — precisely the gap DiagramNet fills.

Next: how the problem is decomposed into trainable subtasks →

CHAPTER 2

Four subtasks, three levels

You can't ask a model to "read the diagram" in one shot — the output space is too large and the spatial constraints are too complex. The paper decomposes the problem into four well-defined subtasks across three semantic levels.

In plain English

Imagine you're asked to describe a city's road network from a satellite photo. You wouldn't try to name every street and every intersection in one breath. You'd break it down: first identify the neighbourhoods (what's there), then locate them on the map (where they are), then trace the roads between them (how they connect), and finally answer questions like "what's the fastest route from A to B?" (reasoning about the network).

DiagramNet does the same thing for chip diagrams. Listing identifies what components are present. Localization finds where each one sits. Connection traces the wiring between them. Circuit QA answers questions that require understanding what the circuit actually does. Each level builds on the one below.

The insight: by decomposing one impossible task into four manageable ones, each subtask becomes tractable for current vision-language models.

$$\begin{aligned} \text{Listing:} \quad & f_{\text{list}} : I \to C = \{c_1, \ldots, c_n\} \\[4pt] \text{Localization:} \quad & f_{\text{loc}} : (I, c_i) \to b_i \in [0,1]^4 \\[4pt] \text{Connection:} \quad & f_{\text{conn}} : (I, c_i, C) \to T_i \subseteq C \\[4pt] \text{Circuit QA:} \quad & f_{\text{qa}} : (I, q) \to (r, a) \end{aligned}$$

Here $I$ is the input image, $C$ is the component set ordered by row-major position index, $b_i = (x, y, w, h)$ is a normalized bounding box, $T_i \subseteq C \setminus \{c_i\}$ is the set of output targets from $c_i$, and $(r, a)$ denotes stepwise reasoning followed by the final answer.
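One way to make these signatures concrete is to write down the data they pass around. The sketch below is only an illustration: the class names `BBox` and `DiagramRecord`, the use of integer row-major indices instead of names, and the example components are all assumptions, not the paper's annotation format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BBox:
    """Normalized bounding box b_i = (x, y, w, h), every value in [0, 1]."""
    x: float
    y: float
    w: float
    h: float

    def __post_init__(self) -> None:
        if not all(0.0 <= v <= 1.0 for v in (self.x, self.y, self.w, self.h)):
            raise ValueError("bounding box values must lie in [0, 1]")


@dataclass
class DiagramRecord:
    """Hypothetical container for one diagram's Listing, Localization, and Connection labels."""
    components: list[str]                     # C, in row-major order
    boxes: list[BBox]                         # b_i, aligned with components
    targets: dict[int, set[int]] = field(default_factory=dict)  # T_i by index

    def validate(self) -> None:
        n = len(self.components)
        if len(self.boxes) != n:
            raise ValueError("need exactly one box per component")
        for src, dsts in self.targets.items():
            if not 0 <= src < n:
                raise ValueError(f"unknown source index {src}")
            # T_i excludes c_i itself and must stay inside C.
            if src in dsts or any(not 0 <= d < n for d in dsts):
                raise ValueError(f"invalid targets for component {src}")


# Illustrative values only; these are not taken from the dataset.
record = DiagramRecord(
    components=["CPU", "DMA", "SRAM"],
    boxes=[BBox(0.1, 0.1, 0.2, 0.1), BBox(0.5, 0.1, 0.2, 0.1), BBox(0.3, 0.5, 0.2, 0.1)],
    targets={0: {1, 2}, 1: {2}},
)
record.validate()
```

Indexing by position rather than by name keeps two identically named blocks distinguishable, which is the reason the component set is ordered in the first place.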

Three semantic levels organize these tasks:

Perception level — Listing and Localization: extracting what's in the diagram and where it sits.
Structure level — Connection: predicting the directed topology between components.
Semantic level — Circuit QA: reasoning about circuit behavior from the extracted topology.

Interactive: The task hierarchy

Click each level to explore its subtasks and see how data flows through the pipeline.

Why this matters

Directly predicting the full circuit connection topology in one shot is difficult for current MLLMs. By splitting the problem into four subtasks, each with its own training objective and evaluation metric, the paper makes progress measurable at every level. You can see exactly where the pipeline fails — and fix that specific stage.

Next: the dataset that makes it all possible →

CHAPTER 3

The DiagramNet dataset

No dataset existed for system-level diagram understanding. So the authors built one: 1,000 diagrams from major chip design and computer architecture venues, with 10,977 connection annotations and 15,515 chain-of-thought QA pairs.

In plain English

Imagine trying to train a self-driving car with no road data — no dashcam footage, no lane markings, no traffic signs. You'd need to build a dataset first. That's exactly the situation here. No one had ever collected and annotated system-level chip diagrams at scale.

The authors scraped 1,000 diagrams from public conference papers and journals — figures that chip architects publish in papers about their designs. Then they hired domain experts to annotate every component, every connection, and to write chain-of-thought QA pairs that walk through the reasoning behind each answer.

The result: 1.8× more connections and 12.3× more QA samples than the nearest competing dataset (AMSBench), covering all components per diagram rather than just one.

1,000

Annotated system-level diagrams from chip design and architecture venues

10,977

Connection pair annotations across all diagrams

15,515

Chain-of-thought QA pairs spanning seven circuit domains

Interactive: Dataset comparison

Compare DiagramNet with existing AMS circuit datasets across three metrics: Connections, QA Pairs, and Scope (tasks).

Why this matters

Previous datasets like AMSBench and Netlistify focus on standardized AMS schematics — circuits drawn from fixed component libraries. DiagramNet targets a completely different abstraction level: the architectural diagrams that sit at the top of the design hierarchy. The annotations cover all components per diagram (not just one), with accurate spatial indices and multimodal QA with chain-of-thought reasoning.

Next: the three agents that read the diagram →

CHAPTER 4

Three agents, one diagram

End-to-end MLLMs suffer from visual grounding bottlenecks and spatial hallucinations on dense diagrams. The solution: decompose recognition into three specialized agents, each handling one aspect of the problem.

In plain English

Think of a medical diagnosis team. A radiographer takes the X-ray and marks the anatomical regions. A radiologist reads the image and identifies abnormalities. A specialist physician interprets the findings in the context of the patient's history. Each role is specialized; each builds on the previous one's output.

The DiagramNet workflow works the same way. The Perception Agent is the radiographer — it uses YOLO to detect and locate every component on the diagram. The Reasoning Agent is the radiologist — a vision-language model that looks at the diagram and each component in turn, predicting which other components it connects to. The Knowledge Agent is the specialist — it loads domain-specific knowledge (via LoRA adapters) to answer circuit questions.

Interactive: Multi-agent architecture

Click each agent to see its role, inputs, and outputs. Arrows show data flow.

Algorithm 1 from the paper summarises the inference procedure:

Perception Agent: Detect bounding boxes $B \leftarrow \text{DETECT}(I)$, sort row-major $B \leftarrow \text{SORT\_ROW\_MAJOR}(B)$, extract component names $C \leftarrow \text{EXTRACT\_NAMES}(I, B)$.
Reasoning Agent: For each component $c_i \in C$, predict output connections $T_i \leftarrow f_{\text{conn}}(I, c_i, C)$. Accumulate edges to build directed topology graph $G = (C, E)$.
Knowledge Agent: Answer queries $A \leftarrow f_{\text{qa}}(I, Q; \theta_{\text{LoRA}})$ using task-specific LoRA weights.
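The first two steps of that procedure can be sketched in a few lines. This is a rough illustration, not the authors' code: the three callables (`detect`, `extract_names`, `predict_targets`) are hypothetical stand-ins for the YOLO detector and the vision-language model, and the `row_tol` rule for grouping boxes into rows is an assumption, since the exact row-major sort is not spelled out here.

```python
from typing import Callable

Box = tuple[float, float, float, float]  # (x, y, w, h), normalized


def sort_row_major(boxes: list[Box], row_tol: float = 0.05) -> list[Box]:
    """Order boxes top-to-bottom, then left-to-right inside each row.

    Assumption: a box joins the current row when its y differs from the
    row's first box by less than row_tol.
    """
    rows: list[list[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) < row_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def build_topology(
    image: object,
    detect: Callable[[object], list[Box]],
    extract_names: Callable[[object, list[Box]], list[str]],
    predict_targets: Callable[[object, int, list[str]], set[int]],
) -> tuple[list[str], set[tuple[int, int]]]:
    """Perception then Reasoning: returns (C, E) of the directed graph G."""
    boxes = sort_row_major(detect(image))       # B <- SORT_ROW_MAJOR(DETECT(I))
    names = extract_names(image, boxes)         # C <- EXTRACT_NAMES(I, B)
    edges: set[tuple[int, int]] = set()
    for i in range(len(names)):                 # one query per component c_i
        for j in predict_targets(image, i, names):
            if j != i:                          # T_i never contains c_i itself
                edges.add((i, j))
    return names, edges


# Stub agents so the sketch runs without any model.
fake_boxes = [(0.6, 0.10, 0.1, 0.1), (0.1, 0.12, 0.1, 0.1), (0.3, 0.50, 0.1, 0.1)]
ordered = sort_row_major(fake_boxes)
names, edges = build_topology(
    image=None,
    detect=lambda img: fake_boxes,
    extract_names=lambda img, bs: [f"block_{k}" for k in range(len(bs))],
    predict_targets=lambda img, i, ns: {i + 1} if i + 1 < len(ns) else set(),
)
```

Because the reasoning step only sees indices and names, any model that can answer "what does component i feed into?" can be dropped in as `predict_targets`.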

Why this matters

The multi-agent workflow provides a 128.7× improvement for Gemini-2.5-Pro, 12.4× for GPT-5, and 1.7× for Claude-Sonnet-4 on Task 1, all without retraining these models. The workflow is model-agnostic: it decomposes the problem so that any VLM benefits from structured perception, regardless of its underlying architecture.

Next: the three-phase training recipe →

CHAPTER 5

Supervised, reinforced, adapted

The Reasoning Agent is trained through a progressive three-phase pipeline: supervised fine-tuning builds base competence, reinforcement learning on hard samples improves robustness, and LoRA adaptation specialises the model for specific downstream tasks.

In plain English

Think of training a chef. Phase 1 (SFT) is culinary school: the chef learns every standard technique from textbooks and demonstrations. Phase 2 (RL) is the pressure test: you send the chef into a kitchen during a dinner rush, where dishes get undercooked and burned, and they learn to handle the chaos. You deliberately pick the hardest, most stressful scenarios. Phase 3 (LoRA) is the specialist stage: the chef already knows how to cook, but now they learn one specific cuisine (Japanese kaiseki, say) by adding a small set of new techniques on top of what they already know.

That's exactly what happens here. The model first learns from labelled examples, then gets pushed on the hardest diagrams via reinforcement learning, and finally gets a lightweight LoRA adapter for Circuit QA.

Phase 1 — Supervised Fine-Tuning Loss:

$$\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{L} \log P(y_i \mid X_v, X_t, y_{<i})$$
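The supervised fine-tuning loss is a next-token negative log-likelihood: at each of the $L$ output positions, the model is penalized by the negative log-probability it gave the correct token. A minimal pure-Python sketch follows; the function names and the toy logits are assumptions for illustration, and a real training loop would compute this inside a deep learning framework.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def sft_loss(step_logits: list[list[float]], target_ids: list[int]) -> float:
    """Sum over positions i of -log P(y_i | image, prompt, earlier tokens).

    step_logits[i] holds the model's scores for position i, already
    conditioned on everything before it; target_ids[i] is the gold token y_i.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one target token per position")
    return -sum(log_softmax(z)[t] for z, t in zip(step_logits, target_ids))


# Toy case: a 4-token vocabulary and a model that is uniformly unsure.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss_uniform = sft_loss(uniform, [2, 0, 1])       # equals 3 * ln(4)

# A model that strongly prefers the gold token pays almost nothing.
confident = [[10.0, 0.0, 0.0, 0.0]] * 3
loss_confident = sft_loss(confident, [0, 0, 0])
```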

Phase 3 — LoRA Forward Pass:

$$h = W_0 x + \Delta W \, x = W_0 x + \frac{\alpha}{r} BAx$$

Where $W_0 \in \mathbb{R}^{d \times k}$ are frozen base weights, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices with $r \ll \min(d, k)$.
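The forward pass above is small enough to write out by hand. The sketch below uses plain Python lists rather than a tensor library so it stays self-contained; the helper names and the tiny matrices are illustrative assumptions, not values from the paper.

```python
Matrix = list[list[float]]
Vector = list[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """h = W0 x + (alpha / r) * B (A x), with r read from the shape of A."""
    r = len(a)                          # A is r x k, so r is its row count
    if any(len(row) != r for row in b):
        raise ValueError("B must be d x r to match A")
    base = matvec(w0, x)                # frozen path
    update = matvec(b, matvec(a, x))    # trainable path through the rank-r bottleneck
    scale = alpha / r
    return [h0 + scale * du for h0, du in zip(base, update)]


# d = 2, k = 3, r = 1 (toy sizes).
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0]

# With B = 0 the adapter contributes nothing, so the output is just W0 x.
h_init = lora_forward(w0, a, [[0.0], [0.0]], x, alpha=2.0)

# After B has moved away from zero, the low-rank update shifts the output.
h_tuned = lora_forward(w0, a, [[1.0], [2.0]], x, alpha=2.0)
```

Only `a` and `b` would receive gradients; with $r \ll \min(d, k)$ that is $r(d + k)$ trainable values instead of $dk$.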

Interactive: Training pipeline ablation

Toggle each training phase on or off to see its contribution. With all three phases enabled, the scores are:

S2 (Output Count): 0.828
S3 (Connection): 0.735
Task 1: 0.855
Overall: 0.671

Why this matters

Table 4 from the paper tells the story: the base Qwen2.5-VL-3B scores 0.447 overall. Each phase adds measurable improvement — SFT brings it to 0.610, RL to 0.650, LoRA to 0.671. The final model improves S2 by 2.2× and S3 by 3.6× over the base. S1 stays fixed at 0.988 because it's handled by a separate YOLO detector throughout.

Next: the reward functions that drive reinforcement learning →

CHAPTER 6

Compound reward design

Phase 2 of training uses reinforcement learning on hard samples. But each subtask needs its own reward signal — bounding box accuracy, connection precision, ordering preservation, and answer correctness all require different metrics.

In plain English

Imagine grading a student's architecture exam. You don't give one overall score. You grade the floor plan (is each room in the right place?), the structural integrity (do the load-bearing walls line up?), and the aesthetics (does it look good?). Each dimension gets its own rubric, and the final grade is a weighted combination.

DiagramNet does the same for its RL rewards. Localization is graded by IoU — how much the predicted bounding box overlaps the ground truth. Connection is graded by F1 score plus a length penalty that discourages guessing too many or too few connections. Listing uses F1 plus LCS (Longest Common Subsequence) to reward correct ordering. Circuit QA uses exact string matching. The total reward is a weighted sum across all tasks.

Total Reward:

$$R_{\text{total}} = \sum_{t \in \{\text{loc, conn, qa, list}\}} \left(\lambda_{f,t} \, R_{\text{fmt}}^{(t)} + \lambda_{a,t} \, R_{\text{acc}}^{(t)}\right)$$

Localization Reward

$R_{\text{acc}}^{\text{(loc)}} = \text{IoU}(\text{bbox}_{\text{pred}}, \text{bbox}_{\text{gt}})$

Connection Reward

$R_{\text{acc}}^{\text{(conn)}} = \alpha \cdot F_1(P, G) + (1 - \alpha) \cdot R_{\text{len}}$

Listing Reward

$R_{\text{acc}}^{\text{(list)}} = \beta_1 F_1^{\text{multi}} + \beta_2 R_{\text{len}} + \beta_3 \dfrac{\text{LCS}(A,B)}{\max(|A|,|B|)}$

QA Reward

$R_{\text{acc}}^{\text{(qa)}} = \mathbb{1}(\text{answer}_{\text{pred}} = \text{answer}_{\text{gt}})$
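Three of these accuracy rewards can be sketched directly. Several details below are assumptions, because the text above does not pin them down: boxes are taken as top-left `(x, y, w, h)`, the length term `R_len` is modelled as the ratio of the smaller to the larger set size, and `alpha=0.5` is a placeholder default rather than the paper's setting.

```python
Box = tuple[float, float, float, float]  # assumed top-left (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Localization reward: intersection over union of two boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def f1(pred: set[str], gold: set[str]) -> float:
    """Set-level F1 between predicted and ground-truth targets."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def length_reward(pred: set[str], gold: set[str]) -> float:
    """ASSUMED form of R_len: 1.0 when the counts match, lower otherwise."""
    longest = max(len(pred), len(gold))
    return min(len(pred), len(gold)) / longest if longest else 1.0


def connection_reward(pred: set[str], gold: set[str], alpha: float = 0.5) -> float:
    """alpha * F1(P, G) + (1 - alpha) * R_len."""
    return alpha * f1(pred, gold) + (1 - alpha) * length_reward(pred, gold)


def qa_reward(pred: str, gold: str) -> float:
    """Indicator reward: 1.0 only on an exact string match."""
    return 1.0 if pred == gold else 0.0


same_box = iou((0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.5))
shifted_box = iou((0.0, 0.0, 0.5, 0.5), (0.25, 0.0, 0.5, 0.5))
half_right = connection_reward({"a", "b"}, {"b", "c"}, alpha=0.5)
```

In the last line the model guessed the right number of targets but only one of them correctly, so the F1 term is 0.5 while the assumed length term is 1.0.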

Interactive: Reward function explorer

Adjust the reward parameters to see how they shape the penalty landscape for each subtask.

α (F1 vs. length penalty for Connection): from length only (0) to F1 only (1)
β₃ (LCS ordering weight for Listing): from no ordering (0) to full ordering (1)

Connection Reward @ F1=0.8: 0.80
Listing Reward @ F1=0.8: 0.72

Why this matters

The LCS component in the listing reward is critical: it ensures the model learns row-major ordering, not just which components are present. Without ordering consistency, the downstream Reasoning Agent would receive ambiguous input — two identically named components would be indistinguishable. The reward design directly encodes the paper's insight that spatial structure is a necessary precondition for topological reasoning.
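To see why the LCS term matters, compare a listing that has every component right but two of them swapped. A set-based F1 cannot tell the difference; the normalized LCS can. The sketch below implements only the $\text{LCS}(A,B)/\max(|A|,|B|)$ term, and the component names are made up for illustration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def ordering_score(pred: list[str], gold: list[str]) -> float:
    """LCS(A, B) / max(|A|, |B|): 1.0 only when content and order both match."""
    longest = max(len(pred), len(gold))
    return lcs_length(pred, gold) / longest if longest else 1.0


gold = ["CPU", "DMA", "SRAM", "PLL"]       # row-major ground truth (illustrative)
swapped = ["DMA", "CPU", "SRAM", "PLL"]    # same set, first two out of order

perfect = ordering_score(gold, gold)
penalized = ordering_score(swapped, gold)
same_set = set(swapped) == set(gold)       # a set-based metric sees no error
```

Here `penalized` comes out at 0.75 even though the predicted set is identical to the gold set, which is exactly the signal the ordering term is meant to provide.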

Next: how it all performs on the benchmark →

CHAPTER 7

Benchmark results

On the DiagramNet evaluation benchmark (100 difficult diagrams from the 2025 EDA Elite Challenge), DiagramNet-3B achieves an overall score of 0.671 — surpassing the competition winner and outperforming GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2× end-to-end.

In plain English

Imagine a reading comprehension test where you're shown a complex subway map and asked to list every station, mark where each one is, trace every route, and answer questions like "what's the fastest path from Station A to Station B?" Now imagine GPT-5 scores 0.327 and Gemini-2.5-Pro scores 0.278 on this test — barely better than random.

The reason they fail isn't that they're bad at reasoning. It's that they can't see the stations in the first place. GPT-5 gets 0.085 on S1 (component detection) and Gemini-2.5-Pro gets 0.008. They literally cannot find the components on the diagram. With the multi-agent workflow providing a dedicated detector, the same models jump to Task 1 scores of 0.733 and 0.901 respectively.

Evaluation Score Formulas:

$$\text{Score}_{\text{Task1}} = 0.4 \, S_1 + 0.2 \, S_2 + 0.4 \, S_3, \qquad \text{Score}_{\text{overall}} = 0.6 \, \text{Score}_{\text{Task1}} + 0.4 \, \text{Score}_{\text{Task2}}$$
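Plugging DiagramNet-3B's reported sub-scores into these formulas is a quick sanity check. The function names below are just illustrative; the weights and input values are the ones quoted in this chapter.

```python
def task1_score(s1: float, s2: float, s3: float) -> float:
    """Score_Task1 = 0.4*S1 + 0.2*S2 + 0.4*S3."""
    return 0.4 * s1 + 0.2 * s2 + 0.4 * s3


def overall_score(task1: float, task2: float) -> float:
    """Score_overall = 0.6*Score_Task1 + 0.4*Score_Task2."""
    return 0.6 * task1 + 0.4 * task2


# Reported DiagramNet-3B values: S1 = 0.988, S2 = 0.828, S3 = 0.735, Task 2 = 0.395.
task1 = task1_score(0.988, 0.828, 0.735)
overall = overall_score(task1, 0.395)

print(round(task1, 3), round(overall, 3))
```

The two printed values round to 0.855 and 0.671, matching the Task 1 and overall scores reported for the full pipeline. Detection (S1) and connection (S3) carry 40% of Task 1 each, so a model that cannot find components loses most of the score before reasoning even starts.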

Interactive: Benchmark comparison

Select models to compare. Adjust the S1, S2, and S3 sliders to explore how the score formula weights different subtasks.

DiagramNet-3B

3B params · multi-agent

The full pipeline: YOLO perception + 3B VLM reasoning + LoRA knowledge. Best overall score.

EDA Elite Winner

Competition winner 2025

The 2025 EDA Elite Challenge Problem Two winner. Best prior art.

GPT-5 (E2E)

Commercial MLLM

End-to-end inference with detailed prompts. Scores 0.327 overall.

DiagramNet-3B scores:

S1 (Detection F1): 0.988
S2 (Output Count F1): 0.828
S3 (Connection F1): 0.735
Task 1 Score: 0.855
Task 2 Score: 0.395
Overall Score: 0.671

Why this matters

The most striking result isn't that DiagramNet-3B wins overall — it's where it wins. On S3 (connection identification), the model scores 0.735 versus 0.029 for GPT-5 end-to-end — a 25× gap. Commercial MLLMs fail not because they can't reason about circuits, but because they can't reliably detect components in non-standardized diagrams. The multi-agent workflow fixes exactly this bottleneck.

Next: does it generalize beyond system-level diagrams? →

CHAPTER 8

Generalization beyond DiagramNet

The multi-agent workflow is model-agnostic: it boosts any VLM that plugs into it. And the trained model transfers to entirely different circuit benchmarks with only 60 adaptation images — matching GPT-5 and Claude-Sonnet-4 on zero-shot connectivity reasoning.

In plain English

Imagine you trained a radiographer to read chest X-rays, and then you sent them to read dental X-rays. The images look completely different — different bones, different structures, different conventions. But the skill of "systematically scanning an image and identifying regions of interest" transfers.

That's what happens here. DiagramNet-3B was trained on system-level chip diagrams — block-level architectural blueprints. The authors tested it on AMSBench, which uses completely different analog-mixed-signal circuit schematics. With only 60 images to adapt the YOLO detector (not the reasoning model), it matches GPT-5 on connectivity reasoning and outperforms the AMS-specific state-of-the-art method Netlistify by 6.7%.

This suggests the model isn't just memorising system-level diagrams — it's learning general topological reasoning that transfers across circuit types.

Interactive: Workflow effect across models

Compare end-to-end vs. multi-agent workflow performance for each model. The gain multiplier shows how much the workflow improves Task 1 score.

128.7×

Task 1 improvement for Gemini-2.5-Pro with multi-agent workflow

12.4×

Task 1 improvement for GPT-5 with multi-agent workflow

60

Images needed to adapt the YOLO detector to AMSBench; the reasoning model is not retrained and matches GPT-5 on connectivity

Why this matters

The workflow is not a one-trick pony tied to DiagramNet-3B. It's a model-agnostic paradigm. Gemini-2.5-Pro with the workflow achieves a Task 1 score of 0.901 — actually surpassing DiagramNet-3B itself (0.855). The three agents are abstract functional roles; the underlying model can be swapped independently. This means the pipeline will likely benefit from future, more capable VLMs without any architectural changes.

Read the paper

arXiv:2605.01338 · arxiv.org/abs/2605.01338
