An Interactive Reading of

VideoSEAL: Mitigating
Evidence Misalignment
in Agentic Long Video Understanding

Decoupling Answer Authority via a Planner–Inspector Framework

Chenhao Qiu, Yechao Zhang, Xin Luo, Shien Song, Xusheng Liu
Mango TV · NTU Singapore · May 2026 · arXiv:2605.12571

The paper, in plain English

Ask an AI agent a question about a 45-minute video — “What colour is the vlogger’s car?” — and it might return the right answer without ever finding the clip where the car is actually visible. The agent leans on its training data and common-sense priors, not on what it saw. The authors call this evidence misalignment: the answer is correct, but the interaction trace — the chain of retrieved clips and tool outputs — cannot justify it.

The misalignment has two root causes. During training, reinforcement learning rewards correct answers without checking whether the agent looked at the right evidence (reward pressure). During inference, longer and longer tool traces push the agent toward committing to a plausible answer rather than continuing to search (prompt pressure). Both stem from a single structural flaw: one monolithic model handles searching, inspecting, and answering in the same context window.

The fix is architectural. Split the agent into a planner that navigates the video timeline and an inspector that verifies candidate clips and holds exclusive authority to output the final answer. Across four long-video benchmarks, this decoupled design improves LVBench accuracy from 48.2% to 55.1%, cuts semantic hallucination from 41.4% to 11.3%, and scales with larger search budgets — while the coupled baseline actually regresses when given more steps.

I

Evidence Misalignment

Agents can produce correct answers that their own interaction traces cannot support — right for the wrong reasons.

II

Dual Pressure Diagnosis

Reward pressure at training time and prompt pressure at inference time both push agents toward shortcut-driven answering.

III

Decoupled Architecture

Separate the planner (search) from the inspector (verify & answer), gating final output on pixel-level evidence.

Chapter 1

The Right Answer, Wrong Reasons

Long-video QA agents can score well on benchmarks while their interaction traces reveal they never actually found the relevant evidence. How is that possible?

In plain English

Imagine a student taking an open-book history exam. The question asks what happened during a specific 30-second window of a 90-minute documentary. The student flips through pages, doesn’t find the passage, but writes the correct answer anyway because they remembered it from a different class. The grader gives full marks — but the student’s “workings” don’t support the answer.

That’s exactly what happens in agentic long-video understanding. The agent retrieves candidate clips, inspects frames, builds an interaction trace — and then guesses correctly using prior knowledge instead of grounded evidence. The paper calls this evidence misalignment: the (C=1, G=0) regime where correctness and groundedness diverge.


The Four Regimes of Agent Behaviour

Every agent trajectory can be classified along two axes: outcome correctness $C \in \{0,1\}$ (was the final answer right?) and trace groundedness $G \in \{0,1\}$ (was the answer supported by the evidence the agent accessed?). This gives four regimes:

Classification of agent trajectories

$$(C,G) \in \{(1,1),\; (1,0),\; (0,1),\; (0,0)\}$$

• $(1,1)$: Correct & grounded
• $(1,0)$: Correct but ungrounded (evidence misalignment)
• $(0,1)$: Grounded but wrong
• $(0,0)$: Ungrounded & wrong
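As a minimal sketch of the four-regime bookkeeping, the snippet below maps each $(C, G)$ pair to its regime and reports the share of trajectories in each. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

# Regime names follow the (C, G) classification in the text.
REGIMES = {
    (1, 1): "correct & grounded",
    (1, 0): "correct but ungrounded (evidence misalignment)",
    (0, 1): "grounded but wrong",
    (0, 0): "ungrounded & wrong",
}


def classify(c: int, g: int) -> str:
    """Map binary correctness C and groundedness G to a regime name."""
    return REGIMES[(c, g)]


def regime_shares(trajectories):
    """Return the fraction of trajectories that fall in each regime.

    `trajectories` is an iterable of (C, G) pairs with values in {0, 1}.
    """
    trajectories = list(trajectories)
    counts = Counter(trajectories)
    n = len(trajectories)
    return {REGIMES[key]: counts.get(key, 0) / n for key in REGIMES}


# Toy data, made up for illustration only.
toy = [(1, 1), (1, 0), (1, 0), (0, 1), (0, 0)]
shares = regime_shares(toy)
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

A benchmark that reports only accuracy would merge the first two rows of this output into a single number.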

[Interactive demo: sliders for shortcut reliance (p, default 0.40) and base search accuracy (default 0.45); readouts for Correct & Grounded, Correct but Ungrounded, hallucination rate $H_t$, and overall accuracy.]

The danger isn’t wrong answers — it’s unverifiable right answers. When an agent scores well by accident, you can’t trust it to score well on purpose. Evidence misalignment is a silent failure mode: benchmarks miss it because they only check $C$, never $G$.


Chapter 2

Measuring Groundedness

To fix what you can’t see, you first need to measure it. The paper introduces two complementary diagnostics: temporal groundedness and semantic groundedness.

In plain English

Think of temporal groundedness like an alibi check: did the detective visit the crime scene during the investigation? Semantic groundedness is the courtroom test: can the detective actually explain how the evidence they collected leads to their conclusion? Both can fail independently — you can stand in the right room and still draw the wrong conclusion, or reason perfectly from evidence you never actually accessed.

The paper shows that existing agents (VideoAgent, DrVideo) do touch relevant temporal regions early on — temporal groundedness $G_t$ saturates fast. But semantic groundedness $G_s$ keeps dropping as the trace grows, meaning the agent ignores what it retrieved and falls back on plausible guessing.


Diagnostic I: Temporal Grounding

Equation 1 — Temporal groundedness

$$G_t := \mathbb{I}\!\left[\max_{\tau \in \mathcal{E}(\xi),\; \tau^* \in \mathcal{E}^*} \text{tIoU}(\tau, \tau^*) \geq \gamma\right]$$

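$\text{tIoU}$ is the temporal intersection-over-union between a retrieved span $\tau$ and a ground-truth interval $\tau^*$. $\mathcal{E}(\xi)$ denotes the spans retrieved along trace $\xi$, and $\mathcal{E}^*$ the ground-truth intervals. The threshold is $\gamma = 0.05$ (CG-Bench training mean).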
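Equation 1 can be sketched in a few lines. The version below assumes spans are `(start, end)` tuples in seconds; the helper names `tiou` and `temporal_groundedness` are hypothetical, and the example intervals are made up.

```python
def tiou(span, gt):
    """Temporal IoU of two (start, end) intervals."""
    (s1, e1), (s2, e2) = span, gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0


def temporal_groundedness(retrieved, ground_truth, gamma=0.05):
    """G_t = 1 if any retrieved span overlaps any ground-truth interval
    with tIoU >= gamma, else 0 (Equation 1)."""
    best = max(
        (tiou(t, g) for t in retrieved for g in ground_truth),
        default=0.0,  # an empty trace is treated as ungrounded
    )
    return int(best >= gamma)


# Made-up example: the evidence lives at 120-140 s.
gt = [(120, 140)]
print(temporal_groundedness([(100, 130), (600, 660)], gt))  # overlap found
print(temporal_groundedness([(0, 30)], gt))                 # no overlap
```

With $\gamma = 0.05$ the check is lenient: a span only needs to brush the ground-truth interval to count as temporally grounded, which is one reason $G_t$ alone cannot establish that the answer was derived from the evidence.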

Diagnostic II: Semantic Grounding

Equation 3 — Semantic groundedness

$$G_s := 1 - J_{\text{judge}}(q, \xi, \hat{a})$$

An LLM judge $J_{\text{judge}}$ audits whether the answer $\hat{a}$ is logically supported by trace $\xi$. $J_{\text{judge}}=1$ means the answer is unsupported (hallucination).
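The sketch below shows the shape of Equation 3, with the judge passed in as a plain callable. The real judge is an LLM; `keyword_judge` here is a deliberately crude stand-in (an assumption for illustration) that flags an answer as unsupported when the answer string never appears in the trace.

```python
from typing import Callable, Sequence

# A judge takes (question, trace, answer) and returns 1 if the answer
# is unsupported by the trace, 0 if it is supported.
Judge = Callable[[str, Sequence[str], str], int]


def semantic_groundedness(judge: Judge, question: str,
                          trace: Sequence[str], answer: str) -> int:
    """G_s = 1 - J_judge(q, trace, answer) (Equation 3)."""
    verdict = judge(question, trace, answer)
    if verdict not in (0, 1):
        raise ValueError("judge must return 0 or 1")
    return 1 - verdict


def keyword_judge(question: str, trace: Sequence[str], answer: str) -> int:
    """Toy stand-in for the LLM judge, not the paper's judge."""
    supported = any(answer.lower() in step.lower() for step in trace)
    return 0 if supported else 1


question = "What colour is the vlogger's car?"
good_trace = ["clip 12:30-12:45: a red car pulls into the driveway"]
bad_trace = ["clip 03:00-03:15: the vlogger cooks in a kitchen"]
print(semantic_groundedness(keyword_judge, question, good_trace, "red"))
print(semantic_groundedness(keyword_judge, question, bad_trace, "red"))
```

Keeping the judge behind a callable makes it easy to swap the toy rule for a real model call without touching the metric.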

Hallucination Rates

Equations 2 & 4 — Hallucination rates

$$H_t := P(G_t=0 \mid C=1) = \frac{\mathbb{E}[C \cdot (1-G_t)]}{\mathbb{E}[C]}$$

$$H_s := P(G_s=0 \mid C=1)$$

Both measure how often the agent gets the right answer without proper evidence — temporal ($H_t$) or semantic ($H_s$).
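Both rates are conditional frequencies over the correct answers, so they can be estimated from per-trajectory labels. A minimal sketch, assuming each record is a dict with binary fields `C`, `G_t`, and `G_s` (the field names and the five records are made up):

```python
def hallucination_rate(records, key):
    """Estimate P(G = 0 | C = 1) as sum(C * (1 - G)) / sum(C).

    `key` selects which groundedness label to use: "G_t" or "G_s".
    """
    numerator = sum(r["C"] * (1 - r[key]) for r in records)
    denominator = sum(r["C"] for r in records)
    return numerator / denominator if denominator else float("nan")


# Toy labels for five trajectories (illustrative only).
records = [
    {"C": 1, "G_t": 1, "G_s": 1},
    {"C": 1, "G_t": 1, "G_s": 0},
    {"C": 1, "G_t": 0, "G_s": 0},
    {"C": 0, "G_t": 1, "G_s": 0},
    {"C": 1, "G_t": 1, "G_s": 1},
]
h_t = hallucination_rate(records, "G_t")
h_s = hallucination_rate(records, "G_s")
print(f"H_t = {h_t:.2f}, H_s = {h_s:.2f}")  # H_t = 0.25, H_s = 0.50
```

The wrong answer (the fourth record) drops out of both rates because the conditioning is on $C = 1$.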

[Interactive demo: sliders for interaction steps (T, default 10) and agent type (default 50%, blended), plotting how temporal and semantic groundedness diverge with trace length.]

Temporal access and semantic support increasingly decouple as the trace grows. $G_t$ saturates early — agents do touch relevant regions. But $G_s$ degrades monotonically: the longer the trace, the less the agent’s final answer is supported by what it retrieved.


Chapter 3

Reward Pressure

Training longer doesn’t make the agent more grounded. Under outcome-only rewards, accuracy rises while groundedness stagnates — the gap between $C$ and $G_t$ widens with every optimisation step.

Equation 5 — Outcome-only terminal reward

$$R_{\text{ans}}(\xi) := \begin{cases} 1, & \text{if } \hat{a} = a^* \\ 0, & \text{otherwise} \end{cases}$$
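To see why this reward cannot distinguish grounded answers from lucky ones, consider the sketch below. The trajectory dicts and the `visited_evidence` flag are hypothetical; the point is only that `r_ans` never reads anything except the final answer.

```python
def r_ans(predicted: str, gold: str) -> int:
    """Outcome-only terminal reward (Equation 5)."""
    return 1 if predicted == gold else 0


# Two made-up trajectories with the same final answer.
grounded = {"answer": "B", "visited_evidence": True}
shortcut = {"answer": "B", "visited_evidence": False}

rewards = [r_ans(t["answer"], "B") for t in (grounded, shortcut)]
print(rewards)  # [1, 1]: the reward is identical for both
```

Because both trajectories earn the same reward, the optimiser has no signal that favours the one that actually found the evidence.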

[Interactive demo: accuracy and groundedness over training steps (default 150) for a selectable reward type (default: outcome-only).]

The agent gets better at scoring without getting better at reasoning. Under reward pressure, speculative completion becomes more efficient than evidence seeking — a shortcut that works until it catastrophically doesn’t.


Chapter 4

Prompt Pressure

Even a well-trained agent can be pushed into guessing by its own context window. As tool traces grow longer, the agent is increasingly prompted to commit — whether or not the trace justifies an answer.

In plain English

Think of a detective who has filled 47 pages of a notebook with witness statements, forensic reports, and surveillance footage descriptions. By page 48, the chief is tapping their watch. The detective knows they should keep investigating, but the sheer volume of accumulated material — most of it noise — creates an overwhelming pressure to just write a conclusion. So they pick the most plausible narrative from their prior experience and wrap the case.

Prompt pressure works the same way. The agent’s context fills with tool traces. Most traces are noisy, redundant, or irrelevant. The model shifts from evidence seeking to evidence fitting — cherry-picking from a messy trace to justify a plausible answer it already suspects is right.

[Interactive demo: slider for interaction steps (T, default 6), showing semantic hallucination climbing with trace length while temporal access barely improves.]

Searching longer does not yield more grounded answers. Longer traces don’t help; they hurt. The agent touches relevant content ($G_t$ saturates), but its final decision drifts toward plausible guesses ($G_s$ declines, $H_s$ rises).


Chapter 5

The Coupled Agent Trap

Both reward pressure and prompt pressure share a single structural root cause: the coupled agent paradigm, where one monolithic model handles searching, inspecting, and answering in the same context window.

In plain English

Imagine a courtroom where the prosecutor, the expert witness, and the judge are all the same person. They present evidence, evaluate it, and deliver the verdict — all while influenced by the same accumulated context. The structural incentive is to reach a verdict quickly, because the evidence pile keeps growing and the context keeps getting noisier. That’s the coupled agent: the planner inspects its own tool outputs, decides when to stop, and writes the final answer, all conditioned on the same noisy, ever-growing history.

The paper identifies this as the structural root cause. The entanglement of search, verification, and answer generation amplifies both prompt pressure (inference) and reward pressure (training). Decoupling these roles is the fix.


The coupled agent policy

$$(r_t, u_t) \sim \pi(\cdot \mid h_{t-1}, q)$$

A single policy $\pi$ handles evidence seeking, trace inspection, termination, and final answer generation — all conditioned on the same shared interaction history $h_{t-1}$.
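A coupled loop of this kind might look like the sketch below. Everything here is an illustrative assumption: `policy`, `call_tool`, the `("tool", ...)` / `("answer", ...)` action encoding, and the toy policy that commits once its history reaches two entries.

```python
def run_coupled(policy, call_tool, question, max_steps=8):
    """One policy searches, inspects, decides when to stop, and answers,
    all conditioned on the same growing history."""
    history = []  # shared interaction history h_{t-1}
    for _ in range(max_steps):
        rationale, action = policy(history, question)
        kind, payload = action
        if kind == "answer":
            # The searcher itself holds answer authority.
            return payload, history
        observation = call_tool(payload)
        history.append((rationale, action, observation))
    return None, history


def toy_policy(history, question):
    """Toy policy: commits to a prior-driven guess after two steps."""
    if len(history) >= 2:
        return "trace is getting long; commit", ("answer", "red")
    start = len(history) * 60
    return "look at the next minute", ("tool", (start, start + 60))


def toy_tool(span):
    """Toy tool: returns a caption that never mentions a car."""
    return f"caption for {span}: nothing relevant"


answer, history = run_coupled(toy_policy, toy_tool, "What colour is the car?")
print(answer, len(history))  # red 2
```

The toy run ends with an answer even though no observation in `history` mentions a car, which is the $(C=1, G=0)$ pattern whenever the guess happens to be right.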

Coupled design · Answer authority: Planner · Context shared: Yes · Pressure risk: High

When search and verdict share a brain, the verdict always comes too early. The coupled agent conflates evidence seeking with answer authority, creating a structural incentive to commit before the evidence warrants it.


Chapter 6

The Decoupled Fix

VideoSEAL splits the agent into a planner (evidence seeking) and an inspector (verification & answer authority). The planner keeps searching until the inspector approves the evidence.

In plain English

Back to the courtroom analogy. Now the prosecutor gathers evidence (planner), but a separate judge-inspector examines each piece of visual evidence and decides whether it’s sufficient to support a verdict. The judge doesn’t see the prosecutor’s reasoning notes — only the raw video evidence and the original question. If the evidence is insufficient, the judge sends the prosecutor back to keep searching. The verdict is only delivered when the evidence actually supports it.

This is the inspector gate. It sees only the video evidence $v_t$ and the query $q$ — no planner reasoning, no accumulated context. Its verdict $z_t \in \{0,1\}$ gates the final answer.


The Planner–Inspector Protocol

Decoupled agent interaction

$$(r_t, u_t) \sim P(\cdot \mid h_{t-1}, q), \quad (z_t, f_t) \sim I(\cdot \mid v_t, q)$$

Planner $P$ produces rationale $r_t$ and action $u_t$. Inspector $I$ evaluates evidence $v_t = \mathcal{E}(o_t)$, extracted from the tool observation $o_t$, and returns a sufficiency verdict $z_t$ and feedback $f_t$. The agent terminates only when $z_t = 1$.
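The protocol can be sketched as a loop in which the planner proposes spans and the inspector, seeing only the clip and the question, decides whether to stop. The sketch makes several assumptions that the text does not specify: the callables `planner`, `inspector`, and `fetch_clip` are hypothetical, the feedback $f_t$ is assumed to carry the final answer when $z_t = 1$, and the loop returns `None` if the turn budget runs out.

```python
def run_decoupled(planner, inspector, fetch_clip, question, max_turns=8):
    """Planner searches; inspector alone decides sufficiency and answers."""
    history = []  # planner-side history h_{t-1}
    for _ in range(max_turns):
        rationale, span = planner(history, question)  # (r_t, u_t)
        clip = fetch_clip(span)                       # evidence v_t
        # The inspector never sees `history` or `rationale`.
        verdict, feedback = inspector(clip, question)  # (z_t, f_t)
        history.append((rationale, span, feedback))
        if verdict == 1:
            return feedback, history  # assumption: f_t holds the answer
    return None, history  # assumption: no answer when budget is exhausted


# Toy components, made up for illustration.
CLIPS = {
    (0, 60): "street scene, pedestrians",
    (60, 120): "the vlogger cooks in a kitchen",
    (120, 180): "a red car parked outside the house",
}


def toy_planner(history, question):
    k = len(history)
    return f"scan minute {k}", (k * 60, (k + 1) * 60)


def toy_fetch(span):
    return CLIPS.get(span, "")


def toy_inspector(clip, question):
    if "car" in clip:
        return 1, "red"
    return 0, "no car visible in this clip"


answer, history = run_decoupled(
    toy_planner, toy_inspector, toy_fetch, "What colour is the car?"
)
print(answer, len(history))  # red 3
```

Because the inspector is just a callable with a fixed signature, replacing it with a stronger model requires no change to the planner side of the loop.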

Evidence-Gated Reward

Equations 6–7 — Evidence-gated terminal reward

$$g_{\text{evd}}(\xi) := \min\!\left(1,\; \frac{\max_{\tau \in \mathcal{E}(\xi)} \max_{\tau^* \in \mathcal{E}^*} \text{tIoU}(\tau, \tau^*)}{\gamma}\right)$$

$$R_{\text{evd}}(\xi) := R_{\text{ans}}(\xi) \cdot g_{\text{evd}}(\xi)$$
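A minimal sketch of Equations 6 and 7 follows, assuming spans are `(start, end)` tuples and reusing $\gamma = 0.05$ from the temporal diagnostic as the default. The function names and example intervals are illustrative, not the authors' code.

```python
def tiou(span, gt):
    """Temporal IoU of two (start, end) intervals."""
    (s1, e1), (s2, e2) = span, gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0


def g_evd(retrieved, ground_truth, gamma=0.05):
    """Evidence gate in [0, 1]: best tIoU scaled by gamma, capped at 1."""
    best = max(
        (tiou(t, g) for t in retrieved for g in ground_truth),
        default=0.0,
    )
    return min(1.0, best / gamma)


def r_evd(predicted, gold, retrieved, ground_truth, gamma=0.05):
    """Evidence-gated reward: outcome reward times the evidence gate."""
    r_ans = 1 if predicted == gold else 0
    return r_ans * g_evd(retrieved, ground_truth, gamma)


gt = [(120, 140)]
print(r_evd("B", "B", [(100, 130)], gt))  # good overlap: full reward
print(r_evd("B", "B", [(139, 180)], gt))  # 1 s overlap: partial reward
print(r_evd("B", "B", [(0, 30)], gt))     # no overlap: zero reward
print(r_evd("A", "B", [(100, 130)], gt))  # wrong answer: zero reward
```

The gate is soft below $\gamma$ and saturates above it, so a correct answer earns partial credit for a near miss and nothing when the trace never touches the evidence.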

[Interactive demo: planner–inspector loop with sliders for max search turns (K, default 8) and inspector strictness (default 0.50), showing the inspector approving or rejecting candidate evidence.]

Decoupling dominates reward design. Table 3 in the paper shows the decoupled agent with outcome-only reward ($R_{\text{ans}}$) still outperforms the coupled agent with evidence-gated reward ($R_{\text{evd}}$): 54.1% vs. 50.2%. The gain comes from architecture, not objective.


Chapter 7

Results & Scaling

Across four long-video benchmarks, the decoupled framework improves both accuracy and groundedness. It also scales monotonically with search budget and inspector capacity — unlike coupled baselines.

In plain English

If you give a coupled agent more turns to search, it actually gets worse. The context window fills up with noisy tool traces, the agent panics, and accuracy regresses past 8 steps. But give the decoupled agent more turns and accuracy keeps climbing: from 47.1% to 55.1% as $K$ goes from 4 to 16. The inspector gate prevents context overload from contaminating the answer.

Even more striking: you can hot-swap the inspector without retraining the planner. Swap the 7B inspector for a 72B one and LVBench jumps from 55.1% to 59.5%. The coupled agent, by contrast, gains only 1.1% from the same upgrade. Once the planner learns to search, better eyes are all you need.

[Interactive demo: sliders for search turns (K, default 16) and inspector model (default 7B), comparing coupled and decoupled scaling; readouts for decoupled accuracy, coupled accuracy, $G_s$ (semantic groundedness), and $H_s$ (semantic hallucination).]

The system is perception-bound, not reasoning-bound. Scaling the inspector from 7B to Gemini-3-Flash yields a 14.8-point gain (55.1% to 69.9%). Scaling the planner to GPT-4o actually hurts (52.3%). The bottleneck is visual fidelity, not planning sophistication.


Chapter 8

What This Means

The paper’s message extends beyond video QA: any agentic system that conflates search with answer authority risks evidence misalignment.

In plain English

The next time someone tells you their AI agent achieves 90% on some benchmark, ask the follow-up question: how often does it get the right answer without actually finding the evidence? If they can’t answer that, they’re measuring the wrong thing. VideoSEAL’s diagnostics — $G_t$ and $G_s$ — give you the tools to ask.

The architectural insight is straightforward: separate the entity that searches from the entity that decides. In software engineering, this is separation of concerns. In the paper’s framing, it’s decoupling answer authority. The planner does what LLMs are good at — navigating and querying. The inspector does what MLLMs are good at — visual verification. Neither does the other’s job.

The practical upshot: you can upgrade the inspector without retraining the planner. As vision models get better, your agent gets better — for free. That’s a property worth designing for.

Key Takeaways

1. Measure groundedness, not just accuracy. $H_t$ and $H_s$ reveal how often correct answers are not backed by the evidence the agent accessed. On LVBench, DrVideo’s semantic hallucination rate is 41.4%; VideoSEAL’s is 11.3%.

2. Architecture beats objective. Decoupling with a simple outcome-only reward outperforms coupling with an evidence-gated reward (54.1% vs 50.2%). The gain is structural.

3. Perception is the bottleneck. Scaling the inspector from 7B to Gemini-3-Flash yields a 14.8-point gain (55.1% to 69.9%). Scaling the planner to GPT-4o lowers accuracy to 52.3%. Invest in better eyes, not bigger brains.

4. Modularity is a feature. The planner, once trained, works with any inspector. As newer MLLMs ship, VideoSEAL improves automatically — Kimi2.5 as inspector pushes LVBench to 65.7%.

Build agents that earn their answers. The paper’s final line captures it: we hope this paradigm helps the community build more verifiable long-video agents and reduce unsupported guessing. The fix is simple — separate search from answer authority — but only once you’ve diagnosed the problem.
