VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding
Decoupling Answer Authority via a Planner–Inspector Framework
Chenhao Qiu, Yechao Zhang, Xin Luo, Shien Song, Xusheng Liu · Mango TV · NTU Singapore · May 2026 · arXiv:2605.12571
The paper, in plain English
Ask an AI agent a question about a 45-minute video — “What colour is the vlogger’s car?” — and it might return the right answer without ever finding the clip where the car is actually visible. The agent leans on its training data and common-sense priors, not on what it saw. The authors call this evidence misalignment: the answer is correct, but the interaction trace — the chain of retrieved clips and tool outputs — cannot justify it.
The misalignment has two root causes. During training, reinforcement learning rewards correct answers without checking whether the agent looked at the right evidence (reward pressure). During inference, longer and longer tool traces push the agent toward committing to a plausible answer rather than continuing to search (prompt pressure). Both stem from a single structural flaw: one monolithic model handles searching, inspecting, and answering in the same context window.
The fix is architectural. Split the agent into a planner that navigates the video timeline and an inspector that verifies candidate clips and holds exclusive authority to output the final answer. Across four long-video benchmarks, this decoupled design improves LVBench accuracy from 48.2% to 55.1%, cuts semantic hallucination from 41.4% to 11.3%, and scales with larger search budgets — while the coupled baseline actually regresses when given more steps.
I. Evidence Misalignment
Agents can produce correct answers that their own interaction traces cannot support: right for the wrong reasons.
II. Dual Pressure Diagnosis
Reward pressure at training time and prompt pressure at inference time both push agents toward shortcut-driven answering.
III. Decoupled Architecture
Separate the planner (search) from the inspector (verify & answer), gating final output on pixel-level evidence.
Chapter 1
The Right Answer, Wrong Reasons
Long-video QA agents can score well on benchmarks while their interaction traces reveal they never actually found the relevant evidence. How is that possible?
The Four Regimes of Agent Behaviour
Every agent trajectory can be classified along two axes: outcome correctness $C \in \{0,1\}$ (was the final answer right?) and trace groundedness $G \in \{0,1\}$ (was the answer supported by the evidence the agent accessed?). This gives four regimes:
$\bullet\;(1,1)$ Correct & grounded
$\bullet\;(1,0)$ Correct but ungrounded (evidence misalignment)
$\bullet\;(0,1)$ Grounded but wrong
$\bullet\;(0,0)$ Ungrounded & wrong
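The four regimes above can be tallied with a few lines of Python. This is a minimal sketch: the function names and the toy `records` list are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter

# Map (C, G) pairs to the four regimes named in the text.
REGIMES = {
    (1, 1): "correct & grounded",
    (1, 0): "correct but ungrounded (evidence misalignment)",
    (0, 1): "grounded but wrong",
    (0, 0): "ungrounded & wrong",
}

def classify(c: int, g: int) -> str:
    """Return the regime label for outcome correctness c and groundedness g."""
    return REGIMES[(c, g)]

def regime_counts(records):
    """Count how many (C, G) trajectories fall into each regime."""
    return Counter(classify(c, g) for c, g in records)

# Hypothetical trajectories, made up for illustration only.
records = [(1, 1), (1, 0), (1, 0), (0, 1), (0, 0)]
counts = regime_counts(records)
for label, n in counts.items():
    print(f"{label}: {n}")
```

A benchmark that only checks $C$ would score the first three toy trajectories identically, which is exactly why the $(1,0)$ cell goes unnoticed.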
[Interactive figure: quadrant chart of the four regimes, with readouts for Correct & Grounded, Correct but Ungrounded, hallucination rate $H_t$, and overall accuracy.]
The danger isn’t wrong answers — it’s unverifiable right answers. When an agent scores well by accident, you can’t trust it to score well on purpose. Evidence misalignment is a silent failure mode: benchmarks miss it because they only check $C$, never $G$.
To fix what you can’t see, you first need to measure it. The paper introduces two complementary diagnostics: temporal groundedness and semantic groundedness.
Diagnostic I: Temporal Grounding
Temporal groundedness $G_t$ is assessed with $\text{tIoU}$, the temporal intersection-over-union between a retrieved span $\tau$ and the ground-truth interval $\tau^*$. The threshold is $\gamma = 0.05$ (CG-Bench training mean).
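To make the temporal diagnostic concrete, here is a small sketch of tIoU on `(start, end)` intervals in seconds. The thresholded indicator `temporally_grounded`, and in particular the choice to count a trace as grounded when any retrieved span clears $\gamma$, is an assumption for illustration; the paper's exact aggregation over spans is not reproduced here.

```python
GAMMA = 0.05  # threshold quoted in the text

def tiou(span, gt):
    """Temporal IoU of two (start, end) intervals, in seconds."""
    s0, e0 = span
    s1, e1 = gt
    inter = max(0.0, min(e0, e1) - max(s0, s1))
    union = (e0 - s0) + (e1 - s1) - inter
    return inter / union if union > 0 else 0.0

def temporally_grounded(spans, gt, gamma=GAMMA):
    """ASSUMPTION: grounded (1) if any retrieved span reaches tIoU >= gamma."""
    return int(any(tiou(s, gt) >= gamma for s in spans))

# Hypothetical retrieved spans and ground-truth interval.
print(tiou((0, 10), (5, 15)))                              # overlap 5s, union 15s
print(temporally_grounded([(0, 10), (100, 110)], (5, 15)))
print(temporally_grounded([(100, 110)], (5, 15)))
```

With a threshold as low as 0.05, even a loose overlap counts as touching the relevant region, so $G_t$ is a test of temporal access rather than of whether the answer follows from what was retrieved.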
Diagnostic II: Semantic Grounding
Equation 3 — Semantic groundedness
$$G_s := 1 - J_{\text{judge}}(q, \xi, \hat{a})$$
An LLM judge $J_{\text{judge}}$ audits whether the answer $\hat{a}$ is logically supported by trace $\xi$. $J_{\text{judge}}=1$ means the answer is unsupported (hallucination).
The corresponding hallucination rates, $H_t$ (temporal) and $H_s$ (semantic), both measure how often the agent gets the right answer without proper evidence.
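A rough sketch of how these quantities could be computed from logged trajectories follows. The judge is stubbed out as a precomputed 0/1 flag, and the normalisation of the hallucination rate (share of correct answers that are ungrounded) is an assumption; the paper may normalise differently.

```python
def semantic_groundedness(judge_flag: int) -> int:
    """G_s = 1 - J_judge, where judge_flag = 1 means 'unsupported'."""
    return 1 - judge_flag

def hallucination_rate(records):
    """Share of correct answers that are ungrounded.

    records: iterable of (C, G) pairs with values in {0, 1}.
    ASSUMPTION: the rate is conditioned on C = 1; this normalisation
    is chosen for illustration and is not taken from the paper.
    """
    grounded_flags = [g for c, g in records if c == 1]
    if not grounded_flags:
        return 0.0
    return sum(1 - g for g in grounded_flags) / len(grounded_flags)

# Hypothetical (C, G_s) pairs: three correct answers, two of them unsupported.
records = [(1, 1), (1, 0), (1, 0), (0, 1)]
print(semantic_groundedness(1))     # judge says unsupported, so G_s = 0
print(hallucination_rate(records))  # 2 of 3 correct answers are ungrounded
```

The same function yields $H_t$ or $H_s$ depending on whether the second element of each pair is $G_t$ or $G_s$.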
[Interactive figure: temporal vs. semantic groundedness as trace length varies.]
Temporal access and semantic support increasingly decouple as the trace grows. $G_t$ saturates early — agents do touch relevant regions. But $G_s$ degrades monotonically: the longer the trace, the less the agent’s final answer is supported by what it retrieved.
Training longer doesn’t make the agent more grounded. Under outcome-only rewards, accuracy rises while groundedness stagnates — the gap between $C$ and $G_t$ widens with every optimisation step.
[Interactive figure: accuracy and groundedness over training steps under outcome-only reward.]
The agent gets better at scoring without getting better at reasoning. Under reward pressure, speculative completion becomes more efficient than evidence seeking — a shortcut that works until it catastrophically doesn’t.
Even a well-trained agent can be pushed into guessing by its own context window. As tool traces grow longer, the agent is increasingly prompted to commit — whether or not the trace justifies an answer.
[Interactive figure: prompt pressure as a function of trace length.]
Searching longer does not yield more grounded answers. Longer traces don’t help; they hurt. The agent touches relevant content ($G_t$ saturates) but its final decision drifts toward plausible guesses ($G_s$ declines, $H_s$ rises).
Both reward pressure and prompt pressure share a single structural root cause: the coupled agent paradigm, where one monolithic model handles searching, inspecting, and answering in the same context window.
The coupled agent policy
$$(r_t, u_t) \sim \pi(\cdot \mid h_{t-1}, q)$$
A single policy $\pi$ handles evidence seeking, trace inspection, termination, and final answer generation — all conditioned on the same shared interaction history $h_{t-1}$.
Answer Authority: Planner
Context Shared?: Yes
Pressure Risk: High
When search and verdict share a brain, the verdict always comes too early. The coupled agent conflates evidence seeking with answer authority, creating a structural incentive to commit before the evidence warrants it.
VideoSEAL splits the agent into a planner (evidence seeking) and an inspector (verification & answer authority). The planner keeps searching until the inspector approves the evidence.
Planner $P$ produces rationale and action. Inspector $I$ evaluates evidence $v_t = \mathcal{E}(o_t)$ and returns sufficiency verdict $z_t$ and feedback $f_t$. The agent terminates only when $z_t = 1$.
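The control flow described above can be sketched as a short loop. Everything here is a toy stand-in: the `Verdict` type, the `run_decoupled` function, the caption-based "video", and the step budget that returns no answer when exhausted are all assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    sufficient: bool              # z_t: is the evidence enough to answer?
    feedback: str                 # f_t: hint passed back to the planner
    answer: Optional[str] = None  # set only by the inspector

def run_decoupled(question, planner, inspector, extract, max_steps=8):
    """Loop until the inspector approves; only the inspector can answer."""
    feedback_log = []
    for step in range(1, max_steps + 1):
        span = planner(question, feedback_log)   # planner picks where to look
        evidence = extract(span)                 # v_t = E(o_t)
        verdict = inspector(question, evidence)  # returns (z_t, f_t)
        if verdict.sufficient:
            return verdict.answer, step
        feedback_log.append(verdict.feedback)
    return None, max_steps  # ASSUMPTION: abstain when the budget runs out

# Toy video: spans (in seconds) mapped to made-up captions.
VIDEO = {
    (0, 60): "vlogger talks to camera",
    (60, 120): "street scene",
    (120, 180): "a red car is parked outside",
}
SPANS = list(VIDEO)

def toy_planner(question, feedback_log):
    # Scan spans in order, advancing once per rejection.
    return SPANS[len(feedback_log) % len(SPANS)]

def toy_inspector(question, evidence):
    if "car" in evidence:
        return Verdict(True, "evidence sufficient", answer="red")
    return Verdict(False, "no car visible in this span")

answer, steps = run_decoupled(
    "What colour is the vlogger's car?", toy_planner, toy_inspector, VIDEO.get
)
print(answer, steps)
```

The design point is that `planner` never returns an answer: it cannot commit early, because the only exit from the loop with an answer runs through the inspector's verdict.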
[Interactive figure: the planner–inspector loop, with the inspector approving or rejecting candidate evidence.]
Decoupling dominates reward design. Table 3 in the paper shows the decoupled agent with outcome-only reward ($R_{\text{ans}}$) still outperforms the coupled agent with evidence-gated reward ($R_{\text{evd}}$): 54.1% vs. 50.2%. The gain comes from architecture, not objective.
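For intuition about the two objectives being compared, the sketch below contrasts an outcome-only reward with an evidence-gated one. The exact form of $R_{\text{evd}}$ is not given in this summary; gating the outcome reward on a binary groundedness flag is an assumption made purely for illustration.

```python
def reward_outcome_only(correct: bool) -> float:
    """R_ans: pays for a correct answer regardless of the evidence."""
    return 1.0 if correct else 0.0

def reward_evidence_gated(correct: bool, grounded: bool) -> float:
    """ASSUMED form of R_evd: pays only if the answer is correct AND grounded."""
    return 1.0 if (correct and grounded) else 0.0

# A correct but ungrounded trajectory, i.e. the (1, 0) regime.
print(reward_outcome_only(True))           # 1.0: the shortcut is rewarded
print(reward_evidence_gated(True, False))  # 0.0: the shortcut is not rewarded
```

Under this reading, the result quoted above says that changing who holds answer authority helps more than changing which of these two functions is optimised.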
Across four long-video benchmarks, the decoupled framework improves both accuracy and groundedness. It also scales monotonically with search budget and inspector capacity — unlike coupled baselines.
[Interactive figure: coupled vs. decoupled scaling with search budget and inspector size, with readouts for decoupled accuracy, coupled accuracy, $G_s$ (semantic groundedness), and $H_s$ (semantic hallucination).]
The system is perception-bound, not reasoning-bound. Scaling the inspector from 7B to Gemini-3-Flash yields a 14.8-point gain (55.1% to 69.9%). Scaling the planner to GPT-4o actually hurts (52.3%). The bottleneck is visual fidelity, not planning sophistication.
The paper’s message extends beyond video QA: any agentic system that conflates search with answer authority risks evidence misalignment.
Key Takeaways
1. Measure groundedness, not just accuracy. $H_t$ and $H_s$ reveal how often right answers come from unverified evidence. On LVBench, DrVideo’s semantic hallucination rate is 41.4%; VideoSEAL’s is 11.3%.
2. Architecture beats objective. Decoupling with a simple outcome-only reward outperforms coupling with an evidence-gated reward (54.1% vs 50.2%). The gain is structural.
3. Perception is the bottleneck. Scaling the inspector from 7B to Gemini-3-Flash yields a 14.8-point gain. Scaling the planner to GPT-4o regresses. Invest in better eyes, not bigger brains.
4. Modularity is a feature. The planner, once trained, works with any inspector. As newer MLLMs ship, VideoSEAL improves automatically — Kimi2.5 as inspector pushes LVBench to 65.7%.
Build agents that earn their answers. The paper’s final line captures it: we hope this paradigm helps the community build more verifiable long-video agents and reduce unsupported guessing. The fix is simple — separate search from answer authority — but only once you’ve diagnosed the problem.