An Interactive Reading of

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding

Decoupling Answer Authority via a Planner–Inspector Framework

The paper, in plain English

Ask an AI agent a question about a 45-minute video — “What colour is the vlogger’s car?” — and it might return the right answer without ever finding the clip where the car is actually visible. The agent leans on its training data and common-sense priors, not on what it saw. The authors call this evidence misalignment: the answer is correct, but the interaction trace — the chain of retrieved clips and tool outputs — cannot justify it.

The misalignment has two root causes. During training, reinforcement learning rewards correct answers without checking whether the agent looked at the right evidence (reward pressure). During inference, longer and longer tool traces push the agent toward committing to a plausible answer rather than continuing to search (prompt pressure). Both stem from a single structural flaw: one monolithic model handles searching, inspecting, and answering in the same context window.

The fix is architectural. Split the agent into a planner that navigates the video timeline and an inspector that verifies candidate clips and holds exclusive authority to output the final answer. Across four long-video benchmarks, this decoupled design improves LVBench accuracy from 48.2% to 55.1%, cuts semantic hallucination from 41.4% to 11.3%, and scales with larger search budgets — while the coupled baseline actually regresses when given more steps.

I. Evidence Misalignment
Agents can produce correct answers that their own interaction traces cannot support: right for the wrong reasons.

II. Dual Pressure Diagnosis
Reward pressure at training time and prompt pressure at inference time both push agents toward shortcut-driven answering.

III. Decoupled Architecture
Separate the planner (search) from the inspector (verify & answer), gating final output on pixel-level evidence.
Chapter 1

The Right Answer, Wrong Reasons

Long-video QA agents can score well on benchmarks while their interaction traces reveal they never actually found the relevant evidence. How is that possible?

The Four Regimes of Agent Behaviour

Every agent trajectory can be classified along two axes: outcome correctness $C \in \{0,1\}$ (was the final answer right?) and trace groundedness $G \in \{0,1\}$ (was the answer supported by the evidence the agent accessed?). This gives four regimes:

Classification of agent trajectories
$$(C,G) \in \{(1,1),\; (1,0),\; (0,1),\; (0,0)\}$$

$\bullet\;(1,1)$ Correct & grounded
$\bullet\;(1,0)$ Correct but ungrounded (evidence misalignment)
$\bullet\;(0,1)$ Grounded but wrong
$\bullet\;(0,0)$ Ungrounded & wrong
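
The four regimes can be tallied with a few lines of Python. This is a minimal sketch, not code from the paper: the names `Regime`, `classify`, and `regime_counts` are hypothetical, and the toy trajectories are invented for illustration.

```python
from collections import Counter
from enum import Enum


class Regime(Enum):
    """The four (C, G) regimes; each value is the (C, G) pair itself."""
    CORRECT_GROUNDED = (1, 1)
    CORRECT_UNGROUNDED = (1, 0)   # evidence misalignment
    GROUNDED_WRONG = (0, 1)
    UNGROUNDED_WRONG = (0, 0)


def classify(correct: bool, grounded: bool) -> Regime:
    """Map outcome correctness C and trace groundedness G to a regime."""
    return Regime((int(correct), int(grounded)))


def regime_counts(trajectories):
    """Count regimes over an iterable of (correct, grounded) pairs."""
    return Counter(classify(c, g) for c, g in trajectories)


# Toy data, invented for illustration only.
toy = [(True, True), (True, False), (True, False), (False, True), (False, False)]
counts = regime_counts(toy)
accuracy = sum(c for c, _ in toy) / len(toy)

print(counts[Regime.CORRECT_UNGROUNDED], accuracy)
```

In this toy set, three of five answers are correct, yet two of those three sit in the $(1,0)$ cell. An accuracy-only benchmark would report 60% and hide that split.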

[Interactive figure: quadrant chart of the four regimes, with readouts for the correct-and-grounded share, the correct-but-ungrounded share, the hallucination rate $H_t$, and overall accuracy.]

The danger isn’t wrong answers — it’s unverifiable right answers. When an agent scores well by accident, you can’t trust it to score well on purpose. Evidence misalignment is a silent failure mode: benchmarks miss it because they only check $C$, never $G$.

How do we measure groundedness?
Chapter 2

Measuring Groundedness

To fix what you can’t see, you first need to measure it. The paper introduces two complementary diagnostics: temporal groundedness and semantic groundedness.

Diagnostic I: Temporal Grounding

Equation 1 — Temporal groundedness
$$G_t := \mathbb{I}\!\left[\max_{\tau \in \mathcal{E}(\xi),\; \tau^* \in \mathcal{E}^*} \text{tIoU}(\tau, \tau^*) \geq \gamma\right]$$

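Here $\text{tIoU}$ is the temporal intersection-over-union between a retrieved span $\tau \in \mathcal{E}(\xi)$ and a ground-truth interval $\tau^* \in \mathcal{E}^*$, and $\mathbb{I}[\cdot]$ is the indicator function. So $G_t = 1$ as soon as any retrieved span overlaps any ground-truth interval with tIoU of at least $\gamma$. The threshold is $\gamma = 0.05$ (CG-Bench training mean).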
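
As a rough illustration of Equation 1, the sketch below computes tIoU for spans given as `(start, end)` pairs in seconds and thresholds the best overlap at $\gamma$. The span representation and the function names are assumptions made for this example; the paper's own implementation may differ.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def tiou(a: Span, b: Span) -> float:
    """Temporal intersection-over-union of two intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_tiou(retrieved: Sequence[Span], gold: Sequence[Span]) -> float:
    """Max tIoU over all (retrieved, ground-truth) pairs; 0.0 if either is empty."""
    return max((tiou(r, g) for r in retrieved for g in gold), default=0.0)


def temporal_groundedness(retrieved: Sequence[Span], gold: Sequence[Span],
                          gamma: float = 0.05) -> int:
    """G_t = 1 if the best overlap reaches the threshold gamma, else 0."""
    return int(best_tiou(retrieved, gold) >= gamma)


gold = [(150.0, 170.0)]
# A one-minute clip that overlaps the gold interval: tIoU = 10 / 70, about 0.143.
print(temporal_groundedness([(100.0, 160.0)], gold))
# A ten-minute clip that merely contains it: tIoU = 20 / 600, about 0.033.
print(temporal_groundedness([(0.0, 600.0)], gold))
```

The second call shows why the measure uses IoU rather than plain overlap: retrieving a huge span that happens to contain the evidence does not count as grounded at $\gamma = 0.05$.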

Diagnostic II: Semantic Grounding

Equation 3 — Semantic groundedness
$$G_s := 1 - J_{\text{judge}}(q, \xi, \hat{a})$$

An LLM judge $J_{\text{judge}}$ audits whether the answer $\hat{a}$ is logically supported by trace $\xi$. $J_{\text{judge}}=1$ means the answer is unsupported (hallucination).

Hallucination Rates

Equations 2 & 4 — Hallucination rates
$$H_t := P(G_t=0 \mid C=1) = \frac{\mathbb{E}[C \cdot (1-G_t)]}{\mathbb{E}[C]}$$
$$H_s := P(G_s=0 \mid C=1)$$

Both measure how often the agent gets the right answer without proper evidence — temporal ($H_t$) or semantic ($H_s$).
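
Both rates reduce to the same empirical estimate: among correct answers, the fraction that are ungrounded. Here is a minimal sketch, assuming per-example 0/1 labels are already available. The function name and the toy labels are hypothetical, and the judge flags are hand-written stand-ins for $J_{\text{judge}}$ outputs rather than real LLM verdicts.

```python
def hallucination_rate(correct, grounded):
    """Estimate H = P(G = 0 | C = 1) from parallel sequences of 0/1 labels."""
    n_correct = sum(correct)
    if n_correct == 0:
        return float("nan")  # undefined when no answer is correct
    ungrounded_correct = sum(c * (1 - g) for c, g in zip(correct, grounded))
    return ungrounded_correct / n_correct


# Toy labels for six trajectories, invented for illustration only.
correct = [1, 1, 1, 1, 0, 0]
g_t = [1, 1, 1, 0, 1, 0]             # temporal groundedness per trajectory
judge_flags = [0, 1, 1, 0, 1, 1]     # stand-in for J_judge (1 = unsupported)
g_s = [1 - j for j in judge_flags]   # G_s = 1 - J_judge

h_t = hallucination_rate(correct, g_t)   # 1 of 4 correct answers -> 0.25
h_s = hallucination_rate(correct, g_s)   # 2 of 4 correct answers -> 0.5
print(h_t, h_s)
```

Trajectories with wrong answers never enter either the numerator or the denominator, which is what conditioning on $C=1$ means in practice.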

[Interactive figure: temporal groundedness $G_t$ and semantic groundedness $G_s$ as a function of trace length.]

Temporal access and semantic support increasingly decouple as the trace grows. $G_t$ saturates early — agents do touch relevant regions. But $G_s$ degrades monotonically: the longer the trace, the less the agent’s final answer is supported by what it retrieved.

What causes this during training?
Chapter 3

Reward Pressure

Training longer doesn’t make the agent more grounded. Under outcome-only rewards, accuracy rises while groundedness stagnates — the gap between $C$ and $G_t$ widens with every optimisation step.

Equation 5 — Outcome-only terminal reward
$$R_{\text{ans}}(\xi) := \begin{cases} 1, & \text{if } \hat{a} = a^* \\ 0, & \text{otherwise} \end{cases}$$

[Interactive figure: accuracy and groundedness over training steps under outcome-only reward.]

The agent gets better at scoring without getting better at reasoning. Under reward pressure, speculative completion becomes more efficient than evidence seeking — a shortcut that works until it catastrophically doesn’t.

What about inference-time failures?
Chapter 4

Prompt Pressure

Even a well-trained agent can be pushed into guessing by its own context window. As tool traces grow longer, the agent is increasingly prompted to commit — whether or not the trace justifies an answer.

[Interactive figure: prompt pressure as a function of trace length.]

Searching for longer does not yield more grounded answers. Longer traces do not help; they hurt. The agent touches relevant content ($G_t$ saturates), but its final decision drifts toward plausible guesses ($G_s$ declines, $H_s$ rises).

What structural flaw causes both pressures?
Chapter 5

The Coupled Agent Trap

Both reward pressure and prompt pressure share a single structural root cause: the coupled agent paradigm, where one monolithic model handles searching, inspecting, and answering in the same context window.

The coupled agent policy
$$(r_t, u_t) \sim \pi(\cdot \mid h_{t-1}, q)$$

A single policy $\pi$ handles evidence seeking, trace inspection, termination, and final answer generation — all conditioned on the same shared interaction history $h_{t-1}$.
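
To make the coupling concrete, here is a schematic loop in which one policy both searches and decides when to answer. It is a sketch under stated assumptions, not the paper's agent: `Action`, `run_coupled`, the string-based history, and the toy policy and tool are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str      # "search" or "answer" (an assumed two-way action space)
    payload: str   # a tool query, or the final answer text


Policy = Callable[[List[str], str], Tuple[str, Action]]


def run_coupled(policy: Policy, tool: Callable[[str], str], question: str,
                max_steps: int = 8) -> Tuple[Optional[str], List[str]]:
    """One policy sees the whole history and also decides when to answer."""
    history: List[str] = []
    for _ in range(max_steps):
        rationale, action = policy(history, question)   # (r_t, u_t) ~ pi(. | h_{t-1}, q)
        history.append(f"thought: {rationale}")
        if action.kind == "answer":                     # same model ends the search
            return action.payload, history
        history.append(f"obs: {tool(action.payload)}")
    return None, history


def toy_policy(history: List[str], question: str) -> Tuple[str, Action]:
    """Stub that commits after two observations, whatever they contained."""
    n_obs = sum(h.startswith("obs:") for h in history)
    if n_obs >= 2:
        return "context is long enough, commit", Action("answer", "red")
    return f"look at segment {n_obs}", Action("search", f"segment-{n_obs}")


def toy_tool(query: str) -> str:
    return f"caption for {query}"


answer, history = run_coupled(toy_policy, toy_tool, "What colour is the vlogger's car?")
print(answer, len(history))
```

Nothing in this loop checks the observations before the answer is released. The stub policy exaggerates the point, but the structure is the same for any policy that owns both the search and the verdict.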

Answer Authority: Planner
Context Shared? Yes
Pressure Risk: High

When search and verdict share a brain, the verdict always comes too early. The coupled agent conflates evidence seeking with answer authority, creating a structural incentive to commit before the evidence warrants it.

How does decoupling fix this?
Chapter 6

The Decoupled Fix

VideoSEAL splits the agent into a planner (evidence seeking) and an inspector (verification & answer authority). The planner keeps searching until the inspector approves the evidence.

The Planner–Inspector Protocol

Decoupled agent interaction
$$(r_t, u_t) \sim P(\cdot \mid h_{t-1}, q), \quad (z_t, f_t) \sim I(\cdot \mid v_t, q)$$

Planner $P$ produces rationale and action. Inspector $I$ evaluates evidence $v_t = \mathcal{E}(o_t)$ and returns sufficiency verdict $z_t$ and feedback $f_t$. The agent terminates only when $z_t = 1$.
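
The protocol can be sketched as a loop in which the planner never emits an answer and the inspector never sees the planner's history. Everything here is illustrative: `Verdict`, `run_decoupled`, the toy video dictionary, and the stubs are hypothetical names. Feeding the inspector's feedback back into the planner's history and returning no answer when the step budget runs out are assumptions of this sketch, not details confirmed by the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Verdict:
    sufficient: bool               # z_t: is the evidence enough to answer?
    feedback: str                  # f_t: hint passed back to the planner
    answer: Optional[str] = None   # set only when sufficient is True


def run_decoupled(planner, inspector, tool, question, max_steps=8):
    """Planner searches; only the inspector may release a final answer."""
    history: List[Tuple[str, str, str]] = []   # (rationale, action, feedback)
    for _ in range(max_steps):
        rationale, action = planner(history, question)   # (r_t, u_t) ~ P(. | h_{t-1}, q)
        evidence = tool(action)                          # v_t, extracted from o_t
        verdict = inspector(evidence, question)          # (z_t, f_t) ~ I(. | v_t, q)
        history.append((rationale, action, verdict.feedback))
        if verdict.sufficient:                           # stop only when z_t = 1
            return verdict.answer, history
    return None, history                                 # assumed: budget exhausted


# Toy stand-ins, invented for illustration only.
VIDEO = {
    "seg-0": "kitchen, person cooking",
    "seg-1": "street, crowd walking",
    "seg-2": "driveway, red car parked",
}


def toy_planner(history, question):
    idx = len(history)                    # visit segments in order
    return f"inspect seg-{idx} next", f"seg-{idx}"


def toy_tool(action):
    return VIDEO.get(action, "")


def toy_inspector(evidence, question):
    if "car" in evidence:                 # stub check; a real inspector reads pixels
        return Verdict(True, "car is visible", answer="red")
    return Verdict(False, "no car visible, keep searching")


answer, history = run_decoupled(toy_planner, toy_inspector, toy_tool,
                                "What colour is the vlogger's car?")
print(answer, len(history))
```

Note that `toy_inspector` receives only the evidence and the question. However long the planner's history grows, it cannot pressure the component that holds answer authority.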

Evidence-Gated Reward

Equations 6–7 — Evidence-gated terminal reward
$$g_{\text{evd}}(\xi) := \min\!\left(1,\; \frac{\max_{\tau \in \mathcal{E}(\xi)} \max_{\tau^* \in \mathcal{E}^*} \text{tIoU}(\tau, \tau^*)}{\gamma}\right)$$
$$R_{\text{evd}}(\xi) := R_{\text{ans}}(\xi) \cdot g_{\text{evd}}(\xi)$$
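
A small numeric sketch of Equations 5 to 7 follows. The function names are hypothetical, exact string match stands in for the answer check, and the best tIoU over the trace is passed in as a precomputed number.

```python
def outcome_reward(pred: str, gold: str) -> float:
    """R_ans: 1 if the final answer matches the reference, else 0."""
    return 1.0 if pred == gold else 0.0


def evidence_gate(best_tiou: float, gamma: float = 0.05) -> float:
    """g_evd: best tIoU over the trace, scaled by gamma and capped at 1."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return min(1.0, best_tiou / gamma)


def evidence_gated_reward(pred: str, gold: str, best_tiou: float,
                          gamma: float = 0.05) -> float:
    """R_evd = R_ans * g_evd."""
    return outcome_reward(pred, gold) * evidence_gate(best_tiou, gamma)


# Correct answer, evidence well localised: the gate saturates at 1.
print(evidence_gated_reward("red", "red", best_tiou=0.20))
# Correct answer, weak overlap (0.02 / 0.05): reward is scaled to about 0.4.
print(evidence_gated_reward("red", "red", best_tiou=0.02))
# Correct answer, no overlap at all: the lucky guess earns nothing.
print(evidence_gated_reward("red", "red", best_tiou=0.0))
# Wrong answer: zero regardless of how good the evidence was.
print(evidence_gated_reward("blue", "red", best_tiou=0.20))
```

The gate is soft below $\gamma$ and flat above it, so the reward pays for reaching the evidence but gives no extra credit for overlap beyond the threshold.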

[Interactive figure: the planner–inspector loop converging as the inspector approves or rejects candidate evidence.]

Decoupling dominates reward design. Table 3 in the paper shows the decoupled agent with outcome-only reward ($R_{\text{ans}}$) still outperforms the coupled agent with evidence-gated reward ($R_{\text{evd}}$): 54.1% vs. 50.2%. The gain comes from architecture, not objective.

How does this perform at scale?
Chapter 7

Results & Scaling

Across four long-video benchmarks, the decoupled framework improves both accuracy and groundedness. It also scales monotonically with search budget and inspector capacity — unlike coupled baselines.

[Interactive figure: coupled vs. decoupled accuracy, semantic groundedness $G_s$, and semantic hallucination rate $H_s$ as search budget and inspector size vary.]

The system is perception-bound, not reasoning-bound. Scaling the inspector from 7B to Gemini-3-Flash yields a 14.8-point gain (55.1% to 69.9%). Swapping in GPT-4o as the planner instead lowers accuracy to 52.3%. The bottleneck is visual fidelity, not planning sophistication.

What does this mean for the field?
Chapter 8

What This Means

The paper’s message extends beyond video QA: any agentic system that conflates search with answer authority risks evidence misalignment.

Key Takeaways

1. Measure groundedness, not just accuracy. $H_t$ and $H_s$ reveal how often right answers come from unverified evidence. On LVBench, DrVideo’s semantic hallucination rate is 41.4%; VideoSEAL’s is 11.3%.

2. Architecture beats objective. Decoupling with a simple outcome-only reward outperforms coupling with an evidence-gated reward (54.1% vs 50.2%). The gain is structural.

3. Perception is the bottleneck. Scaling the inspector from 7B to Gemini-3-Flash yields a 14.8-point gain. Scaling the planner to GPT-4o lowers accuracy to 52.3%. Invest in better eyes, not bigger brains.

4. Modularity is a feature. The planner, once trained, works with any inspector. As newer MLLMs ship, VideoSEAL can improve simply by swapping in a stronger inspector: Kimi2.5 as inspector pushes LVBench to 65.7%.

Build agents that earn their answers. The paper’s final line captures it: we hope this paradigm helps the community build more verifiable long-video agents and reduce unsupported guessing. The fix is simple — separate search from answer authority — but only once you’ve diagnosed the problem.