An Interactive Reading of

LatentOmni: Rethinking Omni-Modal Understanding

Unified Audio-Visual Latent Reasoning

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li,
Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu,
Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei,
Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang
Shanghai Jiao Tong University · Kuaishou · Peking University · May 2026 · arXiv:2605.22012

The paper, in plain English

When you ask a multimodal AI to reason about a video — "What sound played when the ball bounced?" — it typically thinks in text. It compresses the rich audio waveform and the 30-frame-per-second video into English sentences, then reasons over those sentences. That compression throws away the exact timing cues the model needs to answer correctly. LatentOmni asks a simple question: what if the model could think directly in the continuous sensory signal instead of forcing everything through English first?

The answer is a framework that interleaves normal English reasoning with "latent reasoning phases" — stretches where the model generates continuous vectors that stay grounded in the original audio-visual features rather than collapsing into words. A new position embedding system (OSPE) keeps audio and video temporally synchronized during these latent phases, and a feature-level supervision loss forces each latent vector to stay close to the actual sensory evidence it represents. The result is a model that attends to the original video and audio 2–3× more than a text-only baseline.

On four benchmarks spanning everyday events, physical commonsense, fine-grained audio typing, and long-form video understanding, LatentOmni achieves the best results among all evaluated open-source models, outperforming even specialized latent-reasoning methods on vision-only tasks. On OmniVideoBench it improves over the base Qwen2.5-Omni-7B by +6.1 percentage points — a 21% relative gain — confirming that preserving dense sensory evidence during reasoning is not just theoretically cleaner, but practically decisive.

I

Interleaved Latent Reasoning

The model alternates between textual deduction and continuous latent states that carry dense audio-visual evidence — no more compressing everything into words.

II

Feature-Level Supervision

Each latent vector is trained to stay close to its source sensory features via MSE alignment ($\mathcal{L}_\text{latent}$) and temporal synchronization ($\mathcal{L}_\text{sync}$).

III

Omni-Sync Position Embedding

OSPE extends time-aligned RoPE to latent space, keeping audio and visual features that correspond to the same moment positionally aligned.

~ 25 minutes · 9 chapters · 6 interactive simulations

Chapter 1

The Text Bottleneck

Multimodal models can see and hear. So why do they reason only in English?

In plain English

Imagine watching a security camera feed with audio. Someone throws a ball; it hits a wall with a thud. If you had to describe what happened, you'd write "ball hits wall, thud sound." Now ask: did the thud happen before or after the bounce? Your English description has already collapsed that timing. The words carry no frame-level precision.

That is exactly what happens inside current multimodal AI models. They convert the rich audio waveform and video frames into discrete text tokens before reasoning. This "textual bottleneck" throws away the fine-grained temporal information that cross-modal questions require. The model ends up reasoning over a lossy summary of the evidence, not the evidence itself.

Drag the slider in the simulation below to see how much attention the model pays to the original audio-visual inputs under different reasoning strategies.

The paper identifies a core problem: explicit text-based chain-of-thought (CoT) maps high-dimensional audio-visual evidence into discrete text tokens. This compression causes two issues:

Temporal grounding loss — continuous audio-visual signals are discretized, weakening the model's ability to locate events in time.
Language prior shift — the model leans on what language "expects" rather than what the sensory evidence actually shows, leading to hallucinations.

Figure 1 in the paper quantifies this: on the Daily-Omni benchmark, the Explicit Text CoT baseline allocates significantly less attention to AV tokens than LatentOmni, especially on audio-visual alignment tasks.

$$\text{Explicit Text CoT:} \quad \underbrace{H^v, H^a}_{\text{dense AV features}} \xrightarrow{\text{compress}} \underbrace{w_1, w_2, \ldots, w_n}_{\text{discrete text tokens}} \xrightarrow{\text{reason}} a$$
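One way to make the attention comparison concrete is to measure what fraction of a reasoning token's attention mass lands on audio-visual positions. The sketch below assumes that definition; the paper's exact metric may differ, and the function name `av_attention_ratio` and the toy weights are illustrative, not taken from the paper.

```python
def av_attention_ratio(attn_weights, is_av_token):
    """Fraction of attention mass that falls on audio-visual token positions.

    attn_weights: non-negative attention weights over the context (one query).
    is_av_token:  booleans of the same length, True for audio/video positions.
    """
    if len(attn_weights) != len(is_av_token):
        raise ValueError("weights and mask must have the same length")
    total = sum(attn_weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    av_mass = sum(w for w, av in zip(attn_weights, is_av_token) if av)
    return av_mass / total


# Toy context: 4 AV positions followed by 4 text positions (made-up weights).
mask = [True] * 4 + [False] * 4
text_heavy = [0.05, 0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2]
av_heavy = [0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1]

ratio_text = av_attention_ratio(text_heavy, mask)
ratio_av = av_attention_ratio(av_heavy, mask)
print(round(ratio_text, 2), round(ratio_av, 2))
```

In this toy case the second query places three times as much of its attention on AV positions as the first, which is the kind of gap the comparison in the paper is about.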

AV Token Attention Ratio

[Interactive simulation: select a reasoning method to see the share of attention it places on audio-visual tokens.]

Why this matters

The gap between text CoT and LatentOmni is widest on audio-visual alignment tasks — precisely the kind where temporal precision matters most. When the model can't attend to the original AV signal, it guesses based on language priors. When it can, it anchors on evidence.

Next: How latent reasoning works →

Chapter 2

Thinking in Latent Space

What if the model could reason directly over continuous sensory features, bypassing the text bottleneck?

In plain English

Think of a detective at a crime scene. She doesn't write a paragraph about every footprint, then reason from the paragraphs. She looks at the footprints, listens to the ambient audio, and thinks in the scene itself — forming impressions that are richer than any text description. Only after she's done reasoning does she write up her conclusion.

LatentOmni works the same way. When the model needs to cross-reference audio and visual evidence, it doesn't compress into words. Instead it generates a sequence of continuous latent vectors — high-dimensional embeddings that preserve the density of the original sensory signal. Each vector stays close to the actual audio-visual features it represents. Reasoning happens in the raw evidence space, not in a lossy text summary.

Adjust the latent token count in the simulation below and watch how the reasoning trajectory changes between text and latent phases.

The model autoregressively generates a hybrid sequence of text tokens and latent states. When it needs to revisit audio-visual evidence, it emits a special trigger token <Unified_Latent>, switches to continuous latent space, generates $K$ latent embeddings, then emits </Unified_Latent> to return to text generation.
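The mode switching described above can be sketched as a small state machine. This is a toy stand-in, not the authors' decoder: the text tokens are scripted rather than sampled, latent states are represented by placeholder labels instead of vectors, and the stop token is appended after a fixed K instead of being emitted by the model. The names `decode_hybrid`, `TRIGGER`, and `STOP` are hypothetical.

```python
TRIGGER = "<Unified_Latent>"
STOP = "</Unified_Latent>"


def decode_hybrid(text_steps, num_latents):
    """Toy decode loop that switches between text mode and latent mode.

    text_steps:  text tokens a stand-in model would emit, possibly with TRIGGER.
    num_latents: K, the number of latent states produced per latent phase.
    Returns a list of (kind, value) pairs with kind in {"text", "latent"}.
    """
    sequence = []
    latent_index = 0
    for token in text_steps:
        sequence.append(("text", token))
        if token == TRIGGER:
            # Latent mode: emit K continuous states instead of vocabulary tokens.
            # A real model would feed each hidden state back as the next input;
            # here a placeholder label stands in for the vector.
            for _ in range(num_latents):
                latent_index += 1
                sequence.append(("latent", f"z_{latent_index}"))
            # Leave latent mode and resume text generation.
            sequence.append(("text", STOP))
    return sequence


steps = ["The", "ball", TRIGGER, "so", "the", "thud", TRIGGER, "answer"]
trajectory = decode_hybrid(steps, num_latents=3)
kinds = [kind for kind, _ in trajectory]
print(trajectory)
```

With two triggers and K = 3, the trajectory contains six latent states split across two latent phases, each bracketed by the trigger and stop tokens.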

$$S = \bigl[w_{1:i},\; u,\; z_{1:K},\; u',\; w_{i+1:j},\; u,\; z_{K+1:2K},\; u',\; \ldots,\; a\bigr]$$

where $w$ denotes text tokens, $u$ is the <Unified_Latent> trigger, $u'$ is the </Unified_Latent> stop token, $z$ denotes continuous latent reasoning states, and $a$ is the final answer. Each latent state is the last-layer hidden state of the transformer:

$$z_k = \text{LM}_\theta^{(L)}\!\bigl(H^v, H^a, H^q, S_{<k}\bigr)$$

The first $K_v$ latent positions are allocated to visual features and the remaining $K_a$ positions to audio features, all sharing the same continuous space $\mathbb{R}^d$. The paper finds an optimal configuration of $K=40$ tokens: $K_v=32$ for vision, $K_a=8$ for audio.
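The split quoted above is simple arithmetic on the latent budget. A minimal sketch, assuming the visual share is expressed as a ratio and rounded to the nearest integer (the rounding rule and the function names are assumptions):

```python
def split_latent_budget(total_latents, visual_ratio):
    """Split K latent positions into (K_v, K_a) given a visual allocation ratio."""
    if not 0.0 <= visual_ratio <= 1.0:
        raise ValueError("visual_ratio must lie in [0, 1]")
    k_visual = round(total_latents * visual_ratio)
    k_audio = total_latents - k_visual
    return k_visual, k_audio


def latent_layout(total_latents, visual_ratio):
    """Label each latent position: visual positions first, then audio."""
    k_visual, k_audio = split_latent_budget(total_latents, visual_ratio)
    return ["v"] * k_visual + ["a"] * k_audio


k_v, k_a = split_latent_budget(40, 0.8)
layout = latent_layout(40, 0.8)
print(k_v, k_a)  # 32 8
```

The layout list makes the ordering explicit: positions 1 to 32 are visual and positions 33 to 40 are audio, which is the ordering the next chapter has to correct for.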

Reasoning Trajectory Explorer

[Interactive simulation. Controls: total latent tokens K (10 to 60, default 40); visual allocation ratio (0.40 to 0.95, default 0.80); latent phases (1 to 4, default 2). Readouts at the defaults: visual tokens K_v = 32, audio tokens K_a = 8, total trajectory length ~120.]

Why this matters

The hybrid design is deliberate: text provides logical scaffolding (if-then, therefore), while latent states carry evidence-intensive cross-modal detail. Removing either degrades performance — but the latent phases are where the model actually "looks again" at the original sensory input.

Next: Keeping audio and video in sync →

Chapter 3

Temporal Alignment via OSPE

Sequential generation creates a positional drift. OSPE re-synchronizes audio and visual latents that refer to the same moment.

In plain English

Picture a subtitle track on a foreign film. The subtitles arrive one by one, but they must match the exact moment someone speaks. If subtitle #5 drifts three seconds ahead of the dialogue, the whole thing breaks. The positional encoding does for latent vectors what timestamps do for subtitles: it pins each one to a physical moment in the video.

Without OSPE, the model generates visual latents first (positions 1–32), then audio latents (positions 33–40). By position 35, nothing in the sequence index tells the model that this audio latent describes the same moment as visual latent #3. OSPE injects a shared physical timestamp so that temporally co-occurring features stay close, regardless of their position in the generation sequence.

In the simulation below, drag the OSPE strength slider and watch how temporally matched audio-visual pairs move closer together in the latent space.

OSPE extends the time-aligned multimodal RoPE from Qwen2.5-Omni to the unified latent space. For a latent feature $h$ at timestamp $t$:

$$\text{OSPE}(h, t) = h \odot \cos(t\Theta) + R(h) \odot \sin(t\Theta)$$

where $\Theta = \{\theta_i\}_{i=1}^{d/2}$ is the base frequency vector (each $\theta_i$ is shared by one pair of adjacent feature dimensions), $\odot$ is the Hadamard product, and $R(\cdot)$ applies the block-diagonal 90° rotation that maps each adjacent pair $(h_{2i-1}, h_{2i})$ to $(-h_{2i}, h_{2i-1})$. OSPE assigns a shared physical timestamp $t$ to temporally corresponding visual frames and audio segments, allowing later reasoning steps to attend to temporally consistent cross-modal evidence.
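The rotation formula can be checked numerically with a few lines of plain Python. The sketch below assumes the standard RoPE frequency schedule $\theta_i = 10000^{-2i/d}$, which is not stated in this article, and the helper names `ospe_rotate` and `dot` are illustrative. It shows the property the chapter relies on: two features rotated with the same timestamp keep their original dot product, and for different timestamps the score depends only on the time offset.

```python
import math


def ospe_rotate(h, t, base=10000.0):
    """Apply a rotary position rotation to vector h at timestamp t.

    Implements h * cos(t*theta) + R(h) * sin(t*theta), where R maps each
    adjacent pair (x, y) to (-y, x). The frequency schedule
    theta_i = base ** (-2i / d) is the standard RoPE choice and is an
    assumption here, not a value taken from the paper.
    """
    d = len(h)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(t * theta), math.sin(t * theta)
        x, y = h[2 * i], h[2 * i + 1]
        out.extend([x * c - y * s, y * c + x * s])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


visual = [0.5, -1.0, 2.0, 0.25]
audio = [1.0, 0.5, -0.5, 1.5]

# Same physical timestamp: the rotation cancels in the dot product.
same_time = dot(ospe_rotate(visual, 3.0), ospe_rotate(audio, 3.0))
# Different timestamps: the score depends only on the time offset (2.0 here).
offset_a = dot(ospe_rotate(visual, 3.0), ospe_rotate(audio, 5.0))
offset_b = dot(ospe_rotate(visual, 10.0), ospe_rotate(audio, 12.0))
print(same_time, dot(visual, audio), offset_a, offset_b)
```

Because attention scores are dot products, this is why a shared timestamp lets an audio latent generated late in the sequence still line up with the visual latent for the same moment.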

OSPE Alignment Visualization

[Interactive simulation. Controls: OSPE strength (0 = off to 1 = full, default 1.00); noise level (0 to 0.80, default 0.30). Readouts: average AV distance for matched pairs, average AV distance for unmatched pairs, alignment ratio.]

Why this matters

Without OSPE, Daily-Omni drops from 67.4 to 66.0 and LVOmniBench drops from 35.1 to 33.1. The effect is smaller than removing $\mathcal{L}_\text{latent}$, but it is consistent across every benchmark — temporal alignment is a structural necessity, not a nice-to-have.

Next: The three training objectives →

Chapter 4

Three Losses, One Objective

LatentOmni is trained end-to-end with three complementary losses that balance textual fluency, sensory grounding, and temporal coherence.

In plain English

Training a model to reason in latent space is like teaching a student to show their work — except the "work" isn't written sentences, it's continuous brain activity. You need three pressures: keep the student's language skills sharp ($\mathcal{L}_\text{text}$), make sure each brain state actually corresponds to real sensory evidence ($\mathcal{L}_\text{latent}$), and keep audio and visual brain states temporally synchronized ($\mathcal{L}_\text{sync}$).

The $\mathcal{L}_\text{latent}$ term is the most critical: without it, the model's latent states drift into meaningless noise. $\mathcal{L}_\text{sync}$ is the temporal safety net. And $\mathcal{L}_\text{text}$ ensures the model can still talk fluently about what it discovered in the latent phase.

Adjust the balancing weights below to see how each loss component shapes the model's behavior.

Temporal Synchronization Loss

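Given latent visual features $h_t^v$ and audio features $h_t^a$ at matching timestamps $t \in \mathcal{T}$, the synchronization term is a symmetric InfoNCE contrastive loss, with $\text{sim}(\cdot,\cdot)$ a similarity score and $\tau$ a temperature: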

$$\mathcal{L}_\text{sync} = -\frac{1}{2|\mathcal{T}|} \sum_{t \in \mathcal{T}} \left[\log \frac{\exp(\text{sim}(h_t^v, h_t^a)/\tau)}{\sum_{t'}\exp(\text{sim}(h_t^v, h_{t'}^a)/\tau)} + \log \frac{\exp(\text{sim}(h_t^a, h_t^v)/\tau)}{\sum_{t'}\exp(\text{sim}(h_t^a, h_{t'}^v)/\tau)}\right]$$
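A minimal pure-Python sketch of this loss follows. It assumes cosine similarity for $\text{sim}$ and a temperature of 0.1; neither choice is given in this article, and the function names are illustrative.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def log_softmax_at(scores, index):
    """log of softmax(scores)[index], computed stably."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[index] - log_norm


def sync_loss(visual, audio, tau=0.1):
    """Symmetric InfoNCE over time-matched visual/audio latent pairs."""
    n = len(visual)
    total = 0.0
    for t in range(n):
        v_to_a = [cosine(visual[t], audio[j]) / tau for j in range(n)]
        a_to_v = [cosine(audio[t], visual[j]) / tau for j in range(n)]
        total += log_softmax_at(v_to_a, t) + log_softmax_at(a_to_v, t)
    return -total / (2 * n)


visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_audio = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shifted_audio = aligned_audio[1:] + aligned_audio[:1]  # off by one timestamp

loss_aligned = sync_loss(visual, aligned_audio)
loss_shifted = sync_loss(visual, shifted_audio)
print(loss_aligned < loss_shifted)  # True
```

The loss is small when each visual latent is most similar to the audio latent at its own timestamp, and grows when the pairing is shifted in time, which is the pressure the term is meant to apply.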

Latent Alignment Loss

Each generated latent state $z_k$ is aligned with a dense anchor $a_k$ from the encoder features:

$$\mathcal{L}_\text{latent} = \frac{1}{K}\sum_{k=1}^{K} \|z_k - a_k\|_2^2$$
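As a worked example of the formula, the sketch below computes the loss for two latent states. Note that the formula averages over the $K$ states but sums over feature dimensions, so it differs from a per-element mean squared error by a factor of $d$. The function name and the toy vectors are illustrative.

```python
def latent_alignment_loss(latents, anchors):
    """Mean over K latent states of the squared L2 distance to their anchors."""
    if len(latents) != len(anchors):
        raise ValueError("need one anchor per latent state")
    total = 0.0
    for z, a in zip(latents, anchors):
        total += sum((zi - ai) ** 2 for zi, ai in zip(z, a))
    return total / len(latents)


anchors = [[1.0, 2.0], [0.0, -1.0]]
on_target = [[1.0, 2.0], [0.0, -1.0]]
drifted = [[1.0, 0.0], [3.0, 3.0]]

print(latent_alignment_loss(on_target, anchors))  # 0.0
print(latent_alignment_loss(drifted, anchors))    # 14.5
```

The drifted case works out as (0² + 2²) + (3² + 4²) = 29, divided by K = 2.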

Text Prediction Loss

Standard cross-entropy over discrete tokens in the hybrid sequence:

$$\mathcal{L}_\text{text} = -\frac{1}{N_\text{text}}\sum_{t=1}^{L} \mathbb{I}(s_t \in \mathcal{V})\,\log\,p(s_t \mid S_{<t}, H^v, H^a, H^q)$$
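The indicator in this loss simply masks out latent positions, so only discrete tokens contribute. A minimal sketch, assuming the model's probability for each target token is already available (the function name and the toy probabilities are illustrative):

```python
import math


def text_loss(token_probs, is_text_token):
    """Cross-entropy averaged over discrete (vocabulary) positions only.

    token_probs:   model probability assigned to the target at each position;
                   ignored at latent positions.
    is_text_token: indicator I(s_t in V), False for continuous latent states.
    """
    n_text = sum(is_text_token)
    if n_text == 0:
        raise ValueError("sequence contains no text tokens")
    log_likelihood = sum(
        math.log(p) for p, keep in zip(token_probs, is_text_token) if keep
    )
    return -log_likelihood / n_text


# Hybrid sequence: two text tokens, two latent states, one text token.
probs = [0.5, 0.25, None, None, 0.5]
mask = [True, True, False, False, True]
print(text_loss(probs, mask))
```

Here the normalizer is 3, the number of text positions, not the full sequence length of 5, so adding more latent states does not dilute the text loss.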

Combined Objective

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{text} + \lambda_1\,\mathcal{L}_\text{latent} + \lambda_2\,\mathcal{L}_\text{sync}$$
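To see how the weights shift the balance, the sketch below evaluates the combined objective and the share contributed by the latent term. The loss values are made up for illustration, and the default weights of 1.0 are an assumption; this article does not state the values of $\lambda_1$ and $\lambda_2$ used in training.

```python
def total_loss(l_text, l_latent, l_sync, lambda_1=1.0, lambda_2=1.0):
    """L_total = L_text + lambda_1 * L_latent + lambda_2 * L_sync."""
    return l_text + lambda_1 * l_latent + lambda_2 * l_sync


def latent_share(l_text, l_latent, l_sync, lambda_1=1.0, lambda_2=1.0):
    """Fraction of the total objective contributed by the weighted latent term."""
    total = total_loss(l_text, l_latent, l_sync, lambda_1, lambda_2)
    return lambda_1 * l_latent / total


# Made-up loss values, chosen only to show the effect of the weights.
print(total_loss(2.0, 1.0, 1.0))                  # 4.0
print(latent_share(2.0, 1.0, 1.0))                # 0.25
print(latent_share(2.0, 1.0, 1.0, lambda_1=2.0))  # 0.4
```

Doubling $\lambda_1$ raises the latent term's share of the objective from a quarter to two fifths in this toy setting.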

Loss Component Explorer

[Interactive simulation. Controls: λ₁, the latent weight (0 to 3, default 1.00); λ₂, the sync weight (0 to 3, default 1.00). Readouts: L_total, L_latent contribution, simulated Daily-Omni score.]

Why this matters

Ablation confirms $\mathcal{L}_\text{latent}$ is the single most important component. Removing it drops Daily-Omni from 67.4 to 61.0, a 6.4-point collapse. Removing $\mathcal{L}_\text{sync}$ costs only 1.5 points, but the drop appears on every benchmark, which marks it as a structural complement rather than an optional extra.

Next: Building the training data →

Chapter 5

Crafting 35K Reasoning Trajectories

Latent-space supervision requires data that doesn't exist: reasoning chains annotated with precise audio-visual segment references.

In plain English

Imagine you're building a textbook for students learning to reason about videos. Each exercise needs: (1) a good question that requires both seeing and hearing, (2) a description of exactly which video segment and audio clip contain the answer, and (3) a step-by-step solution that references those clips at the right moment.

That textbook doesn't exist. Existing datasets give you a video, a question, and an answer — but no reasoning chain, and certainly no timestamped segment annotations. So the authors build one from scratch: LatentOmni-Instruct-35K. They use LLMs to generate the questions, verify quality with a second LLM, synthesize segment-level captions for both audio and video independently, then weave everything into interleaved reasoning trajectories with explicit <Unified_Latent> markers.

The final audit pass uses a third LLM to catch hallucinated citations and temporal contradictions — only clean trajectories survive.

01

AVQA Synthesis & Filtering

Raw samples from ASID and AVoCaDO are transformed into cross-modal QA pairs using Qwen3-235B. Each pair receives quality scores for difficulty, logical soundness, and modality dependency. Samples scoring below 13 are discarded; category balance is enforced.

02

Segment-Level Captioning

Audio and video streams are segmented by timestamp. Qwen3-30B generates separate audio and video captions per segment. GLM-4.7 filters hallucinations, repairs shot fragmentation, and realigns captions temporally.

03

Trajectory Synthesis

GLM-4.7 generates reasoning chains with explicit AV-segment markers. Gemini-2.5-Flash audits for citation errors and contradictions. After filtering, markers are replaced with actual AV segments to produce the final 35K trajectories.
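The filtering in step 01 can be sketched as a threshold followed by a per-category cap. Two things here are assumptions rather than details from the paper: that the threshold of 13 applies to the sum of the three quality scores, and that balance is enforced by keeping the top-scoring samples per category. The field names, `filter_and_balance`, and `make` are hypothetical.

```python
from collections import defaultdict

SCORE_KEYS = ("difficulty", "logical_soundness", "modality_dependency")


def filter_and_balance(samples, min_total=13, per_category_cap=2):
    """Drop low-scoring QA pairs, then cap how many survive per category.

    Assumptions (not from the paper): the threshold applies to the sum of
    the three scores, and balance is enforced by keeping the top-scoring
    `per_category_cap` samples in each category.
    """
    by_category = defaultdict(list)
    for sample in samples:
        total = sum(sample[key] for key in SCORE_KEYS)
        if total >= min_total:
            by_category[sample["category"]].append((total, sample))

    kept = []
    for category in sorted(by_category):
        ranked = sorted(by_category[category], key=lambda pair: -pair[0])
        kept.extend(sample for _, sample in ranked[:per_category_cap])
    return kept


def make(qid, category, difficulty, logic, modality):
    return {"id": qid, "category": category, "difficulty": difficulty,
            "logical_soundness": logic, "modality_dependency": modality}


pool = [
    make("q1", "temporal", 5, 5, 5),  # total 15: passes, kept
    make("q2", "temporal", 4, 4, 4),  # total 12: below threshold, discarded
    make("q3", "temporal", 5, 4, 4),  # total 13: passes, dropped by the cap
    make("q4", "temporal", 5, 5, 4),  # total 14: passes, kept
    make("q5", "counting", 5, 5, 3),  # total 13: passes, kept
]
kept = filter_and_balance(pool)
print([sample["id"] for sample in kept])  # ['q5', 'q1', 'q4']
```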

35K

High-quality audio-visual interleaved reasoning trajectories

2

Source datasets: ASID + AVoCaDO (temporally aligned AV captions)

3

LLMs in the pipeline: Qwen3-235B, GLM-4.7, Gemini-2.5-Flash

Design insight

The key challenge isn't generating questions — it's knowing which audio-visual segments are relevant to each reasoning step. The pipeline explicitly annotates these, providing the segment-level grounding that makes $\mathcal{L}_\text{latent}$ supervision possible. Without this data, the model has no anchor for its latent states.

Next: Benchmark results →

Chapter 6

Best Open-Source, Period

Four benchmarks, four wins. LatentOmni outperforms every evaluated open-source model — and rivals proprietary systems.

Benchmark Comparison

[Interactive chart comparing models across benchmarks; individual benchmarks can be toggled in the legend.]

+6.1pp

Gain over base model on OmniVideoBench (29.3 → 35.4)

67.4%

Daily-Omni accuracy — best among all open-source models

60.8%

VideoMME (vision-only) — beats Monet (51.6) and LVR (36.7)
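The OmniVideoBench headline can be verified with a line of arithmetic. The sketch below uses only the two scores quoted above; the helper name `gains` is illustrative.

```python
def gains(base, new):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - base
    relative = 100.0 * absolute / base
    return absolute, relative


abs_gain, rel_gain = gains(base=29.3, new=35.4)
print(f"+{abs_gain:.1f} pp, {rel_gain:.1f}% relative")  # +6.1 pp, 20.8% relative
```

The relative figure of 20.8% is what the introduction rounds to "a 21% relative gain".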

Why this matters

The improvement is not from more parameters or more training data — it's from the same 7B model with a different reasoning mechanism. LatentOmni uses the same backbone (Qwen2.5-Omni-7B), the same compute budget, and a modest 35K training set. The gains come entirely from preserving dense audio-visual evidence during reasoning.

Next: What makes it tick (ablation) →

Chapter 7

What Matters Most

Remove one component at a time and measure the damage. The ablation reveals a clear hierarchy of importance.

Token Configuration Explorer

[Interactive simulation. Controls: total latent tokens K (20 to 50, default 40); visual fraction (0.50 to 0.95, default 0.80). Estimated scores at the defaults: Daily-Omni 67.4, WorldSense 48.9, OmniVideoBench 35.4.]

The hierarchy

$\mathcal{L}_\text{latent}$ (feature supervision) > Audio in latent space > Visual in latent space > OSPE > $\mathcal{L}_\text{sync}$. The ablation table confirms that every component helps, but $\mathcal{L}_\text{latent}$ is the load-bearing wall. The optimal token configuration is 40 total, split 32 visual + 8 audio.
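The before/after scores quoted in earlier chapters can be tabulated to see the size of each drop. The sketch below uses only the three ablated scores that this article states explicitly; it does not reproduce the paper's full ablation table, and the variable names are illustrative.

```python
full_model = {"Daily-Omni": 67.4, "LVOmniBench": 35.1}

# Ablated scores as quoted in this article; only these three are given.
ablations = [
    ("without L_latent", "Daily-Omni", 61.0),
    ("without OSPE", "Daily-Omni", 66.0),
    ("without OSPE", "LVOmniBench", 33.1),
]

drops = {
    (name, bench): round(full_model[bench] - score, 1)
    for name, bench, score in ablations
}
for (name, bench), drop in drops.items():
    print(f"{name:17s} {bench:12s} -{drop} points")
```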

Next: Conclusion →

Chapter 8

Latent Reasoning Works

The paper closes with a clear verdict: preserving continuous sensory evidence during reasoning is a practical, effective path toward stronger omni-modal understanding.

The central claim of LatentOmni is that not all reasoning should pass through text. For tasks that require fine-grained audio-visual integration — temporal synchronization, cross-modal alignment, long-form event understanding — keeping part of the reasoning process in continuous latent space gives the model access to denser, less lossy evidence.

The framework achieves this through three mechanisms that work together:

Interleaved text-latent reasoning — Text provides logical structure; latent states carry evidence-intensive detail.
Feature-level supervision — $\mathcal{L}_\text{latent}$ anchors each latent vector to real sensory features; $\mathcal{L}_\text{sync}$ ensures temporal consistency.
OSPE — Time-aligned position embeddings prevent sequential generation from destroying temporal correspondence.

The results speak for themselves: best among open-source models on all four benchmarks, outperforming the text CoT baseline by clear margins, and competitive with much larger proprietary systems.

The bigger picture

LatentOmni suggests a broader principle: the reasoning medium should match the evidence medium. When the evidence is continuous and multimodal, forcing reasoning into discrete text is a bottleneck. This insight likely extends beyond audio-visual understanding to any domain where dense, structured data must be integrated — medical imaging, scientific simulation, robotic perception. The latent reasoning paradigm is still early, but LatentOmni makes a strong case that it deserves to be a central research direction.
