When you ask a multimodal AI to reason about a video — "What sound played when the ball bounced?" — it typically thinks in text. It compresses the rich audio waveform and the 30-frame-per-second video into English sentences, then reasons over those sentences. That compression throws away the exact timing cues the model needs to answer correctly. LatentOmni asks a simple question: what if the model could think directly in the continuous sensory signal instead of forcing everything through English first?
The answer is a framework that interleaves normal English reasoning with "latent reasoning phases" — stretches where the model generates continuous vectors that stay grounded in the original audio-visual features rather than collapsing into words. A new position embedding system (OSPE) keeps audio and video temporally synchronized during these latent phases, and a feature-level supervision loss forces each latent vector to stay close to the actual sensory evidence it represents. The result is a model that attends to the original video and audio 2–3× more than a text-only baseline.
On four benchmarks spanning everyday events, physical commonsense, fine-grained audio typing, and long-form video understanding, LatentOmni achieves the best results among all evaluated open-source models, outperforming even specialized latent-reasoning methods on vision-only tasks. On OmniVideoBench it improves over the base Qwen2.5-Omni-7B by 6.1 percentage points (a 21% relative gain), confirming that preserving dense sensory evidence during reasoning is not just theoretically cleaner, but practically decisive.
Multimodal models can see and hear. So why do they reason only in English?
The paper identifies a core problem: explicit text-based chain-of-thought (CoT) maps high-dimensional audio-visual evidence into discrete text tokens. This compression causes two issues:
Figure 1 in the paper quantifies this: on the Daily-Omni benchmark, the Explicit Text CoT baseline allocates significantly less attention to AV tokens than LatentOmni, especially on audio-visual alignment tasks.
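To make "attention allocated to AV tokens" concrete, the sketch below computes the share of attention mass that lands on audio-visual key positions. It is a minimal illustration, not the paper's measurement protocol: the function name `av_attention_share`, the single-head matrix layout, and the toy numbers are all assumptions introduced here.

```python
from typing import List, Sequence


def av_attention_share(attn: List[List[float]], av_positions: Sequence[int]) -> float:
    """Fraction of total attention mass that lands on audio-visual key positions.

    attn[q][k] is the attention weight from query position q to key position k;
    each row is expected to sum to 1.
    """
    av = set(av_positions)
    on_av = sum(w for row in attn for k, w in enumerate(row) if k in av)
    return on_av / sum(sum(row) for row in attn)


# Toy matrices (made-up values, not results from the paper).
# Keys 0-1 stand for AV tokens, keys 2-3 for text tokens.
text_cot = [[0.05, 0.05, 0.5, 0.4], [0.1, 0.1, 0.4, 0.4]]
latent = [[0.3, 0.3, 0.2, 0.2], [0.25, 0.25, 0.25, 0.25]]

share_text = av_attention_share(text_cot, [0, 1])
share_latent = av_attention_share(latent, [0, 1])
```

In a real model the same ratio would have to be aggregated over heads and layers, and how that aggregation is done is a choice this sketch does not address.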
What if the model could reason directly over continuous sensory features, bypassing the text bottleneck?
The model autoregressively generates a hybrid sequence of text tokens and latent states. When it needs to revisit audio-visual evidence, it emits a special trigger token <Unified_Latent>, switches to continuous latent space, generates $K$ latent embeddings, then emits </Unified_Latent> to return to text generation.
In the paper's notation, $w$ denotes text tokens, $u$ is the <Unified_Latent> trigger, $u'$ is the stop token </Unified_Latent>, $z$ denotes continuous latent reasoning states, and $a$ is the final answer. Each latent state is the last-layer hidden state of the transformer.
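One way to write the hybrid sequence and the latent state from these definitions is sketched below. This is an assumption rather than the paper's exact notation: the indices $m$ and $n$, the layer count $L$, the position index $p_k$, and the restriction to a single latent phase are introduced here for illustration.

```latex
% Assumed layout of one hybrid sequence with a single latent phase
S = \bigl(\underbrace{w_1,\dots,w_m}_{\text{text}},\; u,\;
          \underbrace{z_1,\dots,z_K}_{\text{latent}},\; u',\;
          \underbrace{w_{m+1},\dots,w_n}_{\text{text}},\; a\bigr)

% Assumed form of a latent state: the top-layer hidden state at its position
z_k = h^{(L)}_{p_k} \in \mathbb{R}^d, \qquad k = 1,\dots,K
```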
The first $K_v$ latent positions are allocated to visual features and the remaining $K_a$ positions to audio features, all sharing the same continuous space $\mathbb{R}^d$. The paper finds an optimal configuration of $K=40$ tokens: $K_v=32$ for vision, $K_a=8$ for audio.
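The mode switching described above can be sketched as a small decoding loop. This is a toy illustration, not the authors' implementation: `hybrid_decode`, the `next_token` and `next_latent` callables, the `<eos>` token, and the choice to force the stop token after exactly $K$ latent states are all assumptions made for the example.

```python
from typing import Callable, List, Tuple, Union

OPEN, CLOSE, EOS = "<Unified_Latent>", "</Unified_Latent>", "<eos>"
Vector = List[float]
Item = Union[str, Vector]


def hybrid_decode(
    next_token: Callable[[List[Item]], str],
    next_latent: Callable[[List[Item]], Vector],
    k_v: int = 32,
    k_a: int = 8,
    max_steps: int = 256,
) -> Tuple[List[Item], List[str]]:
    """Generate a mixed sequence of text tokens and latent vectors.

    Returns the sequence plus a parallel list of tags:
    "text", "visual" or "audio".
    """
    seq: List[Item] = []
    tags: List[str] = []
    for _ in range(max_steps):
        tok = next_token(seq)
        seq.append(tok)
        tags.append("text")
        if tok == EOS:
            break
        if tok == OPEN:
            # Latent phase: K = k_v + k_a continuous states, visual slots first.
            for k in range(k_v + k_a):
                seq.append(next_latent(seq))
                tags.append("visual" if k < k_v else "audio")
            # Assumption: the stop token is forced after exactly K latent states.
            seq.append(CLOSE)
            tags.append("text")
    return seq, tags


# Scripted stand-ins for the model's two output heads.
script = iter(["The", "ball", OPEN, "bounces", EOS])
seq, tags = hybrid_decode(
    next_token=lambda s: next(script),
    next_latent=lambda s: [0.0] * 4,  # placeholder for a last-layer hidden state
)
```

With the default $K_v=32$ and $K_a=8$, the single latent phase contributes 40 continuous entries between the trigger and stop tokens.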
Sequential generation creates a positional drift. OSPE re-synchronizes audio and visual latents that refer to the same moment.
OSPE extends the time-aligned multimodal RoPE from Qwen2.5-Omni to the unified latent space. A latent feature $h$ at timestamp $t$ is rotated by $R(\cdot)$, the block-diagonal rotation matrix over adjacent feature dimensions, with rotation angles formed from $t$ and the base frequency vector $\Theta = \{\theta_i\}_{i=1}^{d/2}$ through the Hadamard product $\odot$.
OSPE assigns a shared physical timestamp $t$ to temporally corresponding visual frames and audio segments, allowing later reasoning steps to attend to temporally consistent cross-modal evidence.
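A minimal sketch of why a shared timestamp helps, assuming the rotary form $\tilde{h} = R(t\,\Theta)\,h$ with a scalar timestamp. The scalar $t$, the base-10000 frequency schedule, and the helper names `rotate` and `dot` are assumptions for illustration; Qwen2.5-Omni's actual position scheme also carries spatial components that are omitted here.

```python
import math
from typing import List


def rotate(h: List[float], t: float, thetas: List[float]) -> List[float]:
    """Rotate each adjacent pair (h[2i], h[2i+1]) by the angle t * thetas[i]."""
    assert len(h) == 2 * len(thetas)
    out: List[float] = []
    for i, theta in enumerate(thetas):
        c, s = math.cos(t * theta), math.sin(t * theta)
        x, y = h[2 * i], h[2 * i + 1]
        out += [c * x - s * y, s * x + c * y]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


d = 8
# Assumed frequency schedule: the one from the original RoPE formulation.
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]

visual = [0.3, -1.2, 0.8, 0.1, -0.5, 0.9, 1.1, -0.7]
audio = [1.0, 0.4, -0.6, 0.2, 0.7, -0.3, 0.5, 0.8]

# Same timestamp: the two rotations cancel in the dot product.
same_t = dot(rotate(visual, 2.5, thetas), rotate(audio, 2.5, thetas))
# Different timestamps: the score depends only on the time gap (3.0 in both cases).
gap_a = dot(rotate(visual, 1.0, thetas), rotate(audio, 4.0, thetas))
gap_b = dot(rotate(visual, 6.0, thetas), rotate(audio, 9.0, thetas))
```

Because the attention score between two rotated vectors depends only on their timestamp difference, audio and visual latents that share $t$ are scored as if they sat at the same position, regardless of where they fall in the generated sequence.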
LatentOmni is trained end-to-end with three complementary losses that balance textual fluency, sensory grounding, and temporal coherence.
Given latent visual features $h_t^v$ and audio features $h_t^a$ at matching timestamps $t \in \mathcal{T}$, a symmetric InfoNCE contrastive loss pulls each temporally matched audio-visual pair together and pushes mismatched pairs apart.
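A minimal pure-Python sketch of a symmetric InfoNCE loss follows. It uses the standard formulation, with cosine similarity and a temperature `tau` as assumptions; the paper's similarity function, temperature, and negative sampling may differ, and the function names are hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def info_nce(queries: List[List[float]], keys: List[List[float]], tau: float = 0.1) -> float:
    """Mean cross-entropy of selecting keys[i] for queries[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / tau for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def symmetric_info_nce(h_v: List[List[float]], h_a: List[List[float]], tau: float = 0.1) -> float:
    """Average of the visual-to-audio and audio-to-visual directions."""
    return 0.5 * (info_nce(h_v, h_a, tau) + info_nce(h_a, h_v, tau))


# Toy features at three timestamps (made-up values).
h_v = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h_a = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [h_a[1], h_a[2], h_a[0]]  # audio paired with the wrong timestamps

loss_aligned = symmetric_info_nce(h_v, h_a)
loss_shuffled = symmetric_info_nce(h_v, shuffled)
```

The aligned pairing yields a much smaller loss than the shuffled one, which is the pressure that keeps same-timestamp audio and visual latents close.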
Each generated latent state $z_k$ is aligned with a dense anchor $a_k$ from the encoder features.
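The distance used for this alignment is not stated in the text here. Two common choices, shown purely as assumptions, are a mean cosine distance and a mean squared error over the $K$ latent positions:

```latex
% Assumed option 1: mean cosine distance between latent state and anchor
\mathcal{L}_{\text{align}}^{\cos}
  = \frac{1}{K} \sum_{k=1}^{K}
    \left( 1 - \frac{z_k^{\top} a_k}{\lVert z_k \rVert \, \lVert a_k \rVert} \right)

% Assumed option 2: mean squared error between latent state and anchor
\mathcal{L}_{\text{align}}^{\text{mse}}
  = \frac{1}{K} \sum_{k=1}^{K} \lVert z_k - a_k \rVert_2^2
```

Either form is zero only when every latent state matches its anchor (in direction for the first, exactly for the second), which is the grounding behavior the supervision is meant to enforce.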
The text objective is the standard cross-entropy over the discrete tokens in the hybrid sequence.
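The sketch below shows one way to restrict cross-entropy to the discrete positions and then combine the three objectives. Marking latent positions with `None`, the weighted-sum form, and the weight names `lam_align` and `lam_sync` are assumptions for illustration, not values or formulas taken from the paper.

```python
import math
from typing import List, Optional


def masked_cross_entropy(logits: List[List[float]], targets: List[Optional[int]]) -> float:
    """Average -log softmax(logits)[target] over positions that have a token target.

    Positions whose target is None (latent states) are skipped.
    """
    total, count = 0.0, 0
    for row, tgt in zip(logits, targets):
        if tgt is None:
            continue
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tgt]
        count += 1
    return total / count


def total_loss(l_text: float, l_align: float, l_sync: float,
               lam_align: float = 1.0, lam_sync: float = 1.0) -> float:
    """Assumed combination: text loss plus weighted grounding and sync terms."""
    return l_text + lam_align * l_align + lam_sync * l_sync


# Three positions over a 3-token vocabulary; the middle one is a latent state.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
targets = [0, None, 1]

l_text = masked_cross_entropy(logits, targets)
combined = total_loss(l_text, l_align=0.4, l_sync=0.2, lam_align=0.5, lam_sync=0.1)
```

Only the first and third positions contribute to `l_text`; the latent position is left to the grounding and synchronization terms.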
Latent-space supervision requires data that doesn't exist: reasoning chains annotated with precise audio-visual segment references.
Four benchmarks, four wins. LatentOmni outperforms every evaluated open-source model — and rivals proprietary systems.
Remove one component at a time and measure the damage. The ablation reveals a clear hierarchy of importance.
The paper closes with a clear verdict: preserving continuous sensory evidence during reasoning is a practical, effective path toward stronger omni-modal understanding.
The central claim of LatentOmni is that not all reasoning should pass through text. For tasks that require fine-grained audio-visual integration — temporal synchronization, cross-modal alignment, long-form event understanding — keeping part of the reasoning process in continuous latent space gives the model access to denser, less lossy evidence.
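The framework achieves this through three mechanisms that work together: latent reasoning phases interleaved with text, OSPE to keep audio and video temporally synchronized, and feature-level supervision that keeps each latent state close to the sensory evidence it represents.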
The results speak for themselves: best among open-source models on all four benchmarks, outperforming the text CoT baseline by clear margins, and competitive with much larger proprietary systems.