Imagine searching a massive audio library for "a clip where a dog barks before a car horn, with no siren." Today's audio search engines would find dog barks and car horns easily — but they'd ignore the ordering (dog first, horn second) and miss the negation (no siren). They match sounds; they don't reason about them.
The authors built ReasonAudio — the first benchmark that tests whether retrieval models can handle logical constraints like negation, temporal ordering, overlap, and duration. They synthesized 10,000 audio clips from 200 atomic sounds and generated 1,000 search queries designed to require reasoning, not just matching. Think of it as giving audio search engines a standardized test where every question has a logical twist.
The result: every model struggles. The best performer, OmniEmbed-7B, manages only 20.1% average accuracy, with worse-than-random performance on negation tasks. Multimodal large language models, despite strong reasoning in text, lose that ability when fine-tuned for retrieval. The paper exposes a fundamental gap: current systems match sounds but cannot think about them.
Modern audio retrieval systems are remarkably good at finding a dog bark when you type "dog barking." But real queries demand more than matching — they demand reasoning.
Prior Text-Audio Retrieval benchmarks fall into three categories, all focused on semantic matching:
None of these benchmarks require the model to understand that "a clip with dog sounds but no cat sounds" means the cat must be absent. The retrieval task is formulated as a similarity problem:
$$f : Q \times D \to \mathbb{R}$$
where $Q = \{q_1, \ldots, q_n\}$ are text queries, $D = \{d_1, \ldots, d_m\}$ is the audio corpus, and the scoring function $f(q_i, d_j)$ maps each query-audio pair to a real-valued relevance score. The goal is to rank relevant audio $d^+$ above irrelevant $d^-$, that is, to have $f(q_i, d^+) > f(q_i, d^-)$.
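A minimal sketch of this formulation, assuming queries and clips have already been embedded as plain vectors and that $f$ is cosine similarity. The text above does not say which scoring function the evaluated models use, so cosine is an assumption here, and the names `cosine` and `rank_corpus` as well as the toy vectors are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors; stands in for f(q, d)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_corpus(query_vec, corpus):
    """Return clip ids sorted by descending score f(q, d_j)."""
    scored = [(cosine(query_vec, vec), clip_id) for clip_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [clip_id for _, clip_id in scored]

# Hypothetical 2-d embeddings, not real model outputs.
corpus = {"d_pos": [0.9, 0.1], "d_neg": [0.1, 0.9]}
ranking = rank_corpus([1.0, 0.0], corpus)
print(ranking)
```

Retrieval succeeds for a query when its relevant clip appears ahead of the irrelevant ones in `ranking`.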
In semantic matching benchmarks, relevance is determined by surface-level content overlap. In ReasonAudio, relevance depends on satisfying logical and temporal constraints that go far beyond keyword overlap.
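The difference between the two notions of relevance can be sketched with the dog, horn, and siren query from the opening. This is an illustration only: it assumes each clip comes with event annotations (label, start, end in seconds), and it reads "before" as "the first sound ends no later than the second one starts". The names `Event`, `overlap_relevant`, and `constraint_relevant` are hypothetical and are not the benchmark's labeling code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    label: str
    start: float  # seconds
    end: float    # seconds

def overlap_relevant(events, wanted):
    """Surface matching: relevant if every wanted sound occurs somewhere."""
    present = {e.label for e in events}
    return set(wanted) <= present

def constraint_relevant(events, first, then, absent):
    """Reasoning: `first` must finish before `then` starts, and `absent` must not occur."""
    if any(e.label == absent for e in events):
        return False
    firsts = [e for e in events if e.label == first]
    thens = [e for e in events if e.label == then]
    return any(a.end <= b.start for a in firsts for b in thens)

# Three hypothetical clips that all contain a dog bark and a car horn.
good = [Event("dog_bark", 0.0, 1.5), Event("car_horn", 2.0, 3.0)]
wrong_order = [Event("car_horn", 0.0, 1.0), Event("dog_bark", 2.0, 3.5)]
with_siren = good + [Event("siren", 3.0, 5.0)]
```

All three clips pass `overlap_relevant(clip, ["dog_bark", "car_horn"])`, but only `good` passes `constraint_relevant(clip, "dog_bark", "car_horn", "siren")`.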
ReasonAudio defines five task types, each isolating a distinct reasoning capability that current models lack.
200 atomic sounds. 10,000 composite clips. 1,000 reasoning-intensive queries. A three-stage pipeline designed for deterministic quality control.
Several design choices ensure benchmark reliability:
Three paradigms, ten models, one dismal result. From two-stage pipelines to billion-parameter MLLMs, no approach solves reasoning-intensive retrieval.
The paper evaluates three families of models:
20.1%. That is the best average accuracy any model achieves. On some individual tasks, the numbers are far worse.
When you control for sound matching, do models still reason? A two-option test reveals the ugly truth.
The two-option multiple-choice setup controls for sound matching by ensuring both candidates contain the same target sounds. The negative candidate differs only in violating the query's logical constraints, for instance by including a hailstorm sound when the query says "without hailstorm." With 120 such questions per task and a 50% chance baseline for two options, this setup isolates reasoning ability from acoustic matching.
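The scoring of such a test can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: `two_option_accuracy` is a hypothetical helper, the two questions are made up, and `label_overlap_score` is a deliberately naive scorer that counts the sounds a query mentions and ignores the word "without".

```python
def two_option_accuracy(questions, score):
    """Fraction of questions where the positive clip outscores the negative one.

    Each question is (query, positive, negative); ties count as wrong.
    """
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(1 for q, pos, neg in questions if score(q, pos) > score(q, neg))
    return correct / len(questions)

CHANCE = 0.5  # two options, so random guessing gets half right in expectation

def label_overlap_score(query_labels, clip_labels):
    """Naive scorer: number of sounds mentioned in the query that occur in the clip."""
    return len(set(query_labels) & set(clip_labels))

questions = [
    # "rain without hailstorm": the query mentions both sounds
    (["rain", "hailstorm"], ["rain"], ["rain", "hailstorm"]),
    # "dog with no siren"
    (["dog", "siren"], ["dog"], ["dog", "siren"]),
]
acc = two_option_accuracy(questions, label_overlap_score)
print(acc, acc < CHANCE)
```

Because the naive scorer rewards the forbidden sound, it prefers the negative candidate every time, which is one way a matching-only system can land below the chance baseline rather than at it.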
The t-SNE plot tells the whole story: positive and negative audio samples occupy the same region of embedding space. The model literally cannot tell them apart.
The authors visualized text and audio embeddings using t-SNE, sampling five queries. For each query, positive audio samples (matching the query) and negative samples (violating constraints) are plotted together with the query embedding. The results reveal substantial misalignment:
This misalignment directly explains the below-random performance on negation tasks: the embedding space has no mechanism to encode logical constraints.
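A simple numeric companion to the visual check is to compare how far apart the positive and negative groups are with how spread out each group is. The sketch below is not the authors' t-SNE analysis; it swaps in a plain centroid-distance ratio, and `separation_ratio` and the toy 2-d points are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_dist_to(point, vectors):
    """Average Euclidean distance from `point` to each vector."""
    return sum(math.dist(point, v) for v in vectors) / len(vectors)

def separation_ratio(positives, negatives):
    """Distance between group centroids divided by the average within-group spread.

    Values near 0 mean the groups overlap; large values mean they are well separated.
    """
    c_pos, c_neg = centroid(positives), centroid(negatives)
    between = math.dist(c_pos, c_neg)
    within = (mean_dist_to(c_pos, positives) + mean_dist_to(c_neg, negatives)) / 2
    return between / within if within else float("inf")

# Hypothetical embeddings: interleaved groups versus clearly separated groups.
overlapping = separation_ratio([[0, 0], [1, 1]], [[0, 1], [1, 0]])
separated = separation_ratio([[0, 0], [0, 1]], [[10, 0], [10, 1]])
print(overlapping, separated)
```

A ratio close to zero for a query's positive and negative clips would correspond to the picture described above, where both groups sit in the same region of the embedding space.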