An Interactive Reading of

ReasonAudio: A Benchmark for Reasoning Beyond Matching in Text-Audio Retrieval

The paper, in plain English

Imagine searching a massive audio library for "a clip where a dog barks before a car horn, with no siren." Today's audio search engines would find dog barks and car horns easily — but they'd ignore the ordering (dog first, horn second) and miss the negation (no siren). They match sounds; they don't reason about them.

The authors built ReasonAudio — the first benchmark that tests whether retrieval models can handle logical constraints like negation, temporal ordering, overlap, and duration. They synthesized 10,000 audio clips from 200 atomic sounds and generated 1,000 search queries designed to require reasoning, not just matching. Think of it as giving audio search engines a standardized test where every question has a logical twist.

The result: every model struggles. The best performer, OmniEmbed-7B, manages only 20.1% average Acc@1, and on a two-option negation test accuracy falls to 27.5%, below the 50% chance level. Multimodal large language models, despite strong reasoning in text, lose that ability when fine-tuned for retrieval. The paper exposes a fundamental gap: current systems match sounds but cannot think about them.

I. Reasoning, Not Matching
Current benchmarks test whether a model can match "dog barking" text to dog-barking audio. ReasonAudio asks: can the model handle "dog barking before a car horn, without a siren"?

II. The Backbone Paradox
Multimodal LLMs like Qwen2 understand negation and ordering in text. But when contrastive fine-tuning converts them into retrieval models, that reasoning capability vanishes, because the training emphasizes similarity matching over logic.

III. Embeddings Don't Encode Logic
t-SNE visualization reveals that positive and negative audio samples are virtually indistinguishable in the shared embedding space. The model has no way to represent "without X"; it can only measure similarity.

Chapter 1

The Matching Trap

Modern audio retrieval systems are remarkably good at finding a dog bark when you type "dog barking." But real queries demand more than matching — they demand reasoning.

Prior text-audio retrieval benchmarks fall into three categories, all focused on semantic matching.

None of these benchmarks require the model to understand that "a clip with dog sounds but no cat sounds" means the cat must be absent. The retrieval task is formulated as a similarity problem:

$$f(q_i, d_j) \rightarrow \text{rank relevant } d^+ \text{ above irrelevant } d^-$$

where $Q = \{q_1, \ldots, q_n\}$ are text queries, $D = \{d_1, \ldots, d_m\}$ is the audio corpus, and the scoring function $f(q_i, d_j)$ maps each query-audio pair to a real-valued relevance score. The goal is to rank relevant audio $d^+$ above irrelevant $d^-$.
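
To make the formulation concrete, here is a minimal sketch of similarity-based ranking. It assumes the scoring function $f$ is cosine similarity between precomputed embeddings, which is a common choice but is not stated in the text above; the two-dimensional vectors and the ids `d_plus` and `d_minus` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_corpus(query_vec, corpus_vecs):
    """Return audio ids sorted by descending score f(q, d)."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in corpus_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical 2-d embeddings, for illustration only.
query = [1.0, 0.0]
corpus = {"d_plus": [0.9, 0.1], "d_minus": [0.1, 0.9]}
ranking = rank_corpus(query, corpus)  # d_plus is ranked first
```

Nothing in this scoring step looks at the logical structure of the query; whatever the query means has to be captured by the embedding itself.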

In semantic matching benchmarks, relevance is determined by surface-level content overlap. In ReasonAudio, relevance depends on satisfying logical and temporal constraints that go far beyond keyword overlap.

Why this matters
Real-world audio search queries routinely include negation ("no traffic noise"), temporal constraints ("thunder before rain"), and duration expectations ("continuous alarm for 5 seconds"). A retrieval system that only matches keywords will systematically fail on these queries — yet no existing benchmark tests for this failure.
Chapter 2

Five Forms of Reasoning

ReasonAudio defines five task types, each isolating a distinct reasoning capability that current models lack: Negation, Order, Overlap, Duration, and Mix.

Why this matters
The five-task decomposition is not arbitrary. Each task targets a specific reasoning primitive that humans handle effortlessly but that current embedding spaces cannot represent. Negation — the hardest task — is particularly revealing: models actively match the sounds they are told to exclude, suggesting they process the query as a bag-of-words rather than a logical statement.
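
The bag-of-words failure mode described above can be illustrated with a toy scorer. This is a deliberately naive sketch, not any evaluated model: `bag_of_words_score` and the sound labels are hypothetical, and real systems compare dense embeddings rather than word sets.

```python
def bag_of_words_score(query, clip_sounds):
    """Count query words that name a sound present in the clip."""
    tokens = set(query.lower().split())
    return len(tokens & set(clip_sounds))

query = "dog barking without siren"
clip_ok = {"dog"}            # satisfies the query
clip_bad = {"dog", "siren"}  # contains the excluded sound

score_ok = bag_of_words_score(query, clip_ok)    # 1
score_bad = bag_of_words_score(query, clip_bad)  # 2
```

Because "siren" appears in the query, the clip that violates the negation receives the higher score, which is the pattern the negation task is designed to expose.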
Chapter 3

Building the Benchmark

200 atomic sounds. 10,000 composite clips. 1,000 reasoning-intensive queries. A three-stage pipeline designed for deterministic quality control.

Construction Pipeline

Quality Control

Several design choices ensure benchmark reliability:

Audio Clip Composition

Why this matters
The synthetic construction ensures perfect ground truth — every query-audio relevance judgment is provably correct because it is derived from metadata, not human opinion. This eliminates the annotation noise that plagues other benchmarks and makes it possible to attribute model failures to reasoning deficits rather than data quality.
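
As a rough sketch of how relevance could be derived from metadata, the snippet below checks logical and temporal constraints against a clip's event list. The `Event` schema, the predicate names, and the labels are assumptions made for illustration; the benchmark's actual metadata format is not described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    label: str
    start: float  # seconds
    end: float    # seconds

def events_of(clip, label):
    return [e for e in clip if e.label == label]

def absent(clip, label):
    """Negation: no event with this label occurs in the clip."""
    return not events_of(clip, label)

def before(clip, first, second):
    """Order: some `first` event ends no later than some `second` event starts."""
    return any(a.end <= b.start
               for a in events_of(clip, first) for b in events_of(clip, second))

def overlaps(clip, x, y):
    """Overlap: some `x` event and some `y` event share a time interval."""
    return any(a.start < b.end and b.start < a.end
               for a in events_of(clip, x) for b in events_of(clip, y))

def lasts_at_least(clip, label, seconds):
    """Duration: some event with this label runs for at least `seconds`."""
    return any(e.end - e.start >= seconds for e in events_of(clip, label))

# "A dog barks before a car horn, with no siren."
clip = [Event("dog", 0.0, 2.0), Event("car_horn", 3.0, 4.0)]
relevant = before(clip, "dog", "car_horn") and absent(clip, "siren")
```

Each judgment is a deterministic function of the event list, so no human annotator is needed to decide whether a clip satisfies a query.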
Chapter 4

The Model Landscape

Three paradigms, ten models, one dismal result. From two-stage pipelines to billion-parameter MLLMs, no approach solves reasoning-intensive retrieval.

The paper evaluates ten models across three families:

Two-Stage: caption + text retrieval
CLIP-Style: contrastive embeddings
MLLM-Based: unified multimodal embeddings

Paradigm Comparison

Why this matters
The two-stage pipeline's bottleneck is audio captioning — replacing Qwen2-Audio with Step-Audio boosted accuracy 14x, while swapping the text retriever had minimal effect. This means the problem is upstream: captions don't capture temporal structure or negation. Among CLIP-style models, only CLAP shows any signal; the others are near-zero.
Chapter 5

The Accuracy Crisis

20.1%. That is the best average Acc@1 any model achieves. On the hardest task, negation, six of ten models score 0.0%.

20.1%: best average Acc@1 (OmniEmbed-7B)
0.0%: negation Acc@1 (6 of 10 models)
29.5%: best single-task Acc@1 (Overlap, LCO-7B)
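
For readers unfamiliar with the two metrics reported in this chapter, here is a small sketch of how Acc@1 and nDCG@10 are commonly computed. It assumes binary relevance and the standard log2 discount; the exact definitions used in the paper are not given here, and the query and clip ids are hypothetical.

```python
import math

def acc_at_1(rankings, relevant):
    """Fraction of queries whose top-ranked clip is relevant."""
    hits = sum(1 for q, ranked in rankings.items()
               if ranked and ranked[0] in relevant[q])
    return hits / len(rankings)

def ndcg_at_k(ranked, relevant_ids, k=10):
    """Binary-relevance nDCG@k for a single query."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

rankings = {"q1": ["a", "b", "c"], "q2": ["b", "a", "c"]}
relevant = {"q1": {"a"}, "q2": {"a"}}

acc = acc_at_1(rankings, relevant)                    # 0.5
ndcg_q2 = ndcg_at_k(rankings["q2"], relevant["q2"])   # 1 / log2(3)
```

Acc@1 only rewards a relevant clip in first place, while nDCG@10 gives partial credit when the relevant clip appears lower in the top ten.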

Full Results Heatmap

nDCG@10 — Ranking Quality

Does Bigger Mean Better?

Why this matters
Scaling model size yields only moderate gains: LCO-Embedding-7B improves 5.2% over the 3B variant. No single model dominates across tasks. The failure is not about capacity — it is structural. Current training paradigms (contrastive learning) optimize for similarity matching, which is fundamentally misaligned with logical reasoning requirements.
Chapter 6

Reasoning vs. Matching

When you control for sound matching, do models still reason? A two-option test reveals the ugly truth.

The two-option multiple-choice setup controls for sound matching by ensuring both candidates contain the same target sounds. The negative candidate violates the query's logical constraints only — for instance, including a hailstorm sound when the query says "without hailstorm." With 120 such questions per task, this setup isolates reasoning ability from acoustic matching.
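
The scoring of such a test can be sketched in a few lines. This assumes a question counts as correct only when the valid clip receives a strictly higher score than the violating clip; the similarity values are made up for illustration and do not come from the paper.

```python
def two_option_accuracy(score_pairs):
    """score_pairs holds (score of valid clip, score of violating clip) per question."""
    correct = sum(1 for pos, neg in score_pairs if pos > neg)
    return correct / len(score_pairs)

# Hypothetical similarity scores for four questions.
pairs = [(0.61, 0.70), (0.55, 0.58), (0.72, 0.69), (0.40, 0.47)]
accuracy = two_option_accuracy(pairs)  # 0.25, i.e. 1 of 4 correct

CHANCE_LEVEL = 0.5  # expected accuracy of random guessing between two options
```

An accuracy below `CHANCE_LEVEL` means the model prefers the violating clip more often than not, which is stronger evidence than simply ignoring the constraint.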

Two-Option Reasoning Test

Retrieval vs. Reasoning Gap

Why this matters
The below-random performance on negation (27.5% vs. 50% chance) means the model is not just ignoring the negation — it is systematically matching the excluded sound. This is a direct consequence of contrastive training, which pushes representations of co-mentioned items closer together regardless of logical relationship. The model has learned "hailstorm is relevant to this query" rather than "hailstorm must be absent."
Chapter 7

The Alignment Gap

The t-SNE plot tells the whole story: positive and negative audio samples occupy the same region of embedding space. The model literally cannot tell them apart.

The authors visualized text and audio embeddings using t-SNE, sampling five queries. For each query, positive audio samples (matching the query) and negative samples (violating constraints) are plotted together with the query embedding. The results reveal substantial misalignment: positive and negative samples for the same query land in the same region of the embedding space.

This misalignment directly explains the below-random performance on negation tasks: the embedding space has no mechanism to encode logical constraints.

Why this matters
The embedding misalignment is not a bug in any single model — it is a consequence of the training paradigm. Contrastive learning optimizes for similarity: items mentioned together should be close in embedding space. But negation requires dissimilarity within similarity: the excluded item should be pushed away even though it is co-mentioned. Current loss functions have no mechanism for this distinction.
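
The point about loss functions can be illustrated with a toy version of a contrastive objective. The sketch assumes an InfoNCE-style loss with a temperature of 0.07; the text does not say which loss each evaluated model uses, and the similarity values are hypothetical.

```python
import math

def info_nce(sims, positive_index, temperature=0.07):
    """Cross-entropy over similarities: -log softmax(sims / T)[positive]."""
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]

# Query: "dog barking without siren". Index 0 is the valid clip.
sim_valid = 0.60      # clip with dog only
sim_violator = 0.65   # clip with dog and siren
sim_unrelated = 0.10  # clip with neither sound

# Batch where the violating clip never appears as a negative.
loss_random_only = info_nce([sim_valid, sim_unrelated], 0)

# Batch where the violating clip is included as a negative.
loss_with_violator = info_nce([sim_valid, sim_violator, sim_unrelated], 0)
```

The loss sees only similarity scores and which item is labeled positive. In this sketch the violating clip is penalized only when it happens to be in the batch as a negative; otherwise its high similarity to the query goes unchallenged.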
ReasonAudio exposes a fundamental gap between matching and reasoning in multimodal retrieval. Current models can find sounds — but they cannot think about them. Bridging this gap will require new training paradigms that go beyond contrastive similarity and encode logical constraints directly into the embedding space.