An Interactive Reading of

ReasonAudio: A Benchmark for
Reasoning Beyond Matching
in Text-Audio Retrieval

Honglei Zhang, Yuting Chen, Chenpeng Hu, Siyue Zhang & Yilei Shi
Nanjing University · Northwestern Polytechnical University · Nanyang Technological University · May 2026 · arXiv:2605.03361

The paper, in plain English

Imagine searching a massive audio library for "a clip where a dog barks before a car horn, with no siren." Today's audio search engines would find dog barks and car horns easily — but they'd ignore the ordering (dog first, horn second) and miss the negation (no siren). They match sounds; they don't reason about them.

The authors built ReasonAudio — the first benchmark that tests whether retrieval models can handle logical constraints like negation, temporal ordering, overlap, and duration. They synthesized 10,000 audio clips from 200 atomic sounds and generated 1,000 search queries designed to require reasoning, not just matching. Think of it as giving audio search engines a standardized test where every question has a logical twist.

The result: every model struggles. The best performer, OmniEmbed-7B, manages only 20.1% average accuracy, and on a two-option negation test it scores below random chance. Multimodal large language models, despite strong reasoning in text, lose that ability when fine-tuned for retrieval. The paper exposes a fundamental gap: current systems match sounds but cannot think about them.

I

Reasoning, Not Matching

Current benchmarks test whether a model can match "dog barking" text to dog-barking audio. ReasonAudio asks: can the model handle "dog barking before a car horn, without a siren"?

II

The Backbone Paradox

Multimodal LLMs like Qwen2 understand negation and ordering in text. But when contrastive fine-tuning converts them into retrieval models, that reasoning capability vanishes — the training emphasizes similarity matching over logic.

III

Embeddings Don't Encode Logic

t-SNE visualization reveals that positive and negative audio samples are virtually indistinguishable in the shared embedding space. The model has no way to represent "without X" — it can only measure similarity.

Chapter 1

The Matching Trap

Modern audio retrieval systems are remarkably good at finding a dog bark when you type "dog barking." But real queries demand more than matching — they demand reasoning.

In plain English

Think of Spotify's search bar. Type "upbeat pop," and it finds tracks tagged "upbeat" and "pop" — pure keyword matching. Now imagine asking: "Find me a song that starts with acoustic guitar, then adds drums, but doesn't use synthesizers." Spotify's search would fail completely. It can match tags; it cannot reason about sequence, co-occurrence, or absence.

This is exactly the gap in audio retrieval research. Every major benchmark — AudioCaps, Clotho, WavText5K — tests whether a model can match a text description to an audio clip. None test whether the model can handle logical constraints like "before," "without," "at the same time as," or "lasting at least 3 seconds."

ReasonAudio was built to expose this blind spot. The five reasoning tasks below reveal just how much current systems depend on surface-level matching.

Prior text-audio retrieval benchmarks, such as the three below, all focus on semantic matching:

AudioCaps (4.9K samples): audio caption retrieval — match a natural-language description to its audio clip.
Clotho (5K samples): similar caption-matching paradigm with human-written descriptions.
AudioSet (0.5K for retrieval): sound classification — identify which labels apply to an audio clip.

None of these benchmarks require the model to understand that "a clip with dog sounds but no cat sounds" means the cat must be absent. The retrieval task is formulated as a similarity problem:

$$f : Q \times D \rightarrow \mathbb{R}, \qquad f(q_i, d^+) > f(q_i, d^-)$$

where $Q = \{q_1, \ldots, q_n\}$ are text queries, $D = \{d_1, \ldots, d_m\}$ is the audio corpus, and the scoring function $f(q_i, d_j)$ maps each query-audio pair to a real-valued relevance score. The goal is to rank relevant audio $d^+$ above irrelevant $d^-$.

In semantic matching benchmarks, relevance is determined by surface-level content overlap. In ReasonAudio, relevance depends on satisfying logical and temporal constraints that go far beyond keyword overlap.
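
The scoring formulation above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the scoring function $f$ and made-up three-dimensional embeddings; the function names `cosine` and `rank_corpus` are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_corpus(query_vec, corpus_vecs):
    """Return corpus indices sorted by descending score f(q, d_j)."""
    scores = [cosine(query_vec, d) for d in corpus_vecs]
    return sorted(range(len(corpus_vecs)), key=lambda j: scores[j], reverse=True)

# Toy embeddings, invented for illustration only.
query = [1.0, 0.0, 0.0]
corpus = [
    [0.0, 1.0, 0.0],  # d_0: unrelated to the query
    [0.9, 0.1, 0.0],  # d_1: very close to the query
    [0.5, 0.5, 0.0],  # d_2: partially related
]
ranking = rank_corpus(query, corpus)
print(ranking)  # indices from most to least similar
```

Note that the score depends only on how close two vectors are. Nothing in it can express that a sound named in the query must be absent from the clip.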

Why this matters

Real-world audio search queries routinely include negation ("no traffic noise"), temporal constraints ("thunder before rain"), and duration expectations ("continuous alarm for 5 seconds"). A retrieval system that only matches keywords will systematically fail on these queries — yet no existing benchmark tests for this failure.


Chapter 2

Five Forms of Reasoning

ReasonAudio defines five task types, each isolating a distinct reasoning capability that current models lack.

Task Explorer

The five task types:

Negation (≠)
Order (→)
Overlap (∩)
Duration (⏱)
Mix (Σ)

Why this matters

The five-task decomposition is not arbitrary. Each task targets a specific reasoning primitive that humans handle effortlessly but that current embedding spaces cannot represent. Negation — the hardest task — is particularly revealing: models actively match the sounds they are told to exclude, suggesting they process the query as a bag-of-words rather than a logical statement.
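
The bag-of-words behavior described above can be reproduced with a toy scorer. This is a hypothetical sketch, not any of the evaluated models: it scores a clip by counting tokens shared with the query, and the clip descriptions are invented for illustration.

```python
def bag_of_words(text):
    """Lowercased set of tokens; word order and logical words carry no meaning here."""
    return set(text.lower().replace(",", " ").split())

def overlap_score(query, description):
    """Count tokens shared between the query and a clip description."""
    return len(bag_of_words(query) & bag_of_words(description))

query = "dog barking without siren"
clip_ok = "dog barking"         # satisfies the negation
clip_bad = "dog barking siren"  # violates it: the siren is present

score_ok = overlap_score(query, clip_ok)
score_bad = overlap_score(query, clip_bad)
print(score_ok, score_bad)
```

Because "siren" appears in the query, the clip that contains a siren shares more tokens and is ranked higher, which is the opposite of what the query asks for.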


Chapter 3

Building the Benchmark

200 atomic sounds. 10,000 composite clips. 1,000 reasoning-intensive queries. A three-stage pipeline designed for deterministic quality control.

In plain English

Think of ReasonAudio's construction like building a standardized exam. First, you curate a pool of 200 clean, unambiguous "atomic" sounds — a single dog bark, a single bell ring, a single thunderclap. No sound is semantically ambiguous: "music" and "violin" are never both included, because that would confuse the grading.

Next, you combine these atoms into 10,000 "composite" audio clips — layering 2 to 8 sounds in precise temporal arrangements. Some are sequential (A then B then C), some overlapping (A and B at the same time). The exact timestamps and labels are recorded in metadata.
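
One way to picture the metadata recorded for each composite clip is a list of labeled time intervals. The sketch below is an assumption about the shape of that metadata, not the paper's actual format; the `Event` fields, the helper names, and the sound labels are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    label: str    # atomic sound name
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip

def sequential(labels, durations, gap=0.0):
    """Lay events end to end: A then B then C."""
    events, t = [], 0.0
    for label, dur in zip(labels, durations):
        events.append(Event(label, t, t + dur))
        t += dur + gap
    return events

def overlapping(labels, durations, start=0.0):
    """Start all events at the same time so they play simultaneously."""
    return [Event(label, start, start + dur) for label, dur in zip(labels, durations)]

seq_clip = sequential(["dog_bark", "car_horn", "bell"], [2.0, 1.5, 3.0])
ovl_clip = overlapping(["rain", "thunder"], [4.0, 2.0])
for event in seq_clip + ovl_clip:
    print(event)
```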

Finally, you generate queries by filling in templates: "Find a clip where [sound A] plays before [sound B], without [sound C], and [sound A] lasts at least [N] seconds." Relevance is checked deterministically against the metadata — no human annotation disagreements.

[Figure: Construction Pipeline]

Quality Control

Several design choices ensure benchmark reliability:

No semantically similar sounds: atomic sounds are chosen to avoid hierarchical or overlapping categories (e.g., no "music" vs. "violin").
Deterministic annotation: a program labels query-audio pairs as relevant only if the audio's metadata satisfies all constraints in the query.
Manual verification: 50 random queries per task were manually reviewed to confirm annotation correctness.
Scalable pipeline: new atomic sounds or templates can be added without changing the pipeline.
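
The deterministic annotation step can be sketched as a set of predicates evaluated against a clip's metadata. This is a minimal illustration under stated assumptions: metadata is taken to be a list of `(label, start, end)` tuples, and the exact semantics of "before" and "overlap" used here are guesses, since the source does not spell them out. All function names are hypothetical.

```python
def spans(clip, label):
    """All (start, end) intervals for a label in a clip's metadata."""
    return [(s, e) for (name, s, e) in clip if name == label]

def has(clip, label):
    """True if the label occurs anywhere in the clip."""
    return bool(spans(clip, label))

def before(clip, a, b):
    """True if some occurrence of a ends no later than some occurrence of b starts."""
    return any(ea <= sb for (_, ea) in spans(clip, a) for (sb, _) in spans(clip, b))

def overlaps(clip, a, b):
    """True if some occurrence of a shares time with some occurrence of b."""
    return any(sa < eb and sb < ea
               for (sa, ea) in spans(clip, a) for (sb, eb) in spans(clip, b))

def lasts_at_least(clip, a, seconds):
    """True if some occurrence of a runs for at least the given duration."""
    return any(e - s >= seconds for (s, e) in spans(clip, a))

def is_relevant(clip, constraints):
    """A clip is relevant only if every constraint in the query holds."""
    return all(check(clip) for check in constraints)

# Query: dog_bark before car_horn, without siren, and dog_bark lasts at least 2 seconds.
constraints = [
    lambda c: before(c, "dog_bark", "car_horn"),
    lambda c: not has(c, "siren"),
    lambda c: lasts_at_least(c, "dog_bark", 2.0),
]
clip_pos = [("dog_bark", 0.0, 2.5), ("car_horn", 3.0, 4.0)]
clip_neg = [("dog_bark", 0.0, 2.5), ("car_horn", 3.0, 4.0), ("siren", 1.0, 5.0)]
print(is_relevant(clip_pos, constraints), is_relevant(clip_neg, constraints))
```

Because every judgment is a pure function of the metadata, rerunning the labeler always yields the same relevance labels.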

[Figure: Audio Clip Composition]

Why this matters

The synthetic construction ensures perfect ground truth — every query-audio relevance judgment is provably correct because it is derived from metadata, not human opinion. This eliminates the annotation noise that plagues other benchmarks and makes it possible to attribute model failures to reasoning deficits rather than data quality.


Chapter 4

The Model Landscape

Three paradigms, ten models, one dismal result. From two-stage pipelines to billion-parameter MLLMs, no approach solves reasoning-intensive retrieval.

In plain English

Think of the three model paradigms like three ways to search a foreign-language library. The two-stage approach: hire a translator to convert every book into English, then search the English summaries. It works for matching but fails on reasoning because the translator loses temporal details.

The CLIP-style approach: build a bilingual dictionary that maps English words and foreign words into the same "meaning space." Fast and direct, but the dictionary only captures surface similarity — it has no entries for "not" or "before."

The MLLM-based approach: start with a polymath scholar who speaks both languages and understands logic, then compress their knowledge into a lookup table. The compression destroys the reasoning ability.

The paper evaluates three families of models:

Two-stage: audio is captioned (Qwen2-Audio or Step-Audio), then text retrieval is performed (BGE-M3 or Qwen3-Embedding).
CLIP-style: contrastive learning maps audio and text into a shared embedding space (CLAP, AudioCLIP, Wav2CLIP).
MLLM-based: pretrained multimodal LLMs are fine-tuned as unified embedding models (LCO-Embedding, e5-omni, OmniEmbed).

[Figure: Model Comparison. 10 models across 3 paradigms: Two-Stage (caption + text retrieval), CLIP-Style (contrastive embeddings), MLLM-Based (unified multimodal embeddings)]

[Figure: Paradigm Comparison]

Why this matters

The two-stage pipeline's bottleneck is audio captioning — replacing Qwen2-Audio with Step-Audio boosted accuracy 14x, while swapping the text retriever had minimal effect. This means the problem is upstream: captions don't capture temporal structure or negation. Among CLIP-style models, only CLAP shows any signal; the others are near-zero.


Chapter 5

The Accuracy Crisis

20.1%. That is the best average Acc@1 any model achieves. On some individual tasks the numbers are far worse: on negation, 6 of 10 models score 0.0%.

Best average Acc@1: 20.1% (OmniEmbed-7B)
Negation Acc@1: 0.0% (6 of 10 models)
Best single-task Acc@1: 29.5% (Overlap, LCO-7B)

[Figure: Full Results Heatmap]

[Figure: nDCG@10, Ranking Quality]

[Figure: Does Bigger Mean Better?]

Why this matters

Scaling model size yields only moderate gains: LCO-Embedding-7B improves 5.2% over the 3B variant. No single model dominates across tasks. The failure is not about capacity — it is structural. Current training paradigms (contrastive learning) optimize for similarity matching, which is fundamentally misaligned with logical reasoning requirements.
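
For readers unfamiliar with the two metrics named in this chapter, the sketch below shows how Acc@1 and nDCG@10 are commonly computed for a single query. It assumes binary relevance and the standard logarithmic discount; the paper's exact metric implementation is not given in this summary, and the clip identifiers are invented.

```python
import math

def acc_at_1(ranking, relevant):
    """1.0 if the top-ranked clip is relevant, else 0.0."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0

def ndcg_at_k(ranking, relevant, k=10):
    """nDCG@k with binary relevance: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, clip in enumerate(ranking[:k]) if clip in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranking for one query: relevant clips sit at ranks 2 and 4.
ranking = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
acc = acc_at_1(ranking, relevant)
ndcg = ndcg_at_k(ranking, relevant)
print(acc, round(ndcg, 4))
```

Acc@1 rewards only a correct top result, while nDCG@10 gives partial credit when relevant clips appear lower in the top ten. Benchmark-level scores are averages of these per-query values.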


Chapter 6

Reasoning vs. Matching

When you control for sound matching, do models still reason? A two-option test reveals the ugly truth.

In plain English

Suppose you ask an audio retrieval model: "Find a clip with plastic bag and goose sounds, but without hailstorm." If the model fails, there are two possible reasons: (1) it cannot match the sounds correctly, or (2) it matches them but cannot process the "without" logic.

To isolate the reasoning failure, the authors designed a clever test: give the model exactly two audio clips. Both contain the correct sounds (plastic bag + goose). The negative clip also contains hailstorm — which the query says must be absent. Now the task is pure reasoning: which clip satisfies the logical constraint?

The result: on negation, OmniEmbed-7B scores only 27.5% — below random chance (50%). On duration, it scores 50.8% — essentially random. The model is not just bad at reasoning; on negation, it is worse than a coin flip.

The two-option multiple-choice setup controls for sound matching by ensuring both candidates contain the same target sounds. The negative candidate violates the query's logical constraints only — for instance, including a hailstorm sound when the query says "without hailstorm." With 120 such questions per task, this setup isolates reasoning ability from acoustic matching.
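
The two-option protocol, and how far 27.5% sits from chance, can be sketched as follows. This is an illustration under assumptions: the scorer is a toy word-overlap function rather than a real model, the count of 33 correct answers is derived from 27.5% of 120 questions, and the binomial calculation assumes independent questions with a 50% guessing rate.

```python
from math import comb

def two_option_accuracy(score, questions):
    """Fraction of questions where the positive clip outscores the negative one."""
    correct = sum(1 for q, pos, neg in questions if score(q, pos) > score(q, neg))
    return correct / len(questions)

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p): chance of k or fewer correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def toy_score(query, clip_labels):
    """Hypothetical scorer: counts query words that name a sound in the clip."""
    return sum(1 for word in query.split() if word in clip_labels)

# One toy question: both clips contain the target sound, the negative adds the excluded one.
questions = [("goose without hailstorm", {"goose"}, {"goose", "hailstorm"})]
acc = two_option_accuracy(toy_score, questions)

n = 120       # questions per task, as stated in the text
correct = 33  # 33 / 120 = 27.5%
p_at_most = binom_cdf(correct, n)
print(acc, p_at_most)
```

Under these assumptions, getting 33 or fewer of 120 right by pure guessing is extremely unlikely, which supports reading the negation result as a systematic preference for the wrong clip rather than noise.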

[Figure: Two-Option Reasoning Test]

[Figure: Retrieval vs. Reasoning Gap]

[Figure: What If Models Could Reason? Slider from no reasoning (current) to strong reasoning]

Why this matters

The below-random performance on negation (27.5% vs. 50% chance) means the model is not just ignoring the negation — it is systematically matching the excluded sound. This is a direct consequence of contrastive training, which pushes representations of co-mentioned items closer together regardless of logical relationship. The model has learned "hailstorm is relevant to this query" rather than "hailstorm must be absent."


Chapter 7

The Alignment Gap

The t-SNE plot tells the whole story: positive and negative audio samples occupy the same region of embedding space. The model literally cannot tell them apart.

In plain English

Imagine sorting books on a shelf. A well-organized shelf clusters all the science fiction together, all the mysteries together, and keeps them clearly separated. That is what a well-aligned embedding space looks like — audio clips that match a query cluster together, and irrelevant clips are far away.

Now imagine dumping all the books in a pile. Science fiction is mixed with mystery is mixed with cookbooks. You cannot find anything by browsing — you would have to check each book individually. This is the current state of text-audio embedding spaces for reasoning queries.

The paper's t-SNE plots confirm that real models sit at the "dumped in a pile" end of this spectrum, far from the "well-organized shelf."

The authors visualized text and audio embeddings using t-SNE, sampling five queries. For each query, positive audio samples (matching the query) and negative samples (violating constraints) are plotted together with the query embedding. The results reveal substantial misalignment:

Positive samples fail to form compact clusters — they are highly dispersed.
Positive and negative samples are poorly separated — they overlap in the embedding space.
In several cases, negative samples lie closer to the query embedding than positive ones.

This misalignment directly explains the below-random performance on negation tasks: the embedding space has no mechanism to encode logical constraints.

[Figure: Embedding Alignment Explorer, ranging from the current misaligned state to an ideal well-aligned state]

Why this matters

The embedding misalignment is not a bug in any single model — it is a consequence of the training paradigm. Contrastive learning optimizes for similarity: items mentioned together should be close in embedding space. But negation requires dissimilarity within similarity: the excluded item should be pushed away even though it is co-mentioned. Current loss functions have no mechanism for this distinction.
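
For reference, one widely used contrastive objective is the InfoNCE loss, written here with the scoring function $f$ from Chapter 1. This is an illustrative assumption: the source does not state which exact loss each evaluated model was trained with, and the temperature $\tau$, the batch size $B$, and the batch of clips $d_1, \ldots, d_B$ (which includes the positive $d_i^+$) are symbols introduced here.

$$\mathcal{L}_i = -\log \frac{\exp\left(f(q_i, d_i^+)/\tau\right)}{\sum_{j=1}^{B} \exp\left(f(q_i, d_j)/\tau\right)}$$

Every term depends only on similarity scores between a query and whole clips. A clip containing an excluded sound is pushed away only if it happens to appear among the in-batch clips $d_j$; nothing in the objective itself marks a sound as "must be absent."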

ReasonAudio exposes a fundamental gap between matching and reasoning in multimodal retrieval. Current models can find sounds — but they cannot think about them. Bridging this gap will require new training paradigms that go beyond contrastive similarity and encode logical constraints directly into the embedding space.
