An Interactive Reading of

When Language Overwrites Vision

Over-Alignment and Geometric Debiasing in Vision-Language Models

Harshvardhan Saini, Samyak Jha, Yiming Tang & Dianbo Liu
IIT Dhanbad & National University of Singapore · May 2026 · arXiv:2605.08245

The paper, in plain English

Vision-Language Models — the AI systems that look at an image and describe what they see — have an embarrassing habit: they confidently describe objects that aren’t there. Show them a dining table with no person nearby, and they’ll tell you about the person they expect to see. The problem isn’t that the camera is broken. The problem is that the model’s language side is overriding its visual side.

The authors trace this failure to a geometric root cause: during training, the model is forced to squeeze rich visual data into the narrow shape of language. To bridge this gap, the model takes a shortcut — it injects high-frequency text patterns into the top dimensions of its visual representations. These patterns carry no information about the actual image; they serve purely as a bridge to make the math work. The result is a universal linguistic bias that lives in the top principal components of the embedding space, stable across every dataset tested.

The fix is deceptively simple: project out those top components, and the model stops hallucinating. On LLaVA-1.5, the method cuts hallucination rates by up to 27%. On Qwen2.5-VL, it drops them by 21%. No retraining required — just a single matrix subtraction at inference time, adding zero computational overhead. When baked into fine-tuning, it improves caption quality on specialist domains like satellite imagery. The core insight: sometimes the best way to make AI see better is to stop it from listening to itself.

I

Over-Alignment

Forcing visual embeddings into the text manifold injects a statistical linguistic bias that systematically overshadows fine-grained visual evidence.

II

Universal Text Subspace

The bias concentrates in the top principal components of a dataset-agnostic text manifold — the same directions dominate regardless of which captions you use.

III

Geometric Debiasing

Projecting out those top PCs from vision embeddings slashes hallucinations by up to 27% with no retraining and zero computational overhead.

~ 20 minutes · 6 chapters · 5 interactive simulations

CHAPTER 1

Inside the VLM

A Vision-Language Model has three moving parts: eyes, a translator, and a mouth. The problem lives in the translator — and it’s subtler than you’d think.

In plain English

Think of a VLM like a person who speaks only English, trying to understand a Japanese newspaper. They need a translator — someone who converts the visual symbols (kanji) into words they can reason about. In a VLM, the “eyes” are a vision encoder (like CLIP’s ViT), the “translator” is a cross-modal projector (usually an MLP), and the “mouth” is a large language model (like LLaMA or Qwen).

The trouble is that the visual world and the linguistic world naturally live in different parts of mathematical space. Pictures spread out in a wide, high-dimensional cone. Words cluster in a narrow band. The translator has to squeeze the one into the other — and it does so by exploiting a statistical shortcut.

Decoder-based VLMs couple a visual encoder with a pretrained language model via a cross-modal projector. The central challenge is the modality gap: visual and linguistic representations naturally occupy disjoint manifolds within the shared latent space.

Because the LLM decoder uses scaled dot-product attention, both modalities must share a consistent representation space for the dot product $\mathbf{Q} \cdot \mathbf{K}^T$ to yield meaningful similarity scores. This creates a mathematical necessity for some degree of cross-modal alignment.

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\mathbf{V}$$

Concretely, when visual queries are scored against textual keys, both must sit in the same dimensionally consistent coordinate system, so the gap cannot simply be left open.
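The attention formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration on random toy arrays, not the decoder of any particular VLM; the shapes and the helper name `scaled_dot_product_attention` are assumptions made for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for 2-D arrays of shape (tokens, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # toy stand-in for 3 query tokens
K = rng.normal(size=(5, 8))   # toy stand-in for 5 key tokens
V = rng.normal(size=(5, 8))
out, attn_weights = scaled_dot_product_attention(Q, K, V)
```

The scores are plain dot products, which is why queries and keys from different modalities only produce meaningful weights when they share a coordinate system.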

The question is: how much alignment is too much?

VLM Architecture

The diagram shows the three components (vision encoder, cross-modal projector, language model), each with its role and failure mode.

Why this matters

The modality gap isn’t a bug — it’s a mathematical consequence of training vision and language encoders separately. The problem begins when the cross-modal projector aggressively closes this gap by injecting statistical textual patterns into the visual representations. What starts as a necessary bridge becomes a structural shortcut.

CHAPTER 2

The Universal Text Subspace

Every text dataset — whether it’s COCO captions, chart descriptions, or document Q&A — carves out nearly the same top directions in embedding space. That’s the linguistic bias, and it’s everywhere.

In plain English

Imagine you computed the “average face” from a photo database — then found that the same average face appears whether you used passport photos, Instagram selfies, or LinkedIn headshots. That’s what the authors discovered about text: the top directions of variation in language are universal.

They computed Principal Component Analysis (PCA) on caption embeddings from COCO, ChartQA, DocVQA, and InfographicVQA — four wildly different datasets. The top principal components? Nearly identical. A Frobenius norm similarity above 0.84 across all pairs.

This means the bias isn’t tied to any particular dataset. It’s baked into how language works statistically. And because VLM pretraining forces visual embeddings into these dominant directions, the bias becomes inescapable.

To investigate the geometry of the shared space, the authors first rigorously define the text manifold. They compute principal components from caption embeddings drawn from pretraining datasets:

Given text token embeddings $\{\mathbf{t}_1, \dots, \mathbf{t}_N\}$, compute the covariance matrix and extract the top-$K$ principal components $\{\vec{k}_1, \dots, \vec{k}_K\}$, forming the text-induced PCA basis.
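A minimal sketch of that step, assuming the embeddings are stacked as rows of a matrix `T`. The function name `text_pca_basis` and the toy data are hypothetical; the paper's own pipeline may differ (for example in how tokens are pooled). The SVD of the centered data is used here because its right singular vectors are the eigenvectors of the covariance matrix.

```python
import numpy as np

def text_pca_basis(T, K):
    """Return (mean, top-K principal directions) for embeddings T of shape (N, d)."""
    m = T.mean(axis=0)
    # Rows of Vt are the covariance eigenvectors, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(T - m, full_matrices=False)
    return m, Vt[:K]

rng = np.random.default_rng(0)
# Toy stand-in for caption token embeddings: N=200, d=16,
# with axis 0 given much larger variance than the rest.
T = rng.normal(size=(200, 16))
T[:, 0] *= 10.0

m, pcs = text_pca_basis(T, K=3)   # pcs has shape (3, 16), rows orthonormal
```

In this toy case the first principal direction lines up with axis 0, the direction that was artificially inflated.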

The critical finding: the top PCs form a consistently common manifold across all evaluated datasets. This means standard pretraining injects a universal statistical linguistic bias rather than task-specific semantic structures.

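To verify stability, the authors measure the overlap between the text manifolds of different datasets with a normalized Frobenius norm:

$$\text{Similarity} = \frac{\|\mathbf{U}_i^T \mathbf{U}_j\|_F}{\sqrt{\min(r_i, r_j)}}$$

Here $\mathbf{U}_i$ is read as the matrix whose $r_i$ orthonormal columns are the top principal components of dataset $i$; under that reading the score is 1 when the two subspaces coincide and 0 when they are orthogonal.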
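The similarity score can be checked on hand-built subspaces. This sketch assumes each basis is stored as a matrix with orthonormal columns; the function name `subspace_similarity` and the three toy bases are illustrative only.

```python
import numpy as np

def subspace_similarity(U_i, U_j):
    """||U_i^T U_j||_F / sqrt(min(r_i, r_j)); columns are assumed orthonormal."""
    r = min(U_i.shape[1], U_j.shape[1])
    return np.linalg.norm(U_i.T @ U_j, ord="fro") / np.sqrt(r)

eye = np.eye(6)
U_a = eye[:, 0:2]   # span{e0, e1}
U_b = eye[:, 2:4]   # span{e2, e3}, orthogonal to U_a
U_c = eye[:, 1:3]   # span{e1, e2}, shares one direction with U_a

same = subspace_similarity(U_a, U_a)       # 1.0
disjoint = subspace_similarity(U_a, U_b)   # 0.0
partial = subspace_similarity(U_a, U_c)    # 1/sqrt(2), about 0.707
```

Read against this scale, the reported values above 0.84 indicate that the top directions of the four datasets overlap heavily.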

Cross-Dataset Text Manifold Stability

The chart shows how similar the top principal components are across different text datasets.

Why this matters

If the linguistic bias were dataset-specific, you could fix it by changing your training data. But because it’s universal — the same top directions dominate regardless of what captions you use — the problem is structural. The only fix is to remove those directions from the visual representations.

CHAPTER 3

When Alignment Overwrites Vision

The model doesn’t just bridge the gap between vision and language. It overwrites visual detail with statistical text patterns. The evidence is in the layer-by-layer alignment scores.

In plain English

Think of a courtroom sketch artist who’s been told that most defendants wear suits. After a while, the artist starts drawing suits on everyone — even the ones in t-shirts. The “suit” is the statistical text prior; the actual clothing is the visual evidence. The artist (the model) draws what it expects, not what it sees.

The paper supports this by tracking alignment scores layer by layer through the LLM decoder. At each layer, the authors measure how much of the vision embedding lies in the text manifold. In early-to-middle layers, the score shoots up: the network is aggressively forcing vision tokens to overlap with text. But here’s the twist: when they remove the top few principal components from the text manifold, the alignment drops sharply.

This confirms the over-alignment hypothesis: the top PCs are hijacked to satisfy the alignment objective. They carry no real visual information.

The authors define the text-aligned representation of a vision embedding $\vec{v}^{(l)}$ at layer $l$ as its projection onto the text PCA basis:

$$\vec{z}^{(l)} = \sum_{i=1}^{K} \left(\left(\vec{v}^{(l)} - \vec{m}\right) \cdot \vec{k}_i\right) \vec{k}_i$$

where $\vec{m}$ is the mean of the text manifold. The alignment score is the ratio of the squared projected norm to the squared centered norm, averaged over the $M$ vision embeddings at that layer:

$$\text{Align}^{(l)} = \frac{1}{M} \sum_{j=1}^{M} \frac{\|\vec{z}_j^{(l)}\|^2}{\|\vec{v}_j^{(l)} - \vec{m}\|^2}$$

As shown in Figure 2a of the paper, full vision embeddings exhibit aggressively high alignment in early-to-middle layers. When the top $k$ PCs are removed (Figure 2b), alignment drops significantly — confirming that linguistic bias is concentrated in those top directions.
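The alignment score defined above can be sketched as follows. The function name `alignment_score` and the three hand-built embeddings are assumptions for illustration; they are not the paper's data, and dropping the first row of `pcs` is only a toy analogue of removing top PCs from the text manifold.

```python
import numpy as np

def alignment_score(V, m, pcs):
    """Mean fraction of each centered embedding's squared norm lying in span(pcs).

    V: (M, d) vision embeddings at one layer
    m: (d,) text-manifold mean
    pcs: (K, d) orthonormal text principal components (rows)
    """
    centered = V - m
    coeffs = centered @ pcs.T            # (M, K) coefficient along each PC
    z = coeffs @ pcs                     # (M, d) projection onto the PCA basis
    ratio = np.sum(z**2, axis=1) / np.sum(centered**2, axis=1)
    return float(np.mean(ratio))

pcs = np.eye(4)[:2]                      # toy text basis: e0 (top PC) and e1
m = np.zeros(4)
V = np.array([[1.0, 0.0, 0.0, 0.0],      # lies entirely along the top PC
              [0.0, 0.0, 1.0, 0.0],      # lies entirely outside the text basis
              [1.0, 0.0, 1.0, 0.0]])     # half inside, half outside

full = alignment_score(V, m, pcs)              # (1 + 0 + 0.5) / 3 = 0.5
without_top = alignment_score(V, m, pcs[1:])   # 0.0 once e0 is dropped
```

In the toy data all of the alignment comes from the top direction, so removing it sends the score to zero; that is the pattern the paper reports in softer form for real embeddings.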

Layer-wise Alignment Trajectory

The plot shows the alignment score across decoder layers as the number of top PCs removed (k) is varied.

Why this matters

The paper’s linear-probe results are the most damaging evidence. In the baseline model, the ability of a linear probe to decode visual content degrades through the network: the model literally forgets what it saw. When the top text PCs are projected out, decodability recovers. The top PCs aren’t helping the model see; they’re actively masking the visual signal.

CHAPTER 4

The Orthogonal Cure

Remove the top text principal components from vision embeddings. That’s it. The math is clean, the overhead is zero, and the effect is dramatic.

In plain English

Imagine you’re editing a photo and notice a grey haze over everything. You discover the haze is caused by a single light setting on your camera. Turn off that light, and the true colours come through. That’s what this method does: it identifies the “light setting” (the top text PCs) that’s been casting a linguistic haze over the visual data, and turns it off.

The debiasing is a projection: take the vision embedding, compute how much of it lies along each of the top text PCs, and subtract those components out. What remains is orthogonal to the linguistic bias — pure visual signal.

In the accompanying simulation, the red arrow is the projection onto the text PCs and the teal arrow is the debiased residual, which captures the visual content.

The training-free debiasing strategy starts by mean-centering the vision embedding $\vec{v}$ using the text manifold mean $\vec{m}$. It then projects out the top-$k$ principal components:

$$\vec{v}_{\text{debiased}} = \vec{v} - \sum_{i=1}^{k} \left(\left(\vec{v} - \vec{m}\right) \cdot \vec{k}_i\right) \vec{k}_i$$

The subtracted term is the projection of the centered embedding onto the span of the top-$k$ text PCs, so the centered result $\vec{v}_{\text{debiased}} - \vec{m}$ lies in the orthogonal complement of that span. By projecting out these components, the method explicitly removes the textual bias, forcing the VLM to rely on generalized semantic information rather than statistical linguistic shortcuts.
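The debiasing formula above translates almost directly into NumPy. This is a sketch under the assumption that the text PCs are orthonormal and stored as rows; the function name `debias` and the toy numbers are hypothetical.

```python
import numpy as np

def debias(v, m, pcs, k):
    """Return v - sum_{i<=k} ((v - m) . k_i) k_i for v of shape (d,) or (batch, d)."""
    top = pcs[:k]                    # (k, d) top-k text PCs
    coeffs = (v - m) @ top.T         # coefficient of the centered v along each PC
    return v - coeffs @ top          # subtract those components from v

pcs = np.eye(3)[:2]                  # toy text PCs: e0 and e1
m = np.array([0.5, 0.0, 0.0])        # toy text-manifold mean
v = np.array([2.0, 3.0, 4.0])        # toy vision embedding

v0 = debias(v, m, pcs, k=0)          # [2.0, 3.0, 4.0]  (nothing removed)
v1 = debias(v, m, pcs, k=1)          # [0.5, 3.0, 4.0]
v2 = debias(v, m, pcs, k=2)          # [0.5, 0.0, 4.0]
```

After removal, the centered vector `v2 - m` has no component along either text PC, while the third coordinate, which the toy text basis never touches, is left intact.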

The authors propose two complementary remedies:

Training-free inference strategy — Apply the projection at inference time. No retraining, no extra compute. Just matrix subtraction.
Bias-aware fine-tuning — Integrate the projection into the training loop, depriving the network of textual shortcuts during parameter updates.
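One way to see why the inference-time variant is cheap: the projection is affine in the vision embedding, so it can be folded into a single precomputed matrix and offset. The sketch below checks that algebra on random stand-in data. The names `A` and `b` are introduced here for illustration, and where exactly such a step would sit inside a real VLM is not specified by this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2

# Hypothetical stand-ins: orthonormal "text PCs" (rows) and a text mean.
q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # (d, k), orthonormal columns
pcs = q.T                                      # (k, d), orthonormal rows
m = rng.normal(size=d)

# Precomputed once, offline.
P = pcs.T @ pcs            # (d, d) projector onto the span of the top-k PCs
A = np.eye(d) - P          # removes that span
b = m @ P                  # constant offset contributed by the text mean

V = rng.normal(size=(5, d))                # a batch of 5 toy vision embeddings

direct = V - ((V - m) @ pcs.T) @ pcs       # the formula, applied step by step
folded = V @ A + b                         # same result from the precomputed pair
```

Both paths give the same embeddings, so at inference the debiasing reduces to one matrix multiply and one addition per batch of vision tokens.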

Geometric Debiasing Visualization

The simulation has two controls, the number of PCs to remove (k) and the angle of the vision vector, and reports four quantities: original norm, projected (bias) norm, debiased norm, and bias fraction removed. The teal arrow is the debiased visual signal.

Why this matters

The elegance is in the geometry. The method doesn’t add anything — it removes. And because the text subspace is orthogonal to the fine-grained visual signal, removing the bias reveals rather than destroys. The model sees more by having less shoved into its representations.

CHAPTER 5

By the Numbers

On LLaVA-1.5, hallucination rates drop by up to 27%. On Qwen2.5-VL, by 21%. No retraining. No extra compute. Just geometry.

Hallucination Benchmark Comparison

Qwen2.5-VL-7B (7B parameters): strong recent VLM with aggressive alignment.

LLaVA-1.5-7B (7B parameters): widely studied VLM with known hallucination issues.

Ablation: How Many PCs to Remove?

The ablation varies the number of top PCs removed (K) and shows a sweet spot: with too few PCs the bias remains; with too many, visual detail starts degrading.

-26.7%: AMBER CHAIR reduction on LLaVA-1.5-7B

-21.2%: CHAIR_i reduction on Qwen2.5-VL-7B

0%: additional computational overhead

Why this matters

The method beats multi-pass decoding strategies (VCD, SID, DMAS) that require running the model multiple times per inference. A single matrix subtraction outperforms contrastive decoding — because it attacks the root cause rather than the symptoms. And the ablation study reveals a sweet spot around K=2–5: enough to remove the bias, not enough to hurt visual detail.

CHAPTER 6

What It All Means

The fine-tuning results confirm the geometric story, the logit lens reveals what the model “sees” under the bias, and the limitations point to a deeper architectural lesson.

In plain English

When the debiasing is baked into fine-tuning — not just applied at inference — the model learns to never rely on linguistic shortcuts during training. On specialist captioning tasks like satellite imagery (VRSBench) and text-in-images (TextCaps), the bias-aware model writes captions that are more accurate and less hallucinated than standard fine-tuning, using the exact same data and compute budget.

The logit lens analysis is perhaps the most striking visual evidence. In the baseline model, individual image patches decode to punctuation marks and articles — “,”, “a”, “in”. After debiasing, the same patches decode to real visual objects: “bench”, “refrigerator”, “deck”. The model was literally “seeing” syntax instead of content.

The punchline: the current decoder-based VLM architecture has a structural flaw. The requirement to compress visual data into a text-shaped space creates an inevitable bottleneck. Future architectures may need separate, modality-specific encoders.

Bias-Aware Fine-Tuning: CLAIR Scores

The chart compares zero-shot, standard fine-tuning, and bias-aware fine-tuning on caption quality (CLAIR score, higher is better).

The Logit Lens: Seeing What the Model Sees

Applying the logit lens to latent representations at later decoder layers (26–28), the authors multiply the hidden state of specific image patches by the unembedding matrix to decode which concept dominates each token:

Baseline: “,”

A visual patch in the baseline model decodes to punctuation — the highest-probability syntactic token. No visual meaning at all.

After debiasing: “bench” — the actual object in the patch.

Baseline: “a”

Another patch collapses to the most common English article. The visual content has been overwritten by text statistics.

After debiasing: “doorway” — the real architectural feature.

Baseline: “in”

A spatial preposition replaces fine-grained visual semantics. The model “sees” grammar, not content.

After debiasing: “man” — the actual person in the image region.
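The logit-lens readout described above can be sketched on a hand-built example. Everything here is a toy assumption: a four-token vocabulary, an identity unembedding matrix, and made-up hidden states. Real models typically apply a final normalization before unembedding, which this sketch omits.

```python
import numpy as np

def logit_lens(hidden, W_U, vocab):
    """Decode each hidden state to its highest-logit token.

    hidden: (P, d) hidden states of P image patches
    W_U: (d, len(vocab)) unembedding matrix
    """
    logits = hidden @ W_U
    return [vocab[i] for i in logits.argmax(axis=-1)]

vocab = [",", "a", "in", "bench"]
W_U = np.eye(4)                               # toy: each token owns one axis

# One toy patch whose largest component lies along the "," axis.
patch = np.array([[0.9, 0.1, 0.0, 0.3]])
baseline = logit_lens(patch, W_U, vocab)      # [","]

# Remove the component along a toy "bias" direction (axis 0) and decode again.
bias_dir = np.eye(4)[:1]                      # (1, 4)
debiased_patch = patch - (patch @ bias_dir.T) @ bias_dir
debiased = logit_lens(debiased_patch, W_U, vocab)   # ["bench"]
```

The example only illustrates the readout mechanics: once the dominant direction is removed, the next-largest component decides which token is decoded.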

The deeper lesson

The authors argue that the current decoder-based architecture is fundamentally flawed: the structural necessity of compressing fine-grained visual space into a text manifold creates an unavoidable bottleneck. Future architectures may need separate, modality-specific encoders to preserve semantic integrity. The geometric debiasing is a powerful band-aid — but the real fix may require rethinking the architecture entirely.

74.70: CLAIR score on VRSBench (Qwen2.5-VL), vs 72.30 for standard SFT

63.80: CLAIR score on TextCaps (LLaVA-1.5), vs 60.30 for standard SFT

K = 2: top PCs removed during fine-tuning (minimal intervention, maximal gain)
