An Interactive Reading of

When Language Overwrites Vision

Over-Alignment and Geometric Debiasing in Vision-Language Models
The paper, in plain English

Vision-Language Models — the AI systems that look at an image and describe what they see — have an embarrassing habit: they confidently describe objects that aren’t there. Show them a dining table with no person nearby, and they’ll tell you about the person they expect to see. The problem isn’t that the camera is broken. The problem is that the model’s language side is overriding its visual side.

The authors trace this failure to a geometric root cause: during training, the model is forced to squeeze rich visual data into the narrow shape of language. To bridge this gap, the model takes a shortcut — it injects high-frequency text patterns into the top dimensions of its visual representations. These patterns carry no information about the actual image; they serve purely as a bridge to make the math work. The result is a universal linguistic bias that lives in the top principal components of the embedding space, stable across every dataset tested.

The fix is deceptively simple: project out those top components, and the model stops hallucinating. On LLaVA-1.5, the method cuts hallucination rates by up to 27%. On Qwen2.5-VL, it drops them by 21%. No retraining required — just a single matrix subtraction at inference time, adding zero computational overhead. When baked into fine-tuning, it improves caption quality on specialist domains like satellite imagery. The core insight: sometimes the best way to make AI see better is to stop it from listening to itself.

I
Over-Alignment
Forcing visual embeddings into the text manifold injects a statistical linguistic bias that systematically overshadows fine-grained visual evidence.
II
Universal Text Subspace
The bias concentrates in the top principal components of a dataset-agnostic text manifold — the same directions dominate regardless of which captions you use.
III
Geometric Debiasing
Projecting out those top PCs from vision embeddings slashes hallucinations by up to 27% with no retraining and zero computational overhead.
CHAPTER 1

Inside the VLM

A Vision-Language Model has three moving parts: eyes, a translator, and a mouth. The problem lives in the translator — and it’s subtler than you’d think.

Decoder-based VLMs couple a visual encoder with a pretrained language model via a cross-modal projector. The central challenge is the modality gap: visual and linguistic representations naturally occupy disjoint manifolds within the shared latent space.

Because the LLM decoder uses scaled dot-product attention, both modalities must share a consistent representation space for the dot product $\mathbf{Q} \cdot \mathbf{K}^T$ to yield meaningful similarity scores. This creates a mathematical necessity for some degree of cross-modal alignment.

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\mathbf{V}$$

Here $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are the query, key, and value matrices, and $d$ is the key dimension. When visual queries attend to textual keys, the scores only mean something if both sides live in the same coordinate system, so some degree of alignment is unavoidable.
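To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is not taken from the paper: the function name, the array shapes, and the random "visual queries" and "text keys" are illustrative assumptions, and real decoders add multiple heads, masking, and learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for 2-D arrays.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v).
    Returns the attended values and the attention weights.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row-wise max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Hypothetical toy data: 3 visual queries attending over 5 text keys.
rng = np.random.default_rng(0)
d = 8
visual_queries = rng.normal(size=(3, d))
text_keys = rng.normal(size=(5, d))
text_values = rng.normal(size=(5, d))

out, attn = scaled_dot_product_attention(visual_queries, text_keys, text_values)
print(out.shape, attn.shape)
```

Each row of `attn` sums to 1. The dot products inside `scores` are only comparable across modalities when queries and keys share a coordinate system, which is the pressure toward alignment described above.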

The question is: how much alignment is too much?

VLM Architecture

Interactive diagram of the three pipeline components (the visual encoder, the cross-modal projector, and the language model decoder), showing each component’s role, its failure mode, and where over-alignment originates.
Why this matters
The modality gap isn’t a bug — it’s a mathematical consequence of training vision and language encoders separately. The problem begins when the cross-modal projector aggressively closes this gap by injecting statistical textual patterns into the visual representations. What starts as a necessary bridge becomes a structural shortcut.
CHAPTER 2

The Universal Text Subspace

Every text dataset — whether it’s COCO captions, chart descriptions, or document Q&A — carves out nearly the same top directions in embedding space. That’s the linguistic bias, and it’s everywhere.

To investigate the geometry of the shared space, the authors first rigorously define the text manifold. They compute principal components from caption embeddings drawn from pretraining datasets:

Given text token embeddings $\{\mathbf{t}_1, \dots, \mathbf{t}_N\}$, compute the covariance matrix and extract the top-$K$ principal components $\{\vec{k}_1, \dots, \vec{k}_K\}$, forming the text-induced PCA basis.
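The recipe above can be sketched in a few lines of NumPy. This is one reasonable way to do it, assuming the embeddings are stacked as rows of a matrix `T`; the function name `text_pca_basis` and the synthetic data are hypothetical, and the paper may compute the basis differently (for example, by an explicit eigendecomposition of the covariance matrix).

```python
import numpy as np

def text_pca_basis(T, K):
    """Return the text mean, the top-K principal components, and their variances.

    T: (N, d) array of text token embeddings, one per row.
    The components are returned as orthonormal rows of shape (K, d).
    """
    m = T.mean(axis=0)
    centered = T - m
    # Rows of Vt are the eigenvectors of the covariance matrix,
    # ordered by decreasing singular value (hence decreasing variance).
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    variances = S**2 / (T.shape[0] - 1)
    return m, Vt[:K], variances[:K]

# Hypothetical synthetic "text embeddings" with one planted dominant direction.
rng = np.random.default_rng(0)
N, d, K = 200, 16, 4
T = rng.normal(size=(N, d))
T[:, 0] *= 10.0  # axis 0 carries far more variance than the others

m, pcs, variances = text_pca_basis(T, K)
print(pcs.shape)
print(np.round(variances, 2))
```

In this toy setup the first component lines up with the planted axis, which mirrors the claim that a few directions dominate the text manifold.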

The critical finding: the top PCs form a consistently common manifold across all evaluated datasets. This means standard pretraining injects a universal statistical linguistic bias rather than task-specific semantic structures.

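To verify stability, the authors measure the overlap between the text subspaces of different datasets with a normalized Frobenius norm. Reading the formula, $\mathbf{U}_i$ is presumably the matrix whose orthonormal columns are the top principal components of dataset $i$, and $r_i$ is the number of components kept. Under that reading, a score of 1 means the smaller subspace lies entirely inside the larger one, and a score of 0 means the two are orthogonal: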

$$\text{Similarity} = \frac{\|\mathbf{U}_i^T \mathbf{U}_j\|_F}{\sqrt{\min(r_i, r_j)}}$$
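A small sketch of this similarity score, assuming each $\mathbf{U}$ stores its principal components as orthonormal columns (an interpretation of the formula, not something the text states outright). The function name and the random bases are hypothetical.

```python
import numpy as np

def subspace_similarity(U_i, U_j):
    """Normalized Frobenius overlap between two subspaces.

    U_i: (d, r_i), U_j: (d, r_j), each with orthonormal columns.
    """
    r = min(U_i.shape[1], U_j.shape[1])
    return np.linalg.norm(U_i.T @ U_j, ord="fro") / np.sqrt(r)

# Build a random orthonormal basis of R^10 and carve subspaces out of it.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
U = Q[:, :3]   # a 3-dimensional subspace
W = Q[:, 3:6]  # a 3-dimensional subspace orthogonal to U

# Rotating the basis inside the same subspace should not change the score.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
U_rotated = U @ R

same = subspace_similarity(U, U_rotated)
disjoint = subspace_similarity(U, W)
print(round(float(same), 6), round(float(disjoint), 6))
```

The score depends only on the subspaces, not on the particular basis vectors chosen, which is what makes it suitable for comparing PCA bases fitted on different datasets.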

Cross-Dataset Text Manifold Stability

The chart shows how similar the top principal components are across different text datasets.

Why this matters
If the linguistic bias were dataset-specific, you could fix it by changing your training data. But because it’s universal — the same top directions dominate regardless of what captions you use — the problem is structural. The only fix is to remove those directions from the visual representations.
CHAPTER 3

When Alignment Overwrites Vision

The model doesn’t just bridge the gap between vision and language. It overwrites visual detail with statistical text patterns. The evidence is in the layer-by-layer alignment scores.

The authors define the text-aligned representation of a vision embedding $\vec{v}^{(l)}$ at layer $l$ as its projection onto the text PCA basis:

$$\vec{z}^{(l)} = \sum_{i=1}^{K} \left(\left(\vec{v}^{(l)} - \vec{m}\right) \cdot \vec{k}_i\right) \vec{k}_i$$

where $\vec{m}$ is the mean of the text manifold. The alignment score is the ratio of the squared projected norm to the squared centered norm, averaged over the $M$ vision embeddings at that layer:

$$\text{Align}^{(l)} = \frac{1}{M} \sum_{j=1}^{M} \frac{\|\vec{z}_j^{(l)}\|^2}{\|\vec{v}_j^{(l)} - \vec{m}\|^2}$$
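The two formulas above translate directly into code. The sketch below assumes the principal components are stored as orthonormal rows and uses a tiny hand-built example so the result can be checked by hand; the function name and all numbers are illustrative, not values from the paper.

```python
import numpy as np

def alignment_score(V, m, pcs):
    """Mean fraction of squared centered norm captured by the text PCs.

    V: (M, d) vision embeddings at one layer.
    m: (d,) mean of the text manifold.
    pcs: (K, d) orthonormal text principal components as rows.
    """
    centered = V - m
    coeffs = centered @ pcs.T   # (M, K) coordinates in the text basis
    z = coeffs @ pcs            # (M, d) text-aligned representation
    ratios = (z**2).sum(axis=1) / (centered**2).sum(axis=1)
    return ratios.mean()

# Hypothetical toy setup: the "text subspace" is spanned by the first two axes.
d = 6
m = np.zeros(d)
pcs = np.eye(d)[:2]

inside = np.array([[1.0, 2.0, 0.0, 0.0, 0.0, 0.0]])   # fully in the text subspace
outside = np.array([[0.0, 0.0, 3.0, 0.0, 0.0, 0.0]])  # fully outside it
mixed = np.array([[3.0, 0.0, 4.0, 0.0, 0.0, 0.0]])    # 9 of 25 squared-norm units inside

print(alignment_score(inside, m, pcs))
print(alignment_score(outside, m, pcs))
print(alignment_score(mixed, m, pcs))
```

The score always falls between 0 and 1: values near 1 mean the vision embedding is almost entirely explained by text directions.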

As shown in Figure 2a of the paper, full vision embeddings exhibit aggressively high alignment in early-to-middle layers. When the top $k$ PCs are removed (Figure 2b), alignment drops significantly — confirming that linguistic bias is concentrated in those top directions.

Layer-wise Alignment Trajectory

Interactive chart: a slider sets the number of top PCs removed, showing how the alignment trajectory across decoder layers changes.

Why this matters
The linear-probe results are the most telling. A linear probe is a simple classifier trained on a layer’s hidden states to test how much visual information can still be decoded from them. In the baseline model, visual decodability degrades through the network: the model progressively loses what it saw. When the top text PCs are projected out, decodability recovers. The top PCs aren’t helping the model see; they’re actively masking the visual signal.
CHAPTER 4

The Orthogonal Cure

Remove the top text principal components from vision embeddings. That’s it. The math is clean, the overhead is zero, and the effect is dramatic.

The training-free debiasing strategy starts by mean-centering the vision embedding $\vec{v}$ using the text manifold mean $\vec{m}$. It then projects out the top-$k$ principal components:

$$\vec{v}_{\text{debiased}} = \vec{v} - \sum_{i=1}^{k} \left(\left(\vec{v} - \vec{m}\right) \cdot \vec{k}_i\right) \vec{k}_i$$

Measured from the text mean $\vec{m}$, the result lies in the orthogonal complement of the span of the top-$k$ text PCs: its component along each of those directions is exactly zero. By projecting out these components, the method explicitly removes the textual bias, forcing the VLM to rely on generalized semantic information rather than statistical linguistic shortcuts.
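The projection can be written as a single matrix expression. The following is a minimal sketch under the assumption that the text PCs are available as orthonormal rows of an array; the function name `debias` and the random test data are hypothetical and not the authors’ implementation.

```python
import numpy as np

def debias(V, m, pcs, k):
    """Remove the top-k text principal components from vision embeddings.

    V: (M, d) vision embeddings, m: (d,) text mean,
    pcs: (K, d) orthonormal text PCs as rows, with k <= K.
    """
    top = pcs[:k]
    coeffs = (V - m) @ top.T   # (M, k) centered coordinates along the top PCs
    return V - coeffs @ top    # subtract only those components

# Hypothetical data: random embeddings and a random orthonormal basis.
rng = np.random.default_rng(0)
M, d, K, k = 5, 12, 4, 2
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
pcs = Q[:, :K].T
m = rng.normal(size=d)
V = rng.normal(size=(M, d))

V_debiased = debias(V, m, pcs, k)

# After debiasing, the centered embeddings have no component along the top-k PCs.
residual = (V_debiased - m) @ pcs[:k].T
print(np.abs(residual).max())
```

Because the operation is affine in $\vec{v}$, it amounts to one small matrix product and one subtraction per embedding, which is why it needs no retraining. Components along every direction orthogonal to the removed PCs pass through unchanged.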

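The authors propose two complementary remedies: this training-free projection, applied at inference time, and a bias-aware fine-tuning variant that removes the top PCs during training.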

Geometric Debiasing Visualization

Interactive diagram: sliders control the projection, and the teal arrow is the debiased visual signal. The readouts report the original norm, the projected (bias) norm, the debiased norm, and the fraction of bias removed.
Why this matters
The elegance is in the geometry. The method doesn’t add anything — it removes. And because the text subspace is orthogonal to the fine-grained visual signal, removing the bias reveals rather than destroys. The model sees more by having less shoved into its representations.
CHAPTER 5

By the Numbers

On LLaVA-1.5, hallucination rates drop by up to 27%. On Qwen2.5-VL, by 21%. No retraining. No extra compute. Just geometry.

Hallucination Benchmark Comparison

Qwen2.5-VL-7B
7B parameters
Strong recent VLM with aggressive alignment.
LLaVA-1.5-7B
7B parameters
Widely-studied VLM with known hallucination issues.

Ablation: How Many PCs to Remove?

Interactive chart: sweeping the number of removed PCs reveals the sweet spot. Too few PCs and the bias remains; too many and visual detail starts degrading.

-26.7%
AMBER CHAIR reduction on LLaVA-1.5-7B
-21.2%
CHAIRi reduction on Qwen2.5-VL-7B
0%
Additional computational overhead
Why this matters
The method beats multi-pass decoding strategies (VCD, SID, DMAS) that run the model multiple times per inference. A single matrix subtraction outperforms contrastive decoding because it attacks the root cause rather than the symptoms. The ablation study reveals a sweet spot around K=2–5: enough PCs to remove the bias, but not so many that visual detail degrades.
CHAPTER 6

What It All Means

The fine-tuning results confirm the geometric story, the logit lens reveals what the model “sees” under the bias, and the limitations point to a deeper architectural lesson.

Bias-Aware Fine-Tuning: CLAIR Scores

The chart compares zero-shot, standard fine-tuning, and bias-aware fine-tuning on caption quality (CLAIR score, higher is better).

The Logit Lens: Seeing What the Model Sees

The logit lens decodes intermediate representations directly into vocabulary tokens. At later decoder layers (26–28), the authors multiply the hidden state of specific image patches by the unembedding matrix to see which tokens dominate each patch:

Baseline: “,”
A visual patch in the baseline model decodes to punctuation, the highest-probability syntactic token. No visual meaning at all.
After debiasing: “bench”, the actual object in the patch.

Baseline: “a”
Another patch collapses to the most common English article. The visual content has been overwritten by text statistics.
After debiasing: “doorway”, the real architectural feature.

Baseline: “in”
A spatial preposition replaces fine-grained visual semantics. The model “sees” grammar, not content.
After debiasing: “man”, the actual person in the image region.
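The mechanics of the logit lens are simple enough to sketch. The example below is a toy illustration only: the four-word vocabulary, the identity unembedding matrix, and the hidden state are all made up, and it does not reproduce the paper’s outputs. Real models also typically apply a final normalization layer before the unembedding, which is omitted here.

```python
import numpy as np

def logit_lens(hidden, W_U, vocab, top_n=2):
    """Decode a hidden state into its highest-scoring vocabulary tokens.

    hidden: (d,) hidden state of one image patch at some decoder layer.
    W_U: (d, vocab_size) unembedding matrix.
    vocab: list of token strings, one per column of W_U.
    """
    logits = hidden @ W_U
    order = np.argsort(logits)[::-1][:top_n]  # indices of the largest logits
    return [vocab[i] for i in order]

# Hypothetical toy vocabulary and unembedding matrix (one axis per token).
vocab = [",", "a", "in", "bench"]
W_U = np.eye(4)

# A made-up hidden state whose largest coordinate points at "bench".
patch_hidden = np.array([0.1, 0.5, 0.0, 2.0])

top_tokens = logit_lens(patch_hidden, W_U, vocab)
print(top_tokens)
```

Running the same decoding on a patch’s hidden state before and after debiasing is what produces comparisons like the three listed above.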
The deeper lesson
The authors argue that the current decoder-based architecture is fundamentally flawed: the structural necessity of compressing fine-grained visual space into a text manifold creates an unavoidable bottleneck. Future architectures may need separate, modality-specific encoders to preserve semantic integrity. The geometric debiasing is a powerful band-aid — but the real fix may require rethinking the architecture entirely.
74.70
CLAIR score on VRSBench (Qwen2.5-VL), vs 72.30 for standard SFT
63.80
CLAIR score on TextCaps (LLaVA-1.5), vs 60.30 for standard SFT
K = 2
Top PCs removed during fine-tuning: minimal intervention, maximal gain