An Interactive Reading of

The EΔ-MHC-Geo Transformer:
Adaptive Geodesic Operations
with Guaranteed Orthogonality

The paper, in plain English

Most modern deep networks pass information from one layer to the next through residual connections, simple shortcuts that add a layer's input to its output. This works, but it comes with a hidden cost: nothing stops the signal's magnitude from drifting as it flows through dozens of layers. Norms swell or collapse, gradients vanish or explode, and training becomes fragile. The field has been patching this problem with ad-hoc fixes for a decade.

This paper takes a geometric approach. Instead of an additive shortcut, it replaces the identity with a rotation — specifically, a Data-Dependent Cayley transform that takes two vectors computed from the input, builds a skew-symmetric matrix, and converts it into an orthogonal rotation. The key insight is elegant: skew-symmetry is a property of the algebraic form $uv^\top - vu^\top$, not of the specific values of $u$ and $v$. So making $u(x)$ and $v(x)$ input-dependent preserves every algebraic guarantee — orthogonality, isometry, determinant $+1$ — for every input, every $\beta$, at every training step.

The result: 3.8× better stability than standard GPT on long-horizon tasks, norm deviation of just 0.001 (vs. 0.474 for GPT), and the only architecture in the comparison that can both rotate and reflect — achieving 0.96 cosine alignment on a negation task that breaks pure rotation methods. All with 33% fewer layers.

I
Data-Dependent Cayley Rotation
Skew-symmetric generators $A = uv^\top - vu^\top$ feed into the Cayley transform, producing unconditionally orthogonal rotations that adapt to every input.
II
Hybrid Gate Architecture
A learned gate $\gamma$ blends Cayley rotation (det $=+1$) with Householder reflection (det $=-1$), accessing both components of the orthogonal group $O(n)$.
III
Midpoint Collapse Regularization
The penalty $L_{\text{gate}} = 4\gamma(1-\gamma)$ drives the gate toward binary decisions — jump between orthogonal components rather than swim through non-orthogonal space.
~ 20 minutes · 8 chapters · 7 interactive simulations
CHAPTER 1

The problem with skip connections

Residual connections let us train deep networks by adding a shortcut past each layer. But this shortcut — the identity — provides no geometric guarantees. Norms drift, gradients vanish, and the deeper the network, the worse it gets.

The standard residual connection computes $X_{l+1} = X_l + F(X_l)$, where $X_l$ is the layer's input and $F$ is the layer's transformation. The identity shortcut $X_l$ is the "skip" — it lets gradients flow backward unimpeded.

The problem: nothing constrains $\|X_{l+1}\|_2$. If $F(X_l)$ has components aligned with $X_l$, norms grow. If anti-aligned, norms shrink. Over dozens of layers, this drift compounds.
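The drift is easy to reproduce in a toy setting. The sketch below is not the paper's experiment: the layer $F$ is a hypothetical stand-in that returns a scaled copy of its input plus a small orthogonal perturbation, and the names `final_norm`, `align`, and `noise_scale` are invented for illustration.

```python
import numpy as np

def final_norm(align, depth=48, n=16, noise_scale=0.05, seed=0):
    """Norm of a unit vector after `depth` toy residual layers.

    `align` > 0 makes F(x) point along x, `align` < 0 against it.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(depth):
        noise = rng.standard_normal(n)
        noise -= (noise @ x) / (x @ x) * x      # keep only the part orthogonal to x
        noise *= noise_scale * np.linalg.norm(x) / np.linalg.norm(noise)
        x = x + (align * x + noise)             # X_{l+1} = X_l + F(X_l)
    return float(np.linalg.norm(x))

grow = final_norm(+0.05)     # F aligned with x: the norm compounds upward
shrink = final_norm(-0.05)   # F anti-aligned with x: the norm decays
print(f"aligned: {grow:.3f}   anti-aligned: {shrink:.3f}")
```

Even a 5% aligned component per layer compounds multiplicatively over 48 layers, which is the effect the paragraph above describes.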

Standard Residual Connection
$$X_{l+1} = X_l + F(X_l)$$
DDL's Householder Operator — orthogonal only at $\beta \in \{0, 2\}$
$$X_{l+1} = H_\beta(X_l) + \beta k v^\top, \quad H_\beta = I - \beta k k^\top$$
Why this matters
Deep Delta Learning (DDL) tries to fix the norm drift by using a Householder reflection instead of the identity. But the Householder operator $H_\beta = I - \beta k k^\top$ is only orthogonal when $\beta = 0$ (trivial) or $\beta = 2$ (full reflection). During training, $\beta$ varies continuously — and at every other value, the orthogonality guarantee is broken.

The norm distortion is explicit (Corollary 5.2): for unit-norm $k$, $\|H_\beta x\|_2^2 = \|x\|_2^2 + (\beta^2 - 2\beta)(k^\top x)^2$. When $0 < \beta < 2$, the coefficient $\beta^2 - 2\beta$ is negative and norms shrink. When $\beta > 2$, it is positive and norms grow. There is no safe middle ground.
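The identity can be checked numerically. This is a minimal sketch that assumes a unit-norm $k$ and random test vectors; the helper names `householder` and `orth_error` are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
k = rng.standard_normal(n)
k /= np.linalg.norm(k)            # the identity assumes a unit-norm k
x = rng.standard_normal(n)

def householder(beta):
    """H_beta = I - beta * k k^T."""
    return np.eye(n) - beta * np.outer(k, k)

def orth_error(beta):
    """Frobenius norm of H^T H - I; zero only when H is orthogonal."""
    H = householder(beta)
    return float(np.linalg.norm(H.T @ H - np.eye(n)))

for beta in (0.0, 0.5, 1.0, 2.0, 2.5):
    lhs = np.linalg.norm(householder(beta) @ x) ** 2
    rhs = x @ x + (beta**2 - 2 * beta) * (k @ x) ** 2
    assert np.isclose(lhs, rhs)   # matches the closed-form distortion
    print(f"beta={beta:.1f}  norm^2 change={lhs - x @ x:+.4f}  "
          f"orthogonality error={orth_error(beta):.4f}")
```

The orthogonality error vanishes only at $\beta = 0$ and $\beta = 2$, and the sign of the norm change flips at $\beta = 2$.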

Interactive: Norm drift — standard vs. orthogonal residual

Drag the sliders to see how signal magnitude changes across network depth under three residual connection strategies.

Readouts: final norm under Standard, DDL, and Cayley.
CHAPTER 2

The algebraic key

Every guarantee in this paper traces back to one fact: the matrix $A = uv^\top - vu^\top$ is always skew-symmetric, no matter how $u$ and $v$ are computed. That algebraic form is the load-bearing wall.

The paper defines two "generator networks" that compute vectors from the input's mean-pooled representation:

Generator Networks (Definition 3.1)
$$u(x) = W_u \cdot \bar{x} + b_u \in \mathbb{R}^n, \quad v(x) = W_v \cdot \bar{x} + b_v \in \mathbb{R}^n$$
Data-Dependent Skew-Symmetric Generator (Definition 3.2)
$$A(x) = u(x)v(x)^\top - v(x)u(x)^\top$$
Skew-Symmetry Preservation (Proposition 3.3)
$$A^\top = (uv^\top - vu^\top)^\top = vu^\top - uv^\top = -A$$
Why this matters
Corollary 3.4 is the architectural payload. The skew-symmetry of $A(x)$ holds regardless of how $u(x)$ and $v(x)$ are computed — whether by fixed parameters, linear layers, or deep neural networks. This means we can use arbitrarily expressive networks to determine which plane to rotate in, without ever worrying about breaking the algebraic structure that guarantees orthogonality downstream.
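A small sketch makes the point concrete. The shapes are assumptions: the source does not state the dimension of the pooled input, so `d` and `tokens` are hypothetical, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tokens = 4, 32, 10          # hypothetical sizes, chosen for illustration

# Random stand-ins for the learned generator parameters W_u, b_u, W_v, b_v.
W_u, b_u = rng.standard_normal((n, d)), rng.standard_normal(n)
W_v, b_v = rng.standard_normal((n, d)), rng.standard_normal(n)

def skew_generator(X):
    """A(x) = u v^T - v u^T, with u and v linear in the mean-pooled input."""
    x_bar = X.mean(axis=0)
    u = W_u @ x_bar + b_u
    v = W_v @ x_bar + b_v
    return np.outer(u, v) - np.outer(v, u)

A = skew_generator(rng.standard_normal((tokens, d)))
assert np.allclose(A.T, -A)                    # skew-symmetric for any input
eigvals = np.linalg.eigvals(A)
assert np.allclose(eigvals.real, 0.0, atol=1e-9)   # purely imaginary spectrum
```

Swapping the two linear maps for any other function of the input leaves both assertions intact, since they depend only on the form $uv^\top - vu^\top$.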

Interactive: Skew-symmetric generator explorer

Move the $u$ and $v$ vectors in 2D. The resulting skew-symmetric matrix $A = uv^\top - vu^\top$ always satisfies $A^\top = -A$.

Readouts: A[0,1] (upper-right), A[1,0] (lower-left), Aᵀ + A (should be 0), eigenvalues (purely imaginary).
CHAPTER 3

From skew-symmetry to rotation

The Cayley transform converts a skew-symmetric matrix into an orthogonal rotation matrix. The paper makes this transform data-dependent — and proves that orthogonality holds unconditionally.

Data-Dependent Cayley Transform (Definition 3.5)
$$Q(x) = \left(I + \frac{\beta(x)}{2} A(x)\right)^{-1}\left(I - \frac{\beta(x)}{2} A(x)\right)$$
Unconditional Orthogonality (Theorem 4.1)
$$Q(x)^\top Q(x) = I_n \quad \text{for all } u, v, \beta$$

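The proof is four steps (Section 4). Setting $M = \frac{\beta}{2}A(x)$, we have $Q = (I+M)^{-1}(I-M)$. Skew-symmetry does two jobs: the eigenvalues of $M$ are purely imaginary, so $I+M$ and $I-M$ are always invertible, and $M^\top = -M$ gives $Q^\top = (I+M)(I-M)^{-1}$. Because $I+M$ and $I-M$ are polynomials in $M$, they commute with each other and with each other's inverses. The rest is algebra: $Q^\top Q = (I+M)(I-M)^{-1}(I+M)^{-1}(I-M) = I$.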

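Key corollaries that follow immediately: $Q(x)$ is an isometry, so $\|Q(x)y\|_2 = \|y\|_2$ for every vector $y$ (Theorem 4.3), and it is a proper rotation with $\det Q(x) = +1$ (Theorem 4.4).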

Why this matters
Unlike DDL, where orthogonality is conditional on $\beta \in \{0, 2\}$, the Cayley transform is orthogonal for every $\beta$ and every input. Unlike mHC's Sinkhorn projection, which is iterative and meets its constraint only approximately after 20+ iterations, the Cayley transform is exactly orthogonal (up to floating-point error) in a single matrix solve. This is not a tuning trick: it is a structural guarantee baked into the algebra.
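The guarantees can be spot-checked with one linear solve per input. This sketch uses random vectors in place of the learned generators and a hypothetical helper name `cayley`; it illustrates Definition 3.5 and is not the paper's implementation.

```python
import numpy as np

def cayley(u, v, beta):
    """Q = (I + (beta/2) A)^{-1} (I - (beta/2) A), with A = u v^T - v u^T."""
    n = u.shape[0]
    M = 0.5 * beta * (np.outer(u, v) - np.outer(v, u))
    I = np.eye(n)
    return np.linalg.solve(I + M, I - M)   # one solve, no explicit inverse

rng = np.random.default_rng(0)
n = 4
for beta in (-3.0, 0.1, 1.0, 50.0):
    u, v, y = rng.standard_normal((3, n))
    Q = cayley(u, v, beta)
    assert np.allclose(Q.T @ Q, np.eye(n))                         # orthogonal
    assert np.isclose(np.linalg.det(Q), 1.0)                       # proper rotation
    assert np.isclose(np.linalg.norm(Q @ y), np.linalg.norm(y))    # isometry
```

The checks pass for negative, small, and large $\beta$ alike, up to floating-point tolerance.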

Interactive: Cayley rotation explorer

Adjust $\beta$ and the generator angle. Watch the 2D rotation update live — and verify that $Q^\top Q = I$ and $\det(Q) = +1$ at every setting.

Readouts: rotation angle θ, ‖QᵀQ − I‖, det(Q), eigenvalues.
CHAPTER 4

The negation gap

The Cayley transform produces beautiful rotations — but it can never negate a signal. Eigenvalue $\lambda = -1$ is algebraically excluded. For tasks that require rapid sign reversal ("Actually, no — I meant the opposite"), this is a real limitation.

Eigenvalue Exclusion (Theorem 4.6)
$$\lambda_k = e^{-2i\arctan(\beta\mu_k/2)}, \quad \arctan: \mathbb{R} \to \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$$
$$\arg \lambda_k \in (-\pi, \pi) \implies \lambda_k = -1 = e^{i\pi} \text{ is impossible}$$
Householder Reflection (Definition 6.1) — has eigenvalue $-1$ at $\beta = 2$
$$H_2(k) = I - 2kk^\top, \quad H_2(k)k = -k$$
Why this matters
For a model that needs to correct itself — "Wait, I was wrong, flip the signal" — the ability to negate is essential. The Cayley transform cannot do this, no matter how you set its parameters. But the Householder reflection at $\beta = 2$ does exactly this: $H_2(k)k = -k$. The paper's solution: use both, connected by a learned gate.
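The gap shows up clearly in two dimensions. The sketch below assumes the simplest generator, $u = e_1$ and $v = e_2$, so that $\mu = 1$; the helper names are illustrative.

```python
import numpy as np

def cayley_2d(beta):
    """Cayley transform of A = e1 e2^T - e2 e1^T, whose eigenvalues are +/- i."""
    A = np.array([[0.0, 1.0], [-1.0, 0.0]])
    M = 0.5 * beta * A
    return np.linalg.solve(np.eye(2) + M, np.eye(2) - M)

def distance_to_minus_one(beta):
    """Smallest distance from any eigenvalue of Q to -1."""
    eig = np.linalg.eigvals(cayley_2d(beta))
    return float(np.min(np.abs(eig + 1.0)))

for beta in (1.0, 10.0, 100.0, 1e4):
    predicted = 2.0 * np.arctan(beta / 2.0)   # |angle| from Theorem 4.6, mu = 1
    observed = np.abs(np.angle(np.linalg.eigvals(cayley_2d(beta)))).max()
    assert np.isclose(predicted, observed)
    assert distance_to_minus_one(beta) > 0.0  # approaches -1 but never lands on it

# The Householder reflection at beta = 2 negates its own axis exactly.
k = np.array([0.6, 0.8])                      # unit-norm reflection axis
H2 = np.eye(2) - 2.0 * np.outer(k, k)
assert np.allclose(H2 @ k, -k)
assert np.isclose(np.linalg.det(H2), -1.0)
```

Raising $\beta$ shrinks the distance to $-1$ but never closes it, while the reflection reaches $-1$ with no limit involved.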

Interactive: Eigenvalue spectrum — Cayley vs. Householder

Drag $\beta$ to see how Cayley eigenvalues approach but never reach $-1$, while the Householder eigenvalue jumps exactly to $-1$ when $\beta = 2$.

Readouts: Cayley angle (rad), distance to −1, Householder eigenvalue.
CHAPTER 5

The hybrid architecture

The EΔ-MHC-Geo Hybrid blends a Cayley rotation and a Householder reflection through a learned gate $\gamma$. At the gate boundaries, the operator is exactly orthogonal — accessing both connected components of $O(n)$.

EΔ-MHC-Geo Hybrid Operator (Definition 6.5)
$$G_\gamma(X) = \gamma(X) \cdot \underbrace{Q(X)X}_{\text{Cayley rotation}} + (1 - \gamma(X)) \cdot \underbrace{H_2(k(X))X}_{\text{Householder reflection}}$$
Full Layer Transition (Equation 6)
$$X_{l+1} = G_\gamma(X_l) + H_{\text{post}}^\top F\left(H_{\text{pre}} \cdot \text{LN}(G_\gamma(X_l))\right)$$

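The gate $\gamma(X) = \sigma(W_\gamma \cdot \bar{X} + b_\gamma) \in (0,1)$ is a learned sigmoid. The Cayley branch $Q(X) \in SO(n)$ provides rotation ($\det = +1$); the Householder branch $H_2(k(X))$ provides reflection ($\det = -1$). At the boundaries, $\gamma = 1$ gives $G_\gamma = Q(X)$, a pure Cayley rotation, and $\gamma = 0$ gives $G_\gamma = H_2(k(X))$, a pure Householder reflection. Both are exactly orthogonal.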

Why this matters
The orthogonal group $O(n)$ has two disconnected components: $SO(n)$ (rotations, $\det = +1$) and $O(n) \setminus SO(n)$ (reflections, $\det = -1$). There is no continuous path between them that stays on the manifold. The hybrid gate is a learned decision about which component to use — the model learns to jump between them based on the task.
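The disconnection can be seen numerically by treating the blend as a single matrix $\gamma Q + (1-\gamma)H$. This sketch fixes $Q$ and $H$ from random vectors instead of computing them from the input, and sets $\gamma$ by hand rather than through a learned sigmoid; `hybrid` and `orth_error` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
u, v, k = rng.standard_normal((3, n))
k /= np.linalg.norm(k)                                  # unit-norm reflection axis

M = 0.5 * (np.outer(u, v) - np.outer(v, u))             # beta = 1
Q = np.linalg.solve(np.eye(n) + M, np.eye(n) - M)       # Cayley rotation, det = +1
H = np.eye(n) - 2.0 * np.outer(k, k)                    # Householder reflection, det = -1

def hybrid(gamma):
    """Linear blend of the two branches as one matrix."""
    return gamma * Q + (1.0 - gamma) * H

def orth_error(G):
    """Frobenius norm of G^T G - I."""
    return float(np.linalg.norm(G.T @ G - np.eye(n)))

for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    G = hybrid(gamma)
    print(f"gamma={gamma:.2f}  det={np.linalg.det(G):+.4f}  "
          f"orthogonality error={orth_error(G):.4f}")
```

The determinant moves continuously from $-1$ to $+1$, so it must cross zero along the way. In this linear blend the crossing sits at $\gamma = 0.5$, because $Q^\top H$ is orthogonal with determinant $-1$ and therefore has an eigenvalue of $-1$ that the midpoint average cancels.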

Interactive: Gate behavior — blending rotation and reflection

Drag the gate $\gamma$ slider. At $\gamma = 1$, the operator is pure Cayley rotation. At $\gamma = 0$, pure Householder reflection. In between, a non-orthogonal blend.

Readouts: det(Gγ), ‖GᵀG − I‖, norm ratio, active operator.
CHAPTER 6

Jump, don't swim

The hybrid operator is only orthogonal at $\gamma = 0$ and $\gamma = 1$. The midpoint collapse regularizer $L_{\text{gate}} = \lambda_{\text{gate}} \cdot 4\gamma(1-\gamma)$ pushes the gate toward these boundaries, with a surprising zero-gradient trap at $\gamma = 0.5$.

Midpoint Collapse Regularization (Definition 7.2)
$$L_{\text{gate}} = \lambda_{\text{gate}} \cdot 4\gamma(1 - \gamma)$$
Universal Zero-Gradient at Midpoint (Theorem 7.3)
$$\text{Any smooth, symmetric } f: [0,1] \to \mathbb{R} \text{ with } f(\gamma) = f(1-\gamma) \text{ has } f'(0.5) = 0$$

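The total loss is $L_{\text{total}} = L_{\text{task}} + \sum_{\text{layers}} L_{\text{gate}}$. The regularization gradient $\partial L_{\text{gate}} / \partial \gamma = 4\lambda_{\text{gate}}(1 - 2\gamma)$ is positive for $\gamma < 0.5$ and negative for $\gamma > 0.5$, so gradient descent pushes $\gamma$ toward the nearer boundary. But at $\gamma = 0.5$ exactly, this gradient vanishes, and any smooth, symmetric regularizer suffers the same fate (Theorem 7.3): differentiating $f(\gamma) = f(1-\gamma)$ gives $f'(\gamma) = -f'(1-\gamma)$, so $f'(0.5) = -f'(0.5) = 0$.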

Three escape mechanisms break the symmetry: (1) task loss gradient $L'_{\text{task}} \neq 0$, (2) input variation causing $\gamma$ to fluctuate across samples, and (3) biased initialization $b_\gamma \neq 0$.
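The trap and two of the escapes can be simulated in a few lines. This sketch simplifies heavily: it runs plain gradient descent on $\gamma$ directly, with clipping to $[0, 1]$, rather than on the sigmoid's parameters, and it models the task loss as a constant gradient. The function name and its parameters are hypothetical.

```python
def simulate_gate(gamma0, lam, task_grad=0.0, lr=0.05, steps=400):
    """Gradient descent on lam * 4*gamma*(1 - gamma) + task_grad * gamma."""
    gamma = gamma0
    for _ in range(steps):
        grad = lam * 4.0 * (1.0 - 2.0 * gamma) + task_grad
        gamma = min(1.0, max(0.0, gamma - lr * grad))   # keep gamma in [0, 1]
    return gamma

stuck = simulate_gate(0.5, lam=1.0)                    # no symmetry breaking
pushed = simulate_gate(0.5, lam=1.0, task_grad=1e-3)   # tiny task gradient
biased = simulate_gate(0.55, lam=1.0)                  # biased starting point
print(stuck, pushed, biased)
```

With no task gradient and an exactly centered start, the gate never moves. Either a tiny task gradient or a slightly off-center start is amplified at every step until the gate reaches a boundary.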

Why this matters
The ablation (Table 9) shows that without regularization ($\lambda_{\text{gate}} = 0$), the gate lingers in $\gamma \in [0.3, 0.7]$ and performance degrades by 44%. At $\lambda_{\text{gate}} \geq 0.1$, binary polarization occurs and performance saturates. The regularizer isn't optional: it's the mechanism that makes the hybrid architecture work.

Interactive: Regularization dynamics

Adjust the regularization weight $\lambda_{\text{gate}}$ and the task loss gradient. Simulate how $\gamma$ evolves over training steps.

CHAPTER 7

The experiments speak

Four benchmarks, five architectures, matched parameters at ~1.79M each. The results confirm the algebraic predictions: EΔ-MHC-Geo dominates on stability and near-π rotation, while the hybrid gate learns to select the correct operator automatically.

All models were configured for fair comparison with matched parameter counts (~1.79M). Training used AdamW with cosine decay, 2000 iterations, and results are averaged over 3 random seeds.

Interactive: Benchmark comparison dashboard

Hover over bars for exact values. The charts show the paper's main experimental results from Table 6 and Table 12.

Why this matters
On the reflection task, EΔ-MHC-Geo's gate converges to $\gamma = 0.051 \pm 0.005$, which is within 0.051 of the theoretical target $\gamma = 0$ (pure Householder) on the gate's unit interval. Meanwhile, DDL independently discovers $\beta = 1.995 \pm 0.001$ via gradient descent, confirming the Householder theory. JPmHC, which has no reflection branch, remains stuck with negative cosine alignment at all sample sizes. The model learns the correct operator; it isn't told which to use.
CHAPTER 8

What it all means

The EΔ-MHC-Geo Transformer makes a precise claim: the Cayley transform's orthogonality guarantee is algebraic, not parametric. Making the generator data-dependent preserves every guarantee. The experiments confirm it — but also reveal where the current approach hits its limits.

Theoretical contributions. The paper establishes that the Data-Dependent Cayley transform is unconditionally orthogonal (Theorem 4.1), isometric (Theorem 4.3), and a proper rotation (Theorem 4.4) for all inputs and all $\beta$. It proves that eigenvalue $\lambda = -1$ is excluded (Theorem 4.6), motivating the Householder branch. And it identifies a universal zero-gradient trap at $\gamma = 0.5$ (Theorem 7.3) for any smooth, symmetric regularizer.

Empirical highlights. At matched ~1.79M parameters with 3 seeds: best long-horizon stability (3.8× over GPT, 1.9× over JPmHC), norm deviation of just 0.001 (474× better than GPT), best near-π loss (4.5× over JPmHC on single-plane), 0.96 cosine alignment on the negation diagnostic — all with 33% fewer layers.

Limitations. Experiments are on synthetic benchmarks at ~1.79M parameters. The $O(n^3)$ Cayley matrix solve is negligible at $n = 4$ streams but could bottleneck at larger $n$. Three seeds per configuration provide narrow confidence intervals but are a modest sample. And the linear blending at intermediate $\gamma$ is non-orthogonal during early training.

Future directions. Scaling to large language models, geodesic interpolation on $O(n)$ to replace convex blending, and extending to unitary groups for complex-valued architectures.

The bottom line
The deepest message isn't about any single number. It's that the structure of inter-layer transformations matters as much as their quantity. EΔ-MHC-Geo achieves competitive or superior results with 33% fewer layers, suggesting that when residual connections respect the geometry of the signal space, networks can be shallower without being weaker.

The paper is also careful about scope: it does not claim uniform superiority over JPmHC. JPmHC's wider representation and full-rank mixer excel on pure rotation. The contribution is the hybrid, the only evaluated architecture that can handle both rotation and reflection, with exact orthogonality at each component.

Built from arXiv:2605.06729 · Shahmansoori (2026)

Code: github.com/arash-shahmansoori/edelta