An Interactive Reading of

The EΔ-MHC-Geo Transformer:
Adaptive Geodesic Operations
with Guaranteed Orthogonality

Arash Shahmansoori
Independent Researcher · May 2026 · arXiv:2605.06729

The paper, in plain English

Most modern deep networks pass information from one layer to the next through residual connections: a simple shortcut that adds the input to the layer's output. This works, but it comes with a hidden cost: nothing stops the signal's magnitude from drifting as it flows through dozens of layers. Norms swell or collapse, gradients vanish or explode, and training becomes fragile. The usual remedies (normalization layers, careful initialization, scaled residuals) are workarounds rather than guarantees.

This paper takes a geometric approach. Instead of an additive shortcut, it replaces the identity with a rotation — specifically, a Data-Dependent Cayley transform that takes two vectors computed from the input, builds a skew-symmetric matrix, and converts it into an orthogonal rotation. The key insight is elegant: skew-symmetry is a property of the algebraic form $uv^\top - vu^\top$, not of the specific values of $u$ and $v$. So making $u(x)$ and $v(x)$ input-dependent preserves every algebraic guarantee — orthogonality, isometry, determinant $+1$ — for every input, every $\beta$, at every training step.

The result: 3.8× better stability than standard GPT on long-horizon tasks, norm deviation of just 0.001 (vs. 0.474 for GPT), and the only architecture in the comparison that can both rotate and reflect — achieving 0.96 cosine alignment on a negation task that breaks pure rotation methods. All with 33% fewer layers.

I

Data-Dependent Cayley Rotation

Skew-symmetric generators $A = uv^\top - vu^\top$ feed into the Cayley transform, producing unconditionally orthogonal rotations that adapt to every input.

II

Hybrid Gate Architecture

A learned gate $\gamma$ blends Cayley rotation (det $=+1$) with Householder reflection (det $=-1$), accessing both components of the orthogonal group $O(n)$.

III

Midpoint Collapse Regularization

The penalty $L_{\text{gate}} = 4\gamma(1-\gamma)$ drives the gate toward binary decisions — jump between orthogonal components rather than swim through non-orthogonal space.

~ 20 minutes · 8 chapters · 7 interactive simulations

CHAPTER 1

The problem with skip connections

Residual connections let us train deep networks by adding a shortcut past each layer. But this shortcut — the identity — provides no geometric guarantees. Norms drift, gradients vanish, and the deeper the network, the worse it gets.

In plain English

Imagine a row of people passing a bucket of water down a line. The standard shortcut says: "Just add the same bucket at each step." But what if each person also pours their own cup in? The bucket overflows by step 10. Or what if each person spills a little? By step 20, the bucket is empty. That's what happens inside a deep neural network — the signal either blows up or fades away.

The fix the paper proposes is more like a rotation. Instead of blindly adding, each layer rotates the bucket around a pivot point. A rotation preserves the bucket's size exactly — no overflow, no spill. The clever part: the direction of the rotation changes depending on the water's composition, so the network still adapts to different inputs.

Drag the depth slider in the simulation below and watch what happens to the signal magnitude under a standard residual versus an orthogonal one.

The standard residual connection computes $X_{l+1} = X_l + F(X_l)$, where $X_l$ is the layer's input and $F$ is the layer's transformation. The identity shortcut $X_l$ is the "skip" — it lets gradients flow backward unimpeded.

The problem: nothing constrains $\|X_{l+1}\|_2$. If $F(X_l)$ has components aligned with $X_l$, norms grow. If anti-aligned, norms shrink. Over dozens of layers, this drift compounds.
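
The compounding is easy to see in a toy setting. The sketch below is not from the paper: it assumes a hypothetical layer function that simply scales its input by ±0.1, which is the cleanest way to make $F(X_l)$ perfectly aligned or anti-aligned with $X_l$.

```python
import numpy as np

def residual_norms(x0, layer_fn, depth):
    """Apply x <- x + layer_fn(x) `depth` times and record the L2 norm after each step."""
    x = x0.copy()
    norms = [np.linalg.norm(x)]
    for _ in range(depth):
        x = x + layer_fn(x)
        norms.append(np.linalg.norm(x))
    return np.array(norms)

x0 = np.array([1.0, 0.0, 0.0, 0.0])  # unit-norm input

# Hypothetical toy layers: F(x) points along x, or against x.
aligned = residual_norms(x0, lambda x: 0.1 * x, depth=12)
anti_aligned = residual_norms(x0, lambda x: -0.1 * x, depth=12)

print(f"aligned final norm:      {aligned[-1]:.3f}")       # grows like 1.1**12
print(f"anti-aligned final norm: {anti_aligned[-1]:.3f}")  # shrinks like 0.9**12
```

A real layer is neither perfectly aligned nor perfectly anti-aligned, so the drift is noisier than this, but the multiplicative compounding is the same.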

Standard Residual Connection

$$X_{l+1} = X_l + F(X_l)$$

DDL's Householder Operator — orthogonal only at $\beta \in \{0, 2\}$

$$X_{l+1} = H_\beta(X_l) + \beta k v^\top, \quad H_\beta = I - \beta k k^\top$$

Why this matters

Deep Delta Learning (DDL) tries to fix the norm drift by using a Householder reflection instead of the identity. But the Householder operator $H_\beta = I - \beta k k^\top$ is only orthogonal when $\beta = 0$ (trivial) or $\beta = 2$ (full reflection). During training, $\beta$ varies continuously — and at every other value, the orthogonality guarantee is broken.

The norm distortion is explicit (Corollary 5.2): for unit-norm $k$, $\|Hx\|_2^2 = \|x\|_2^2 + (\beta^2 - 2\beta)(k^\top x)^2$. When $0 < \beta < 2$, the coefficient $\beta^2 - 2\beta$ is negative and norms shrink. When $\beta > 2$, it is positive and norms grow. There is no safe middle ground.
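
The identity can be checked numerically. This is a minimal sketch, assuming a unit-norm $k$ and random test vectors; the helper names `householder` and `predicted_sq_norm` are illustrative and not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
k = rng.normal(size=n)
k /= np.linalg.norm(k)  # the identity below assumes a unit-norm k
x = rng.normal(size=n)

def householder(beta, k):
    """H_beta = I - beta * k k^T."""
    return np.eye(len(k)) - beta * np.outer(k, k)

def predicted_sq_norm(beta, k, x):
    """Right-hand side of the norm identity quoted from Corollary 5.2."""
    return x @ x + (beta**2 - 2.0 * beta) * (k @ x) ** 2

rows = []
for beta in (0.0, 0.5, 1.0, 2.0, 2.5):
    H = householder(beta, k)
    actual = np.linalg.norm(H @ x) ** 2
    predicted = predicted_sq_norm(beta, k, x)
    ortho_err = np.linalg.norm(H.T @ H - np.eye(n))
    rows.append((beta, actual, predicted, ortho_err))
    print(f"beta={beta:3.1f}  |Hx|^2={actual:8.4f}  predicted={predicted:8.4f}  |H^T H - I|={ortho_err:.2e}")
```

The orthogonality error is zero (to rounding) only at $\beta = 0$ and $\beta = 2$; everywhere else the squared norm moves by exactly the predicted amount.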

Interactive: Norm drift — standard vs. orthogonal residual

Drag the sliders to see how signal magnitude changes across network depth under three residual connection strategies.

Controls: network depth (default 12 layers) and DDL β (default 1.0). Readouts: final norm for the standard, DDL, and Cayley residuals.

Next: the algebraic key that makes everything work →

CHAPTER 2

The algebraic key

Every guarantee in this paper traces back to one fact: the matrix $A = uv^\top - vu^\top$ is always skew-symmetric, no matter how $u$ and $v$ are computed. That algebraic form is the load-bearing wall.

In plain English

Think of $u$ and $v$ as two recipe ingredients. The operation "$u$ times $v$'s transpose minus $v$ times $u$'s transpose" is an antisymmetrized outer product: swapping $u$ and $v$ flips its sign, and it vanishes when $u$ and $v$ point along the same line. The result always has a specific algebraic signature: it's skew-symmetric, meaning the matrix equals the negative of its own transpose.

This is the paper's deepest insight. Skew-symmetry doesn't come from choosing the right values for $u$ and $v$. It comes from the form of the expression $uv^\top - vu^\top$ itself. So you can compute $u$ and $v$ from any neural network you like — a linear layer, an MLP, a transformer — and skew-symmetry is preserved. The algebra is the invariant; the network is the variable.

In the simulation below, drag the components of $u$ and $v$ in any direction and watch: the generator $A$ stays skew-symmetric no matter what.

The paper defines two "generator networks" that compute vectors from the input's mean-pooled representation:

Generator Networks (Definition 3.1)

$$u(x) = W_u \cdot \bar{x} + b_u \in \mathbb{R}^n, \quad v(x) = W_v \cdot \bar{x} + b_v \in \mathbb{R}^n$$

Data-Dependent Skew-Symmetric Generator (Definition 3.2)

$$A(x) = u(x)v(x)^\top - v(x)u(x)^\top$$

Skew-Symmetry Preservation (Proposition 3.3)

$$A^\top = (uv^\top - vu^\top)^\top = vu^\top - uv^\top = -A$$

Why this matters

Corollary 3.4 is the architectural payload. The skew-symmetry of $A(x)$ holds regardless of how $u(x)$ and $v(x)$ are computed — whether by fixed parameters, linear layers, or deep neural networks. This means we can use arbitrarily expressive networks to determine which plane to rotate in, without ever worrying about breaking the algebraic structure that guarantees orthogonality downstream.
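
Definitions 3.1 and 3.2 translate almost line for line into NumPy. The sketch below assumes random weights and hypothetical sizes (`n`, `d`, `seq_len`) chosen only for illustration; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, seq_len = 4, 16, 10  # hypothetical sizes, chosen only for illustration

# Hypothetical linear generator parameters (W_u, b_u, W_v, b_v in Definition 3.1).
W_u, b_u = rng.normal(size=(n, d)), rng.normal(size=n)
W_v, b_v = rng.normal(size=(n, d)), rng.normal(size=n)

def generator(x):
    """Build A(x) = u v^T - v u^T from the mean-pooled input."""
    x_bar = x.mean(axis=0)  # mean-pool over the sequence axis
    u = W_u @ x_bar + b_u
    v = W_v @ x_bar + b_v
    return np.outer(u, v) - np.outer(v, u)

x = rng.normal(size=(seq_len, d))
A = generator(x)

print("max |A^T + A|:", np.abs(A.T + A).max())
print("eigenvalue real parts:", np.linalg.eigvals(A).real.round(12))
```

Swapping the two linear maps for any other function of `x` leaves the last two lines of `generator` untouched, which is the point of Corollary 3.4.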

Interactive: Skew-symmetric generator explorer

Move the $u$ and $v$ vectors in 2D. The resulting skew-symmetric matrix $A = uv^\top - vu^\top$ always satisfies $A^\top = -A$.

Controls: u_x (default 1.0), u_y (default 0.5), v_x (default -0.5), v_y (default 1.0). Readouts: A[0,1] (upper-right), A[1,0] (lower-left), Aᵀ + A (should be 0), and the eigenvalues (pure imaginary).

Next: how skew-symmetry becomes a rotation →

CHAPTER 3

From skew-symmetry to rotation

The Cayley transform converts a skew-symmetric matrix into an orthogonal rotation matrix. The paper makes this transform data-dependent — and proves that orthogonality holds unconditionally.

In plain English

Imagine you have a spinning top. The direction it spins is determined by a single angle. The Cayley transform is the mathematical machinery that takes a "spin descriptor" (the skew-symmetric matrix $A$) and produces the actual rotation — the matrix $Q$ that describes exactly where every point ends up after the spin.

The remarkable property: for any spin descriptor — no matter how wild the inputs that generated it — the resulting rotation is always a proper rotation. It never stretches space. It never compresses space. It only rotates. The parameter $\beta$ controls how far to rotate, and the paper proves this works for every value of $\beta$, not just special ones.

Drag the $\beta$ slider in the simulation and watch: the rotation angle changes, but $\det(Q) = +1$ and $Q^\top Q = I$ — always.

Data-Dependent Cayley Transform (Definition 3.5)

$$Q(x) = \left(I + \frac{\beta(x)}{2} A(x)\right)^{-1}\left(I - \frac{\beta(x)}{2} A(x)\right)$$

Unconditional Orthogonality (Theorem 4.1)

$$Q(x)^\top Q(x) = I_n \quad \text{for all } u, v, \beta$$

The proof is four steps (Section 4). Setting $M = \frac{\beta}{2}A(x)$, we have $Q = (I+M)^{-1}(I-M)$. Since $M$ is skew-symmetric, $M^\top = -M$, so $Q^\top = (I+M)(I-M)^{-1}$. The factors $(I+M)$ and $(I-M)$ are polynomials in $M$, so they and their inverses commute. The rest is algebra: $Q^\top Q = (I+M)(I-M)^{-1}(I+M)^{-1}(I-M) = (I+M)(I+M)^{-1}(I-M)^{-1}(I-M) = I$.

Key corollaries that follow immediately:

Isometry (Theorem 4.3): $\|Q(x)y\|_2 = \|y\|_2$ for all $y$.
Proper rotation (Theorem 4.4): $\det(Q(x)) = +1$.
Non-singularity (Theorem 4.5): $I + \frac{\beta}{2}A$ is always invertible.
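
All three corollaries can be spot-checked in a few lines. This is a sketch with random $u$, $v$ and a handful of arbitrary $\beta$ values, using `numpy.linalg.solve` for the single matrix solve; the function name `cayley` is illustrative.

```python
import numpy as np

def cayley(A, beta):
    """Q = (I + beta/2 A)^{-1} (I - beta/2 A), computed with one linear solve."""
    I = np.eye(A.shape[0])
    M = 0.5 * beta * A
    return np.linalg.solve(I + M, I - M)

rng = np.random.default_rng(1)
n = 4
u, v = rng.normal(size=n), rng.normal(size=n)
A = np.outer(u, v) - np.outer(v, u)  # skew-symmetric generator
y = rng.normal(size=n)

results = []
for beta in (0.1, 1.0, 10.0, 1000.0):
    Q = cayley(A, beta)
    ortho_err = np.linalg.norm(Q.T @ Q - np.eye(n))
    det = np.linalg.det(Q)
    norm_ratio = np.linalg.norm(Q @ y) / np.linalg.norm(y)
    results.append((beta, ortho_err, det, norm_ratio))
    print(f"beta={beta:7.1f}  |Q^T Q - I|={ortho_err:.1e}  det={det:+.6f}  |Qy|/|y|={norm_ratio:.6f}")
```

In floating point the orthogonality error is not literally zero, but it stays at rounding level across four orders of magnitude in $\beta$.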

Why this matters

Unlike DDL, where orthogonality is conditional on $\beta \in \{0, 2\}$, the Cayley transform is orthogonal for every $\beta$ and every input. Unlike mHC's Sinkhorn projection, which is only approximately orthogonal after 20+ iterations, the Cayley transform is exactly orthogonal in a single matrix solve. This is not a tuning trick — it's a structural guarantee baked into the algebra.

Interactive: Cayley rotation explorer

Adjust $\beta$ and the generator angle. Watch the 2D rotation update live — and verify that $Q^\top Q = I$ and $\det(Q) = +1$ at every setting.

Controls: β, the rotation magnitude (default 1.0), and the generator angle between u and v (default 45°). Readouts: rotation angle θ, ‖QᵀQ − I‖, det(Q), and the eigenvalues.

Next: what the Cayley transform cannot do →

CHAPTER 4

The negation gap

The Cayley transform produces beautiful rotations — but it can never negate a signal. Eigenvalue $\lambda = -1$ is algebraically excluded. For tasks that require rapid sign reversal ("Actually, no — I meant the opposite"), this is a real limitation.

In plain English

Think of a record player. A rotation can spin the record through any angle you like. But a rotation can never flip the record over. That flip, turning the grooves face-down, requires a reflection, not a rotation. It takes the record into a different space entirely.

The Cayley transform is like a record player with a stop just short of the half-turn. It can rotate by any angle in $(-\pi, \pi)$, but it can never reach exactly $\pi$ (a half-turn, which would negate every vector in the rotation plane). The paper proves this rigorously: if $i\mu$ (with $\mu$ real) is an eigenvalue of $A$, the corresponding eigenvalue of $Q$ is $e^{-2i\arctan(\beta\mu/2)}$, and since $\arctan$ maps to $(-\pi/2, \pi/2)$, the argument can never reach $\pm\pi$. Negation is approached only in the limit $|\beta\mu| \to \infty$ and never attained.

The simulation below shows the eigenvalue spectrum of $Q$ as $\beta$ grows. Watch the eigenvalues circle toward $-1$ — but never touch it.

Eigenvalue Exclusion (Theorem 4.6)

$$\lambda_k = e^{-2i\arctan(\beta\mu_k/2)}, \quad \arctan: \mathbb{R} \to \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$$

$$\arg \lambda_k \in (-\pi, \pi) \implies \lambda_k = -1 = e^{i\pi} \text{ is impossible}$$
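
A worked two-dimensional case makes the bound concrete. This is a sketch under the assumption $A = aJ$ with $J = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ and the shorthand $t = \beta a / 2$; neither symbol is notation from the paper. In 2D every generator has this form, with $a = u_x v_y - u_y v_x$.

$$J^2 = -I \implies (I + tJ)(I - tJ) = (1 + t^2)I \implies (I + tJ)^{-1} = \frac{I - tJ}{1 + t^2}$$

$$Q = \frac{(I - tJ)^2}{1 + t^2} = \frac{1 - t^2}{1 + t^2}\,I - \frac{2t}{1 + t^2}\,J = \cos\theta\, I - \sin\theta\, J, \quad \theta = 2\arctan t$$

So $Q$ is a planar rotation by $\theta = 2\arctan(\beta a/2)$ with eigenvalues $e^{\pm i\theta}$. Because $|\theta| < \pi$ for every finite $\beta a$, we always have $\cos\theta > -1$, and $Q = -I$ is approached only as $|\beta a| \to \infty$.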

Householder Reflection (Definition 6.1) — has eigenvalue $-1$ at $\beta = 2$

$$H_2(k) = I - 2kk^\top, \quad H_2(k)k = -k$$

Why this matters

For a model that needs to correct itself — "Wait, I was wrong, flip the signal" — the ability to negate is essential. The Cayley transform cannot do this, no matter how you set its parameters. But the Householder reflection at $\beta = 2$ does exactly this: $H_2(k)k = -k$. The paper's solution: use both, connected by a learned gate.
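
The gap between the Cayley spectrum and $-1$ can be measured directly. The sketch below assumes a hand-picked generator with $\mu = 1$ and a few large $\beta$ values; for this case the distance has the closed form $2/\sqrt{1 + \beta^2/4}$, which is small but never zero.

```python
import numpy as np

def cayley(A, beta):
    """Q = (I + beta/2 A)^{-1} (I - beta/2 A)."""
    I = np.eye(A.shape[0])
    M = 0.5 * beta * A
    return np.linalg.solve(I + M, I - M)

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
A = np.outer(u, v) - np.outer(v, u)  # eigenvalues 0 and +/- i, so mu = 1

betas = (1.0, 10.0, 100.0, 1e4)
gaps = []
for beta in betas:
    eig = np.linalg.eigvals(cayley(A, beta))
    gap = np.abs(eig + 1.0).min()  # distance from the spectrum to -1
    gaps.append(gap)
    print(f"beta={beta:8.1f}  min |lambda + 1| = {gap:.3e}")

k = np.array([0.0, 0.0, 1.0])          # unit reflection direction
H2 = np.eye(3) - 2.0 * np.outer(k, k)  # Householder reflection at beta = 2
print("H2 @ k =", H2 @ k)              # k is mapped to -k
print("det(H2) =", np.linalg.det(H2))
```

Raising $\beta$ shrinks the gap but cannot close it, while the Householder branch lands on $-1$ exactly with no limit involved.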

Interactive: Eigenvalue spectrum — Cayley vs. Householder

Drag $\beta$ to see how Cayley eigenvalues approach but never reach $-1$, while the Householder eigenvalue jumps exactly to $-1$ when $\beta = 2$.

Controls: β (default 1.0). Readouts: Cayley angle (rad), distance to −1, and the Householder eigenvalue.

Next: combining rotation and reflection →

CHAPTER 5

The hybrid architecture

The EΔ-MHC-Geo Hybrid blends a Cayley rotation and a Householder reflection through a learned gate $\gamma$. At the gate boundaries, the operator is exactly orthogonal — accessing both connected components of $O(n)$.

In plain English

Imagine you're building a door with two hinges. One hinge can only swing the door forward (rotation). The other can flip it entirely (reflection). You install a smart switch that picks which hinge to activate — and the switch learns from experience which one works better for each situation.

That's the EΔ-MHC-Geo Hybrid. When $\gamma \to 1$, the door swings via the Cayley hinge — a smooth, angle-preserving rotation with $\det = +1$. When $\gamma \to 0$, it flips via the Householder hinge — a sharp reflection with $\det = -1$. The switch ($\gamma$) is a sigmoid that takes the input's statistics and decides: "Does this situation call for a rotation or a reflection?"

In the simulation below, drag the gate slider $\gamma$ between 0 and 1 and watch the operator morph — and see the norm deviation spike in the middle, where neither component dominates.

EΔ-MHC-Geo Hybrid Operator (Definition 6.5)

$$G_\gamma(X) = \gamma(X) \cdot \underbrace{Q(X)X}_{\text{Cayley rotation}} + (1 - \gamma(X)) \cdot \underbrace{H_2(k(X))X}_{\text{Householder reflection}}$$

Full Layer Transition (Equation 6)

$$X_{l+1} = G_\gamma(X_l) + H_{\text{post}}^\top F\left(H_{\text{pre}} \cdot \text{LN}(G_\gamma(X_l))\right)$$

The gate $\gamma(X) = \sigma(W_\gamma \cdot \bar{X} + b_\gamma) \in (0,1)$ is a learned sigmoid. The Cayley branch $Q(X) \in SO(n)$ provides rotation ($\det = +1$); the Householder branch $H_2(k(X))$ provides reflection ($\det = -1$). At the boundaries:

$\gamma \to 1$: pure Cayley rotation — $\det(G) = +1$, exactly orthogonal.
$\gamma \to 0$: pure Householder reflection — $\det(G) = -1$, exactly orthogonal.
$\gamma \in (0,1)$: a blend — not orthogonal (Theorem 7.1), but driven toward boundaries by the midpoint collapse regularizer.
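
The three regimes can be reproduced with a scalar gate. This is a simplified sketch: it treats $\gamma$ as a number you set by hand rather than a learned sigmoid of the input, and it uses random $u$, $v$, $k$; the function names are illustrative.

```python
import numpy as np

def cayley(A, beta):
    """Q = (I + beta/2 A)^{-1} (I - beta/2 A)."""
    I = np.eye(A.shape[0])
    M = 0.5 * beta * A
    return np.linalg.solve(I + M, I - M)

def hybrid(gamma, Q, H2):
    """Matrix form of the blend: gamma * Q + (1 - gamma) * H2."""
    return gamma * Q + (1.0 - gamma) * H2

rng = np.random.default_rng(2)
n = 4
u, v, k = rng.normal(size=(3, n))
k /= np.linalg.norm(k)  # unit reflection direction
Q = cayley(np.outer(u, v) - np.outer(v, u), beta=1.0)
H2 = np.eye(n) - 2.0 * np.outer(k, k)

report = {}
for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    G = hybrid(gamma, Q, H2)
    err = np.linalg.norm(G.T @ G - np.eye(n))
    det = np.linalg.det(G)
    report[gamma] = (det, err)
    print(f"gamma={gamma:4.2f}  det={det:+.4f}  |G^T G - I|={err:.3e}")
```

The orthogonality error scales with $\gamma(1-\gamma)$, so it is zero at both ends and largest at the midpoint, which is what motivates the regularizer in the next chapter.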

Why this matters

The orthogonal group $O(n)$ has two disconnected components: $SO(n)$ (rotations, $\det = +1$) and $O(n) \setminus SO(n)$ (reflections, $\det = -1$). There is no continuous path between them that stays on the manifold. The hybrid gate is a learned decision about which component to use — the model learns to jump between them based on the task.

Interactive: Gate behavior — blending rotation and reflection

Drag the gate $\gamma$ slider. At $\gamma = 1$, the operator is pure Cayley rotation. At $\gamma = 0$, pure Householder reflection. In between, a non-orthogonal blend.

Controls: gate γ (default 1.0) and reflection axis angle (default 0°). Readouts: det(G_γ), ‖GᵀG − I‖, norm ratio, and the active operator.

Next: forcing the gate to commit →

CHAPTER 6

Jump, don't swim

The hybrid operator is only orthogonal at $\gamma = 0$ and $\gamma = 1$. The midpoint collapse regularizer $L_{\text{gate}} = 4\gamma(1-\gamma)$ pushes the gate toward these boundaries — with a surprising zero-gradient trap at $\gamma = 0.5$.

In plain English

Imagine you're standing on a narrow ridge between two valleys. The ridge represents $\gamma = 0.5$ — the worst possible place, where the operator is least orthogonal. The regularizer is like gravity: it pulls you down into one valley or the other. But there's a catch: at the exact center of the ridge ($\gamma = 0.5$), gravity is perfectly balanced — the gradient is zero. You're stuck unless something else pushes you off.

That "something else" is the task loss. The task itself creates an asymmetry that tips the balance, nudging $\gamma$ away from 0.5. Once it moves even slightly, the regularizer's gradient takes over and accelerates the collapse to the nearest boundary.

In the simulation, adjust the task gradient and watch how $\gamma$ evolves over training steps. With zero task gradient, it stays stuck at 0.5. With even a tiny push, it collapses to a boundary.

Midpoint Collapse Regularization (Definition 7.2)

$$L_{\text{gate}} = \lambda_{\text{gate}} \cdot 4\gamma(1 - \gamma)$$

Universal Zero-Gradient at Midpoint (Theorem 7.3)

$$\text{Any smooth, symmetric } f: [0,1] \to \mathbb{R} \text{ with } f(\gamma) = f(1-\gamma) \text{ has } f'(0.5) = 0$$
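
The statement follows from one application of the chain rule. As a short sketch, differentiate the symmetry condition and evaluate at the midpoint:

$$f(\gamma) = f(1-\gamma) \implies f'(\gamma) = -f'(1-\gamma) \implies f'(0.5) = -f'(0.5) \implies f'(0.5) = 0$$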

The total loss is $L_{\text{total}} = L_{\text{task}} + \sum_{\text{layers}} L_{\text{gate}}$. The regularization gradient $\partial L_{\text{gate}} / \partial \gamma = 4\lambda_{\text{gate}}(1 - 2\gamma)$ is positive for $\gamma < 0.5$ and negative for $\gamma > 0.5$, so gradient descent pushes $\gamma$ toward the nearer boundary. But at $\gamma = 0.5$ exactly, this gradient vanishes, and any smooth, symmetric regularizer suffers the same fate (Theorem 7.3).

Three escape mechanisms break the symmetry: (1) task loss gradient $L'_{\text{task}} \neq 0$, (2) input variation causing $\gamma$ to fluctuate across samples, and (3) biased initialization $b_\gamma \neq 0$.
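
The first escape mechanism is easy to simulate. The sketch below makes several simplifying assumptions that are not from the paper: $\gamma$ is updated directly as a free scalar (rather than through a sigmoid), the task gradient is a constant, the learning rate and step count are arbitrary, and $\gamma$ is clipped to $[0, 1]$.

```python
import numpy as np

def simulate_gate(gamma0, lam, task_grad, lr=0.1, steps=500):
    """Gradient descent on gamma for lam * 4 * gamma * (1 - gamma) plus a constant task gradient."""
    gamma = gamma0
    for _ in range(steps):
        reg_grad = lam * 4.0 * (1.0 - 2.0 * gamma)  # d/dgamma of the regularizer
        gamma = gamma - lr * (reg_grad + task_grad)
        gamma = float(np.clip(gamma, 0.0, 1.0))     # keep gamma in [0, 1]
    return gamma

stuck = simulate_gate(0.5, lam=0.1, task_grad=0.0)     # no push: stays at the midpoint
to_zero = simulate_gate(0.5, lam=0.1, task_grad=0.05)  # positive gradient lowers gamma
to_one = simulate_gate(0.5, lam=0.1, task_grad=-0.05)  # negative gradient raises gamma

print("no task gradient:       ", stuck)
print("task gradient = +0.05:  ", to_zero)
print("task gradient = -0.05:  ", to_one)
```

With a sigmoid-parametrized gate, $\gamma$ would approach the boundary asymptotically instead of hitting the clip, but the qualitative picture is the same: the midpoint is an unstable equilibrium, and the sign of the task gradient decides which side wins.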

Why this matters

The ablation (Table 9) shows that without regularization ($\lambda = 0$), the gate lingers in $\gamma \in [0.3, 0.7]$ and performance degrades by 44%. At $\lambda \geq 0.1$, binary polarization occurs and performance saturates. The regularizer isn't optional — it's the mechanism that makes the hybrid architecture work.

Interactive: Regularization dynamics

Adjust the regularization weight $\lambda$ and the task loss gradient. Simulate how $\gamma$ evolves over training steps.

Controls: λ_gate, the regularization weight (default 0.1); task gradient, the escape force (default 0.05); and initial γ (default 0.50).

Next: the experimental evidence →

CHAPTER 7

The experiments speak

Four benchmarks, five architectures, matched parameters at ~1.79M each. The results confirm the algebraic predictions: EΔ-MHC-Geo dominates on stability and near-π rotation, while the hybrid gate learns to select the correct operator automatically.

In plain English

The authors tested five transformer architectures on carefully designed tasks that isolate specific geometric properties: standard GPT, DDL, mHC, JPmHC (a concurrent work), and their own EΔ-MHC-Geo. It's like running five cars through four different obstacle courses, each designed to test a different capability.

The gyroscope course tests: "Can you track a rotation accurately over 255 steps?" JPmHC wins here, benefiting from a wider representation. The stability course tests: "Can you preserve signal magnitude over 127 steps?" EΔ-MHC-Geo wins by 1.9× over JPmHC. The negation course tests: "Can you flip a signal?" Only EΔ-MHC-Geo succeeds — the others literally can't do it.

The headline: EΔ-MHC-Geo achieves all this with just 6 layers — 33% fewer than baselines. Geometric structure substitutes for raw depth.

All models were configured for fair comparison with matched parameter counts (~1.79M). Training used AdamW with cosine decay, 2000 iterations, and results are averaged over 3 random seeds.

Interactive: Benchmark comparison dashboard

Hover over bars for exact values. The charts show the paper's main experimental results from Table 6 and Table 12.

Why this matters

On the reflection task, EΔ-MHC-Geo's gate converges to $\gamma = 0.051 \pm 0.005$, which is within 0.051 of the theoretical target $\gamma = 0$ (pure Householder), or 5.1% of the gate's $[0, 1]$ range. Meanwhile, DDL independently discovers $\beta = 1.995 \pm 0.001$ via gradient descent, confirming the Householder theory. JPmHC, which has no reflection branch, remains stuck with negative cosine alignment at all sample sizes. The model learns the correct operator; it isn't told which to use.

Next: what it all means →

CHAPTER 8

What it all means

The EΔ-MHC-Geo Transformer makes a precise claim: the Cayley transform's orthogonality guarantee is algebraic, not parametric. Making the generator data-dependent preserves every guarantee. The experiments confirm it — but also reveal where the current approach hits its limits.

In plain English

Imagine discovering that a safety feature in a car — say, anti-lock brakes — doesn't depend on how hard you press the pedal, what kind of road you're on, or how fast you're going. It works unconditionally. That's what this paper proves about the Cayley transform's orthogonality: it doesn't depend on the input, the training stage, or the value of $\beta$. It's baked into the algebra.

The practical implication: you can build residual connections that are provably norm-preserving without any iterative projection, soft penalty, or special-case handling. The model gets to choose which direction to rotate in (data-dependent), but the rotation itself is always orthogonal by construction.

The limitation: at 1.79M parameters on synthetic benchmarks, this is a proof of concept. Scaling to billion-parameter language models is the open question. And the hybrid gate's linear blending between rotation and reflection is theoretically imperfect — a future geodesic interpolation on $O(n)$ could replace it.

Theoretical contributions. The paper establishes that the Data-Dependent Cayley transform is unconditionally orthogonal (Theorem 4.1), isometric (Theorem 4.3), and a proper rotation (Theorem 4.4) for all inputs and all $\beta$. It proves that eigenvalue $\lambda = -1$ is excluded (Theorem 4.6), motivating the Householder branch. And it identifies a universal zero-gradient trap at $\gamma = 0.5$ (Theorem 7.3) for any smooth, symmetric regularizer.

Empirical highlights. At matched ~1.79M parameters with 3 seeds: best long-horizon stability (3.8× over GPT, 1.9× over JPmHC), norm deviation of just 0.001 (474× better than GPT), best near-π loss (4.5× over JPmHC on single-plane), 0.96 cosine alignment on the negation diagnostic — all with 33% fewer layers.

Limitations. Experiments are on synthetic benchmarks at ~1.79M parameters. The $O(n^3)$ Cayley matrix solve is negligible at $n = 4$ streams but could bottleneck at larger $n$. Three seeds per configuration provide narrow confidence intervals but are a modest sample. And the linear blending at intermediate $\gamma$ is non-orthogonal during early training.

Future directions. Scaling to large language models, geodesic interpolation on $O(n)$ to replace convex blending, and extending to unitary groups for complex-valued architectures.

The bottom line

The deepest message isn't about any single number. It's that the structure of inter-layer transformations matters as much as their quantity. EΔ-MHC-Geo achieves competitive or superior results with 33% fewer layers, suggesting that when residual connections respect the geometry of the signal space, networks can be shallower without being weaker.

The paper also makes a careful claim: it does not claim uniform superiority over JPmHC. JPmHC's wider representation and full-rank mixer excel on pure rotation. The contribution is the hybrid — the only evaluated architecture that can handle both rotation and reflection, with exact orthogonality at each component.

Built from arXiv:2605.06729 · Shahmansoori (2026)

Code: github.com/arash-shahmansoori/edelta
