Most modern deep networks pass information from one layer to the next through residual connections: a simple shortcut that adds a layer's input to its output. This works, but it carries a hidden cost: nothing stops the signal's magnitude from drifting as it flows through dozens of layers. Norms swell or collapse, gradients vanish or explode, and training becomes fragile. The field has been patching this problem with ad hoc fixes for a decade.
This paper takes a geometric approach. Instead of an additive shortcut, it replaces the identity with a rotation — specifically, a Data-Dependent Cayley transform that takes two vectors computed from the input, builds a skew-symmetric matrix, and converts it into an orthogonal rotation. The key insight is elegant: skew-symmetry is a property of the algebraic form $uv^\top - vu^\top$, not of the specific values of $u$ and $v$. So making $u(x)$ and $v(x)$ input-dependent preserves every algebraic guarantee — orthogonality, isometry, determinant $+1$ — for every input, every $\beta$, at every training step.
The result: 3.8× better stability than standard GPT on long-horizon tasks, norm deviation of just 0.001 (vs. 0.474 for GPT), and the only architecture in the comparison that can both rotate and reflect — achieving 0.96 cosine alignment on a negation task that breaks pure rotation methods. All with 33% fewer layers.
Residual connections let us train deep networks by adding a shortcut past each layer. But this shortcut — the identity — provides no geometric guarantees. Norms drift, gradients vanish, and the deeper the network, the worse it gets.
The standard residual connection computes $X_{l+1} = X_l + F(X_l)$, where $X_l$ is the layer's input and $F$ is the layer's transformation. The identity shortcut $X_l$ is the "skip" — it lets gradients flow backward unimpeded.
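The problem: nothing constrains $\|X_{l+1}\|_2$. Expanding the square gives $\|X_{l+1}\|_2^2 = \|X_l\|_2^2 + 2\langle X_l, F(X_l)\rangle + \|F(X_l)\|_2^2$, so norms grow whenever $F(X_l)$ is aligned with (or merely orthogonal to) $X_l$, and shrink only when $F(X_l)$ is sufficiently anti-aligned. Over dozens of layers, this drift compounds.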
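A minimal NumPy sketch of this drift. The toy layer $F(X) = \tanh(WX)$, the width, the depth, and the weight scale are all assumptions chosen for illustration and are not taken from the paper. The second function applies a random orthogonal matrix at each layer instead, to show what a norm-preserving shortcut looks like by comparison.

```python
import numpy as np

def residual_norms(depth=48, dim=64, scale=0.5, seed=0):
    """Track ||X_l||_2 through a stack of toy residual layers X + F(X)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    x /= np.linalg.norm(x)                      # start at unit norm
    norms = [np.linalg.norm(x)]
    for _ in range(depth):
        # Hypothetical layer F: random linear map followed by tanh.
        w = rng.standard_normal((dim, dim)) * scale / np.sqrt(dim)
        x = x + np.tanh(w @ x)                  # standard residual update
        norms.append(np.linalg.norm(x))
    return np.array(norms)

def orthogonal_norms(depth=48, dim=64, seed=0):
    """Track ||X_l||_2 when each layer applies an orthogonal matrix instead."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    x /= np.linalg.norm(x)
    norms = [np.linalg.norm(x)]
    for _ in range(depth):
        # QR of a random matrix yields an orthogonal factor q.
        q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        x = q @ x
        norms.append(np.linalg.norm(x))
    return np.array(norms)

if __name__ == "__main__":
    print("residual   final norm:", residual_norms()[-1])
    print("orthogonal final norm:", orthogonal_norms()[-1])
```

In this toy setting the residual stack's norm grows steadily with depth, while the orthogonal stack stays at 1 up to floating-point error.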
Drag the sliders to see how signal magnitude changes across network depth under three residual connection strategies.
Every guarantee in this paper traces back to one fact: the matrix $A = uv^\top - vu^\top$ is always skew-symmetric, no matter how $u$ and $v$ are computed. That algebraic form is the load-bearing wall.
The paper defines two "generator networks" that compute the vectors $u(x)$ and $v(x)$ from the input's mean-pooled representation.
Move the $u$ and $v$ vectors in 2D. The resulting skew-symmetric matrix $A = uv^\top - vu^\top$ always satisfies $A^\top = -A$.
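The skew-symmetry claim is easy to check numerically. The sketch below builds $A = uv^\top - vu^\top$ from arbitrary vectors; the function name `skew_from_generators` is a hypothetical helper, and random vectors stand in for the outputs of the generator networks.

```python
import numpy as np

def skew_from_generators(u, v):
    """Return A = u v^T - v u^T, which is skew-symmetric for any real u, v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.outer(u, v) - np.outer(v, u)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for _ in range(5):
        # Random vectors stand in for generator outputs u(x), v(x).
        u, v = rng.standard_normal(4), rng.standard_normal(4)
        A = skew_from_generators(u, v)
        assert np.allclose(A.T, -A)          # A^T = -A
        assert np.allclose(np.diag(A), 0.0)  # the diagonal is always zero
    print(skew_from_generators([1.0, 0.0], [0.0, 1.0]))
```

Transposing swaps the two outer products, which flips the sign of the difference. That is why the property depends only on the form and not on how $u$ and $v$ were produced.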
The Cayley transform converts a skew-symmetric matrix into an orthogonal rotation matrix. The paper makes this transform data-dependent — and proves that orthogonality holds unconditionally.
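The proof is four steps (Section 4). Setting $M = \frac{\beta}{2}A(x)$, we have $Q = (I+M)^{-1}(I-M)$. Since $M$ is skew-symmetric, its eigenvalues are purely imaginary, so $I+M$ is always invertible, and $(I+M)^\top = I-M$. Because $(I+M)$ and $(I-M)$ are both polynomials in $M$, they commute with each other and with each other's inverses. The rest is algebra: $Q^\top Q = (I+M)(I-M)^{-1}(I+M)^{-1}(I-M) = I$.
Key corollaries that follow immediately: $Q$ preserves norms, $\|Qx\|_2 = \|x\|_2$ (isometry, Theorem 4.3), and $\det(Q) = +1$ (proper rotation, Theorem 4.4).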
Adjust $\beta$ and the generator angle. Watch the 2D rotation update live — and verify that $Q^\top Q = I$ and $\det(Q) = +1$ at every setting.
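The same checks can be run offline. This is a sketch under stated assumptions: the linear generators `W_u @ x` and `W_v @ x` are hypothetical stand-ins for the paper's generator networks, and `cayley_from_generators` is an illustrative name. The transform itself follows the formula $Q = (I+M)^{-1}(I-M)$ with $M = \frac{\beta}{2}(uv^\top - vu^\top)$.

```python
import numpy as np

def cayley_from_generators(u, v, beta):
    """Q = (I + M)^{-1} (I - M) with M = (beta / 2) (u v^T - v u^T)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    M = 0.5 * beta * (np.outer(u, v) - np.outer(v, u))
    eye = np.eye(u.shape[0])
    # Solve (I + M) Q = (I - M) instead of forming the inverse explicitly.
    return np.linalg.solve(eye + M, eye - M)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical linear generators standing in for u(x) and v(x).
    W_u = rng.standard_normal((4, 4))
    W_v = rng.standard_normal((4, 4))
    for beta in (0.1, 1.0, 10.0):
        x = rng.standard_normal(4)
        Q = cayley_from_generators(W_u @ x, W_v @ x, beta)
        assert np.allclose(Q.T @ Q, np.eye(4))                        # orthogonal
        assert np.isclose(np.linalg.det(Q), 1.0)                      # proper rotation
        assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))   # isometry
    print("all checks passed")
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; whether the paper's implementation does the same is not stated here.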
The Cayley transform produces beautiful rotations — but it can never negate a signal. Eigenvalue $\lambda = -1$ is algebraically excluded. For tasks that require rapid sign reversal ("Actually, no — I meant the opposite"), this is a real limitation.
Drag $\beta$ to see how Cayley eigenvalues approach but never reach $-1$, while the Householder eigenvalue jumps exactly to $-1$ when $\beta = 2$.
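A numerical version of the same comparison. An eigenvalue $i\mu$ of $M$ maps to $(1 - i\mu)/(1 + i\mu)$ under the Cayley transform, which lies on the unit circle but equals $-1$ only in the limit $\mu \to \infty$. The sketch assumes the Householder branch has the form $H_\beta(k) = I - \beta\,kk^\top$ with unit $k$, which is consistent with the eigenvalue reaching $-1$ at $\beta = 2$ but is not spelled out in the text above. The function names are hypothetical.

```python
import numpy as np

def cayley_eigenvalues(u, v, beta):
    """Eigenvalues of Q = (I + M)^{-1} (I - M), M = (beta / 2)(u v^T - v u^T)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    M = 0.5 * beta * (np.outer(u, v) - np.outer(v, u))
    eye = np.eye(len(u))
    return np.linalg.eigvals(np.linalg.solve(eye + M, eye - M))

def householder(k, beta):
    """Assumed form H = I - beta * k k^T with k normalized to unit length."""
    k = np.asarray(k, dtype=float)
    k = k / np.linalg.norm(k)
    return np.eye(len(k)) - beta * np.outer(k, k)

if __name__ == "__main__":
    u, v = [1.0, 0.0], [0.0, 1.0]
    for beta in (0.5, 2.0, 20.0, 2000.0):
        eigs = cayley_eigenvalues(u, v, beta)
        angle = np.max(np.abs(np.angle(eigs)))   # rotation angle in radians
        gap = np.min(np.abs(eigs + 1.0))         # distance from eigenvalue -1
        print(f"beta={beta:8.1f}  angle={angle:.4f}  distance to -1={gap:.5f}")
    # The Householder branch reaches -1 exactly at beta = 2.
    print(np.linalg.eigvalsh(householder([1.0, 1.0], 2.0)))
```

The rotation angle approaches $\pi$ as $\beta$ grows but the distance to $-1$ stays strictly positive, whereas the reflection has an eigenvalue of exactly $-1$.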
The EΔ-MHC-Geo Hybrid blends a Cayley rotation and a Householder reflection through a learned gate $\gamma$. At the gate boundaries, the operator is exactly orthogonal — accessing both connected components of $O(n)$.
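The gate $\gamma(X) = \sigma(W_\gamma \cdot \bar{X} + b_\gamma) \in (0,1)$ is a learned sigmoid. The Cayley branch $Q(X) \in SO(n)$ provides rotation ($\det = +1$); the Householder branch $H_2(k(X))$ provides reflection ($\det = -1$). At the boundaries, which the gate approaches as the sigmoid saturates, the blend reduces to a single branch: $\gamma = 1$ gives the pure Cayley rotation and $\gamma = 0$ gives the pure Householder reflection, both exactly orthogonal.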
Drag the gate $\gamma$ slider. At $\gamma = 1$, the operator is pure Cayley rotation. At $\gamma = 0$, pure Householder reflection. In between, a non-orthogonal blend.
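The effect of the gate can be sketched numerically. This example assumes the blend is the convex combination $\gamma Q + (1-\gamma)H$, which matches the description of $\gamma = 1$ as pure Cayley and $\gamma = 0$ as pure Householder, and it uses random vectors in place of the learned generators. All function names are hypothetical.

```python
import numpy as np

def cayley(u, v, beta):
    """Cayley rotation Q = (I + M)^{-1} (I - M), M = (beta / 2)(u v^T - v u^T)."""
    M = 0.5 * beta * (np.outer(u, v) - np.outer(v, u))
    eye = np.eye(len(u))
    return np.linalg.solve(eye + M, eye - M)

def householder(k):
    """Householder reflection I - 2 k k^T with k normalized to unit length."""
    k = k / np.linalg.norm(k)
    return np.eye(len(k)) - 2.0 * np.outer(k, k)

def hybrid(gamma, Q, H):
    """Assumed convex blend: gamma = 1 is pure rotation, gamma = 0 pure reflection."""
    return gamma * Q + (1.0 - gamma) * H

def orthogonality_gap(T):
    """Frobenius norm of T^T T - I; zero exactly when T is orthogonal."""
    return float(np.linalg.norm(T.T @ T - np.eye(T.shape[0])))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u, v, k = rng.standard_normal((3, 4))   # stand-ins for generator outputs
    Q, H = cayley(u, v, beta=1.0), householder(k)
    for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
        T = hybrid(gamma, Q, H)
        print(f"gamma={gamma:.2f}  gap={orthogonality_gap(T):.3f}  "
              f"det={np.linalg.det(T):+.3f}")
```

Because the determinant is continuous in $\gamma$ and moves from $-1$ to $+1$, it must cross zero somewhere in between, so some intermediate blend is not even invertible.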
The hybrid operator is only orthogonal at $\gamma = 0$ and $\gamma = 1$. The midpoint collapse regularizer $L_{\text{gate}} = 4\gamma(1-\gamma)$ pushes the gate toward these boundaries — with a surprising zero-gradient trap at $\gamma = 0.5$.
The total loss is $L_{\text{total}} = L_{\text{task}} + \sum_{\text{layers}} L_{\text{gate}}$. The regularization gradient $\partial L_{\text{gate}} / \partial \gamma = 4(1 - 2\gamma)$ is positive below $\gamma = 0.5$ and negative above it, so gradient descent pushes $\gamma$ toward the nearest boundary. But at $\gamma = 0.5$ exactly, this gradient vanishes, and any smooth, symmetric regularizer suffers the same fate (Theorem 7.3).
Three escape mechanisms break the symmetry: (1) task loss gradient $L'_{\text{task}} \neq 0$, (2) input variation causing $\gamma$ to fluctuate across samples, and (3) biased initialization $b_\gamma \neq 0$.
Adjust the regularization weight $\lambda$ and the task loss gradient. Simulate how $\gamma$ evolves over training steps.
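A small simulation of these dynamics for a single scalar gate. It is a sketch under several assumptions: the gate is reduced to one logit $z$ with $\gamma = \sigma(z)$, the task loss is modeled as linear in $\gamma$ with a constant slope `task_grad`, and `lam` and `lr` are hypothetical knobs for the regularization weight and learning rate. None of these values come from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_gate(b0=0.0, task_grad=0.0, lam=1.0, lr=0.5, steps=200):
    """Gradient descent on the gate logit z, where gamma = sigmoid(z).

    Assumed loss: task_grad * gamma + lam * 4 * gamma * (1 - gamma).
    Returns the final value of gamma.
    """
    z = b0                                    # biased init corresponds to b0 != 0
    for _ in range(steps):
        gamma = sigmoid(z)
        # d/dgamma of the assumed loss: task slope plus 4 * (1 - 2 * gamma).
        dloss_dgamma = task_grad + lam * 4.0 * (1.0 - 2.0 * gamma)
        # Chain rule through the sigmoid: dgamma/dz = gamma * (1 - gamma).
        z -= lr * dloss_dgamma * gamma * (1.0 - gamma)
    return sigmoid(z)

if __name__ == "__main__":
    print("unbiased init, no task gradient:", simulate_gate())
    print("biased init b0 = +0.1:          ", simulate_gate(b0=0.1))
    print("biased init b0 = -0.1:          ", simulate_gate(b0=-0.1))
    print("task gradient = -0.05:          ", simulate_gate(task_grad=-0.05))
```

With a zero logit and no task gradient the gate sits at 0.5 indefinitely. A small bias or a nonzero task gradient is enough to break the symmetry and send it toward a boundary.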
Four benchmarks, five architectures, matched parameters at ~1.79M each. The results confirm the algebraic predictions: EΔ-MHC-Geo dominates on stability and near-π rotation, while the hybrid gate learns to select the correct operator automatically.
All models were configured for fair comparison with matched parameter counts (~1.79M). Training used AdamW with cosine decay, 2000 iterations, and results are averaged over 3 random seeds.
The charts show the paper's main experimental results from Table 6 and Table 12.
The EΔ-MHC-Geo Transformer makes a precise claim: the Cayley transform's orthogonality guarantee is algebraic, not parametric. Making the generator data-dependent preserves every guarantee. The experiments confirm it — but also reveal where the current approach hits its limits.
Theoretical contributions. The paper establishes that the Data-Dependent Cayley transform is unconditionally orthogonal (Theorem 4.1), isometric (Theorem 4.3), and a proper rotation (Theorem 4.4) for all inputs and all $\beta$. It proves that eigenvalue $\lambda = -1$ is excluded (Theorem 4.6), motivating the Householder branch. And it identifies a universal zero-gradient trap at $\gamma = 0.5$ (Theorem 7.3) for any smooth, symmetric regularizer.
Empirical highlights. At matched ~1.79M parameters with 3 seeds: best long-horizon stability (3.8× over GPT, 1.9× over JPmHC), norm deviation of just 0.001 (474× better than GPT), best near-π loss (4.5× over JPmHC on single-plane), 0.96 cosine alignment on the negation diagnostic — all with 33% fewer layers.
Limitations. Experiments are on synthetic benchmarks at ~1.79M parameters. The $O(n^3)$ Cayley matrix solve is negligible at $n = 4$ streams but could become a bottleneck at larger $n$. Three seeds per configuration yield narrow confidence intervals, but three is still a small sample. And the linear blending at intermediate $\gamma$ is non-orthogonal during early training.
Future directions. Scaling to large language models, geodesic interpolation on $O(n)$ to replace convex blending, and extending to unitary groups for complex-valued architectures.
Built from arXiv:2605.06729 · Shahmansoori (2026)