Theoretical Foundations
This section explains the mathematical foundations of Sparse Selective Hyper-Connections (SHC).
The Stability Problem
Standard Residual Connections
The standard residual connection for layer \(l\) is:
\[
\mathbf{x}_{l+1} = \mathbf{x}_l + f_l(\mathbf{x}_l)
\]
This identity mapping ensures signal conservation: \(\mathbf{x}_L = \mathbf{x}_0 + \sum_{l=0}^{L-1} f_l(\mathbf{x}_l)\).
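The unrolled identity can be checked numerically. The following is a minimal sketch, assuming the usual update \(\mathbf{x}_{l+1} = \mathbf{x}_l + f_l(\mathbf{x}_l)\); the branches `f` are hypothetical stand-ins (small random linear maps followed by `tanh`), not part of SHC.

```python
import torch

torch.manual_seed(0)
d, L = 8, 5

# Hypothetical residual branches f_l: small random linear maps followed by tanh.
weights = [0.1 * torch.randn(d, d) for _ in range(L)]

def f(l: int, x: torch.Tensor) -> torch.Tensor:
    return torch.tanh(x @ weights[l].T)

x0 = torch.randn(d)
x = x0.clone()
branch_sum = torch.zeros(d)
for l in range(L):
    out = f(l, x)          # f_l(x_l), evaluated at the current state
    branch_sum += out
    x = x + out            # x_{l+1} = x_l + f_l(x_l)

# Unrolled form: x_L = x_0 + sum_l f_l(x_l)
assert torch.allclose(x, x0 + branch_sum, atol=1e-6)
```

The input \(\mathbf{x}_0\) reaches the output unchanged, and every layer contributes only an additive term.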
Hyper-Connections
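Hyper-Connections generalize residuals by expanding to \(n\) parallel streams with learnable mixing:
\[
\bar{\mathbf{x}}_{l+1} = \mathbf{H}^{\text{res}} \bar{\mathbf{x}}_l + \mathbf{H}^{\text{post}} f_l\!\left(\mathbf{H}^{\text{pre}} \bar{\mathbf{x}}_l\right)
\]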
where:
\(\bar{\mathbf{x}}_l \in \mathbb{R}^{n \times d}\) is the multi-stream hidden state
\(\mathbf{H}^{\text{res}}, \mathbf{H}^{\text{pre}}, \mathbf{H}^{\text{post}} \in \mathbb{R}^{n \times n}\) are learnable mixing matrices
The Problem: Spectral Norm Explosion
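Without constraints, spectral norms compound across layers. In general the end-to-end gain is only bounded by the product of the per-layer norms, which can grow exponentially with depth once those norms exceed 1:
\[
\rho\!\left(\prod_{l=0}^{L-1} \mathbf{H}^{\text{res}}_l\right) \le \prod_{l=0}^{L-1} \rho\!\left(\mathbf{H}^{\text{res}}_l\right)
\]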
Unconstrained matrices yield gain magnitudes of ~3000 at 60 layers, causing gradient explosion and training collapse.
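To put the figure in perspective, a total gain of ~3000 over 60 layers corresponds to a per-layer gain of only about 1.14. The sketch below computes that number and then multiplies 60 hypothetical unconstrained mixing matrices (identity plus small Gaussian noise, an assumption made for illustration rather than the setup behind the reported figure) to show that the product's norm stays below the product of per-layer norms.

```python
import torch

# Per-layer gain g that compounds to ~3000 over 60 layers: g**60 = 3000.
per_layer_gain = 3000 ** (1 / 60)
print(f"per-layer gain: {per_layer_gain:.3f}")

# Illustration with hypothetical unconstrained mixing matrices (n = 4).
torch.manual_seed(0)
n, L = 4, 60
prod = torch.eye(n, dtype=torch.float64)
bound = 1.0
for _ in range(L):
    H = torch.eye(n, dtype=torch.float64) + 0.1 * torch.randn(n, n, dtype=torch.float64)
    prod = H @ prod
    bound *= torch.linalg.matrix_norm(H, ord=2).item()

gain = torch.linalg.matrix_norm(prod, ord=2).item()
print(f"end-to-end gain: {gain:.2f}, product of per-layer norms: {bound:.2f}")

# Submultiplicativity: the product's norm never exceeds the product of norms.
assert gain <= bound * (1 + 1e-9)
```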
The Cayley Transform
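The Cayley transform provides a closed-form map from skew-symmetric matrices to orthogonal matrices (a bijection onto the orthogonal matrices that do not have \(-1\) as an eigenvalue):
\[
\mathbf{Q} = (\mathbf{I} + \mathbf{A})^{-1}(\mathbf{I} - \mathbf{A})
\]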
where \(\mathbf{A} = -\mathbf{A}^\top\) is a skew-symmetric matrix.
Properties
For any \(\mathbf{Q}\) generated by the Cayley transform:
Orthogonality: \(\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}\)
Unit Spectral Norm: \(\rho(\mathbf{Q}) = \|\mathbf{Q}\|_2 = 1\) (exactly), where \(\rho\) denotes the spectral norm, i.e. the largest singular value
Norm Preservation: \(\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2\)
Parameterization
A skew-symmetric matrix \(\mathbf{A}\) requires only \(\frac{n(n-1)}{2}\) free parameters (the upper triangle), making it efficient to learn.
```python
import torch
from shc.layers import CayleyTransform

# 4x4 orthogonal matrix from 6 parameters
cayley = CayleyTransform(n=4)  # 4*(4-1)/2 = 6 params
Q = cayley()

# Verify orthogonality and unit spectral norm
assert torch.allclose(Q.T @ Q, torch.eye(4), atol=1e-5)
assert torch.allclose(cayley.get_spectral_norm(), torch.tensor(1.0), atol=1e-4)
```
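For readers without the `shc` package at hand, the construction can be sketched from scratch in plain PyTorch. The helper `cayley_from_params` below is hypothetical and is not the library's implementation. It assumes the convention \(\mathbf{Q} = (\mathbf{I} + \mathbf{A})^{-1}(\mathbf{I} - \mathbf{A})\) and fills \(\mathbf{A}\) from the \(\frac{n(n-1)}{2}\) upper-triangle parameters.

```python
import torch

def cayley_from_params(theta: torch.Tensor, n: int) -> torch.Tensor:
    """Map n(n-1)/2 free parameters to an n x n orthogonal matrix."""
    assert theta.numel() == n * (n - 1) // 2
    # Fill the strict upper triangle, then antisymmetrize: A = U - U^T.
    upper = torch.zeros(n, n, dtype=theta.dtype)
    rows, cols = torch.triu_indices(n, n, offset=1)
    upper[rows, cols] = theta
    A = upper - upper.T
    I = torch.eye(n, dtype=theta.dtype)
    # Q = (I + A)^{-1} (I - A), via a linear solve rather than an explicit inverse.
    # I + A is always invertible because A has purely imaginary eigenvalues.
    return torch.linalg.solve(I + A, I - A)

torch.manual_seed(0)
theta = torch.randn(6, dtype=torch.float64)   # 4*(4-1)/2 = 6 parameters
Q = cayley_from_params(theta, n=4)

assert torch.allclose(Q.T @ Q, torch.eye(4, dtype=torch.float64), atol=1e-10)
```

Using `torch.linalg.solve` avoids forming the inverse explicitly, which is usually the more numerically robust choice.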
Sparse Orthogonal Mixtures
SHC parameterizes \(\mathbf{H}^{\text{res}}\) as a sparse mixture of \(k\) orthogonal matrices:
\[
\mathbf{H}^{\text{res}}(\mathbf{x}) = \sum_{i=1}^{k} \alpha_i(\mathbf{x}) \, \mathbf{Q}_i
\]
where:
\(\boldsymbol{\alpha}(\mathbf{x}) = \text{softmax}(\mathbf{W}_\alpha \mathbf{x}) \in \Delta^{k-1}\) are input-dependent mixing weights
Each \(\mathbf{Q}_i\) is an orthogonal matrix from the Cayley transform
Bounded Spectral Norm (Proposition 1)
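For any convex combination of orthogonal matrices (\(\alpha_i \ge 0\), \(\sum_i \alpha_i = 1\)):
\[
\rho\!\left(\sum_{i=1}^{k} \alpha_i \mathbf{Q}_i\right) \le 1
\]
Proof: By the triangle inequality and orthogonality (\(\rho(\mathbf{Q}_i) = 1\)):
\[
\rho\!\left(\sum_{i=1}^{k} \alpha_i \mathbf{Q}_i\right) \le \sum_{i=1}^{k} \alpha_i \, \rho(\mathbf{Q}_i) = \sum_{i=1}^{k} \alpha_i = 1
\]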
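The bound is easy to check numerically. The sketch below is illustrative only: it uses random orthogonal matrices from a QR decomposition as stand-ins for Cayley outputs, and `W_alpha` is a hypothetical random gating matrix rather than a trained one.

```python
import torch

torch.manual_seed(0)
n, k, d = 4, 3, 16

# k random orthogonal matrices (Q factor of a QR decomposition).
Qs = torch.stack([torch.linalg.qr(torch.randn(n, n)).Q for _ in range(k)])

# Input-dependent weights alpha(x) = softmax(W_alpha x).
W_alpha = torch.randn(k, d)
x = torch.randn(d)
alpha = torch.softmax(W_alpha @ x, dim=-1)        # nonnegative, sums to 1

H_res = torch.einsum("i,ijk->jk", alpha, Qs)      # sum_i alpha_i Q_i
norm = torch.linalg.matrix_norm(H_res, ord=2)     # largest singular value

assert norm <= 1.0 + 1e-5
```

The norm reaches 1 when, for instance, \(\boldsymbol{\alpha}\) is one-hot, since the mixture then reduces to a single orthogonal matrix.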
Why This Matters
This bound is:
Exact by construction (no iteration needed)
Stronger than mHC, which achieves \(\rho \approx 1.6\) with 20 Sinkhorn iterations
Independent of depth, so it holds for arbitrarily deep networks
```python
import torch
from shc.layers import SparseOrthogonalMixture

routing = SparseOrthogonalMixture(n=4, k=2, hidden_dim=768)
x = torch.randn(2, 768)

# Spectral norm is ALWAYS ≤ 1
norms = routing.get_spectral_norm(x)
assert (norms <= 1.0 + 1e-4).all()
```
Factorized Stream Representation
The Memory Problem
Multi-stream architectures expand hidden state from \(d\) to \(n \times d\), causing \(n\times\) KV cache expansion.
Low-Rank Observation
Empirically, the expanded representation is approximately low-rank: PCA on \(\bar{\mathbf{x}}\) shows that the first principal component captures 85% of the variance.
Rank-r Factorization
We compress via learned low-rank projections:
\[
\bar{\mathbf{x}} \approx \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top
\]
where:
\(\mathbf{U} \in \mathbb{R}^{n \times r}\)
\(\boldsymbol{\Sigma} \in \mathbb{R}^{r \times r}\) (diagonal)
\(\mathbf{V} \in \mathbb{R}^{d \times r}\)
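The shapes and the storage trade-off can be illustrated with a truncated SVD. Note that this swaps in a plain SVD for the learned projections described above, and the input is synthetic (a shared rank-1 component plus small noise), so the printed error says nothing about a trained model.

```python
import torch

torch.manual_seed(0)
n, d, r = 4, 768, 1

# Synthetic multi-stream state: shared rank-1 component plus small noise (assumption).
x_bar = torch.randn(n, 1) @ torch.randn(1, d) + 0.05 * torch.randn(n, d)

# Truncated SVD gives the best rank-r approximation in Frobenius norm (Eckart-Young).
U, S, Vh = torch.linalg.svd(x_bar, full_matrices=False)
U_r, S_r, V_r = U[:, :r], S[:r], Vh[:r, :].T       # shapes: (n, r), (r,), (d, r)

x_hat = U_r @ torch.diag(S_r) @ V_r.T              # U Sigma V^T, shape (n, d)
rel_err = (torch.linalg.norm(x_bar - x_hat) / torch.linalg.norm(x_bar)).item()

full_numbers = n * d                               # values stored without factorization
factored_numbers = r * (n + d + 1)                 # U, diag(Sigma), V
print(f"relative error: {rel_err:.3f}")
print(f"stored values: {factored_numbers} vs {full_numbers}")
```

For \(r = 1\) the factors hold \(n + d + 1\) values instead of \(n \cdot d\), which is the source of the memory saving.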
Result
Rank-1 (\(r=1\)): 99% reconstruction accuracy, 70% memory savings
For \(n = 4\) streams, the cache shrinks from \(4\times\) to \(\sim 1.2\times\) the baseline, which matches the 70% saving (\(1 - 1.2/4 = 0.7\))
Bounded Signal Propagation
Theorem (Multi-Layer Stability)
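For any SHC network with \(L\) layers:
\[
\rho\!\left(\prod_{l=0}^{L-1} \mathbf{H}^{\text{res}}_l\right) \le 1
\]
Proof: Each \(\mathbf{H}^{\text{res}}_l\) has \(\rho \leq 1\) by Proposition 1. By submultiplicativity:
\[
\rho\!\left(\prod_{l=0}^{L-1} \mathbf{H}^{\text{res}}_l\right) \le \prod_{l=0}^{L-1} \rho\!\left(\mathbf{H}^{\text{res}}_l\right) \le 1
\]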
Because the bound does not depend on \(L\), gradient explosion through the residual path is ruled out at any network depth.
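The depth-independence can be sanity-checked by multiplying many mixtures together. This is a minimal sketch under the same illustrative assumptions as before: random orthogonal matrices and random softmax weights stand in for trained SHC layers, and `random_mixture` is a hypothetical helper.

```python
import torch

torch.manual_seed(0)
n, k, L = 4, 2, 60

def random_mixture() -> torch.Tensor:
    """Convex combination of k random orthogonal n x n matrices."""
    Qs = torch.stack(
        [torch.linalg.qr(torch.randn(n, n, dtype=torch.float64)).Q for _ in range(k)]
    )
    alpha = torch.softmax(torch.randn(k, dtype=torch.float64), dim=-1)
    return torch.einsum("i,ijk->jk", alpha, Qs)

prod = torch.eye(n, dtype=torch.float64)
max_gain = 0.0
for _ in range(L):
    prod = random_mixture() @ prod
    max_gain = max(max_gain, torch.linalg.matrix_norm(prod, ord=2).item())

# The running product never exceeds unit gain, at any intermediate depth.
assert max_gain <= 1.0 + 1e-9
```

The check covers only the upper bound; the product's norm is free to fall below 1.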
Comparison with mHC
| Property | mHC (Sinkhorn) | SHC (Cayley) |
|---|---|---|
| Spectral norm bound | \(\rho \leq 1.6\) (approximate) | \(\rho \leq 1\) (exact) |
| Routing computation | \(\mathcal{O}(20n^2)\) iterations | \(\mathcal{O}(kn^3)\) closed-form |
| KV cache | \(n \times d\) | \(\sim d\) (factorized) |
| Stability guarantee | Approximate (iteration-dependent) | Exact (by construction) |
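To illustrate why an iterative normalization is "iteration-dependent", here is the textbook Sinkhorn-Knopp procedure: alternate row and column normalization of a positive matrix. This is a generic sketch and not mHC's actual implementation, which this section does not show; the function name `sinkhorn` and the random logits are assumptions.

```python
import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternate row and column normalization of exp(logits)."""
    M = torch.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)   # make rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)   # make columns sum to 1
    return M

torch.manual_seed(0)
logits = torch.randn(4, 4, dtype=torch.float64)

row_err = {}
for iters in (1, 5, 20):
    P = sinkhorn(logits, iters)
    # Columns are exact after the last step; rows are only approximately normalized.
    row_err[iters] = (P.sum(dim=1) - 1).abs().max().item()
    print(iters, row_err[iters])
```

The residual row error shrinks as iterations are added but depends on where the loop stops, whereas the Cayley construction needs no iteration.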
References
He et al., “Deep Residual Learning for Image Recognition” (2016)
Cayley, “Sur quelques propriétés des déterminants gauches” (1846)
Birkhoff, “Three observations on linear algebra” (1946)
Sinkhorn & Knopp, “Concerning nonnegative matrices and doubly stochastic matrices” (1967)