Theoretical Foundations

This section explains the mathematical foundations of Sparse Selective Hyper-Connections (SHC).

The Stability Problem

Standard Residual Connections

The standard residual connection for layer \(l\) is:

\[\mathbf{x}_{l+1} = \mathbf{x}_l + f_l(\mathbf{x}_l)\]

This identity mapping ensures signal conservation: \(\mathbf{x}_L = \mathbf{x}_0 + \sum_{l=0}^{L-1} f_l(\mathbf{x}_l)\).

Hyper-Connections

Hyper-Connections generalize residuals by expanding to \(n\) parallel streams with learnable mixing:

\[\bar{\mathbf{x}}_{l+1} = \mathbf{H}^{\text{res}}_l \bar{\mathbf{x}}_l + \mathbf{H}^{\text{post}}_l f_l(\mathbf{H}^{\text{pre}}_l \bar{\mathbf{x}}_l)\]

where:

\(\bar{\mathbf{x}}_l \in \mathbb{R}^{n \times d}\) is the multi-stream hidden state
\(\mathbf{H}^{\text{res}}, \mathbf{H}^{\text{pre}}, \mathbf{H}^{\text{post}} \in \mathbb{R}^{n \times n}\) are learnable mixing matrices

The Problem: Spectral Norm Explosion

Without constraints, spectral norms compound across layers:

\[\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\]

Unconstrained matrices yield gain magnitudes of ~3000 at 60 layers, causing gradient explosion and training collapse.

The Cayley Transform

The Cayley transform provides a closed-form bijection between skew-symmetric matrices and orthogonal matrices:

\[\mathbf{Q}(\mathbf{A}) = (\mathbf{I} - \mathbf{A})(\mathbf{I} + \mathbf{A})^{-1}\]

where \(\mathbf{A} = -\mathbf{A}^\top\) is a skew-symmetric matrix.

Properties

For any \(\mathbf{Q}\) generated by the Cayley transform:

Orthogonality: \(\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}\)
Unit Spectral Norm: \(\rho(\mathbf{Q}) = 1\) (exactly)
Norm Preservation: \(\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2\)

Parameterization

A skew-symmetric matrix \(\mathbf{A}\) requires only \(\frac{n(n-1)}{2}\) free parameters (the upper triangle), making it efficient to learn.

from shc.layers import CayleyTransform

# 4x4 orthogonal matrix from 6 parameters
cayley = CayleyTransform(n=4)  # 4*(4-1)/2 = 6 params
Q = cayley()

# Verify orthogonality
import torch
assert torch.allclose(Q.T @ Q, torch.eye(4), atol=1e-5)
assert torch.allclose(cayley.get_spectral_norm(), torch.tensor(1.0), atol=1e-4)

Sparse Orthogonal Mixtures

SHC parameterizes \(\mathbf{H}^{\text{res}}\) as a sparse mixture of \(k\) orthogonal matrices:

\[\mathbf{H}^{\text{res}} = \sum_{i=1}^{k} \alpha_i(\mathbf{x}) \cdot \mathbf{Q}_i\]

where:

\(\boldsymbol{\alpha}(\mathbf{x}) = \text{softmax}(\mathbf{W}_\alpha \mathbf{x}) \in \Delta^{k-1}\) are input-dependent mixing weights
Each \(\mathbf{Q}_i\) is an orthogonal matrix from the Cayley transform

Bounded Spectral Norm (Proposition 1)

For any convex combination of orthogonal matrices:

\[\rho\left(\mathbf{H}^{\text{res}}\right) \leq 1\]

Proof: By the triangle inequality and orthogonality:

\[\rho\left(\sum_i \alpha_i \mathbf{Q}_i\right) \leq \sum_i \alpha_i \rho(\mathbf{Q}_i) = \sum_i \alpha_i = 1\]

Why This Matters

This bound is:

Exact by construction (no iteration needed)
Stronger than mHC which achieves \(\rho \approx 1.6\) with 20 Sinkhorn iterations
Stable for arbitrarily deep networks

from shc.layers import SparseOrthogonalMixture

routing = SparseOrthogonalMixture(n=4, k=2, hidden_dim=768)
x = torch.randn(2, 768)

# Spectral norm is ALWAYS ≤ 1
norms = routing.get_spectral_norm(x)
assert (norms <= 1.0 + 1e-4).all()

Factorized Stream Representation

The Memory Problem

Multi-stream architectures expand hidden state from \(d\) to \(n \times d\), causing \(n\times\) KV cache expansion.

Low-Rank Observation

Empirically, the expanded representation is low-rank: PCA on \(\bar{\mathbf{x}}\) shows the first principal component captures 85% of variance.

Rank-r Factorization

We compress via learned low-rank projections:

\[\bar{\mathbf{x}} \approx \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top\]

where:

\(\mathbf{U} \in \mathbb{R}^{n \times r}\)
\(\boldsymbol{\Sigma} \in \mathbb{R}^{r \times r}\) (diagonal)
\(\mathbf{V} \in \mathbb{R}^{d \times r}\)

Result

Rank-1 (\(r=1\)): 99% reconstruction accuracy, 70% memory savings
Cache reduces from \(4\times\) to \(\sim 1.2\times\) baseline

Bounded Signal Propagation

Theorem (Multi-Layer Stability)

For any SHC network with \(L\) layers:

\[\rho\left(\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\right) \leq 1\]

Proof: Each \(\mathbf{H}^{\text{res}}_l\) has \(\rho \leq 1\) by Proposition 1. By submultiplicativity:

\[\rho\left(\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\right) \leq \prod_{l=1}^{L} \rho(\mathbf{H}^{\text{res}}_l) \leq 1\]

This guarantees stable gradients regardless of network depth.

Comparison with mHC

Property	mHC (Sinkhorn)	SHC (Cayley)
Spectral norm bound	\(\rho \leq 1.6\) (approximate)	\(\rho \leq 1\) (exact)
Routing computation	\(\mathcal{O}(20n^2)\) iterations	\(\mathcal{O}(kn^3)\) closed-form
KV cache	\(n \times d\)	\(\sim d\) (factorized)
Stability guarantee	Approximate (iteration-dependent)	Exact (by construction)

References

He et al., “Deep Residual Learning for Image Recognition” (2016)
Cayley, “Sur quelques propriétés des déterminants gauches” (1846)
Birkhoff, “Three observations on linear algebra” (1946)
Sinkhorn & Knopp, “Concerning nonnegative matrices” (1967)