Theoretical Foundations

This section explains the mathematical foundations of Sparse Selective Hyper-Connections (SHC).

The Stability Problem

Standard Residual Connections

The standard residual connection for layer \(l\) is:

\[\mathbf{x}_{l+1} = \mathbf{x}_l + f_l(\mathbf{x}_l)\]

This identity mapping ensures signal conservation: \(\mathbf{x}_L = \mathbf{x}_0 + \sum_{l=0}^{L-1} f_l(\mathbf{x}_l)\).

Hyper-Connections

Hyper-Connections generalize residuals by expanding to \(n\) parallel streams with learnable mixing:

\[\bar{\mathbf{x}}_{l+1} = \mathbf{H}^{\text{res}}_l \bar{\mathbf{x}}_l + \mathbf{H}^{\text{post}}_l f_l(\mathbf{H}^{\text{pre}}_l \bar{\mathbf{x}}_l)\]

where:

  • \(\bar{\mathbf{x}}_l \in \mathbb{R}^{n \times d}\) is the multi-stream hidden state

  • \(\mathbf{H}^{\text{res}}, \mathbf{H}^{\text{pre}}, \mathbf{H}^{\text{post}} \in \mathbb{R}^{n \times n}\) are learnable mixing matrices

The Problem: Spectral Norm Explosion

Without constraints, spectral norms compound across layers:

\[\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\]

Unconstrained matrices yield gain magnitudes of ~3000 at 60 layers, causing gradient explosion and training collapse.

The Cayley Transform

The Cayley transform provides a closed-form bijection between skew-symmetric matrices and orthogonal matrices:

\[\mathbf{Q}(\mathbf{A}) = (\mathbf{I} - \mathbf{A})(\mathbf{I} + \mathbf{A})^{-1}\]

where \(\mathbf{A} = -\mathbf{A}^\top\) is a skew-symmetric matrix.

Properties

For any \(\mathbf{Q}\) generated by the Cayley transform:

  1. Orthogonality: \(\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}\)

  2. Unit Spectral Norm: \(\rho(\mathbf{Q}) = 1\) (exactly)

  3. Norm Preservation: \(\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2\)

Parameterization

A skew-symmetric matrix \(\mathbf{A}\) requires only \(\frac{n(n-1)}{2}\) free parameters (the upper triangle), making it efficient to learn.

from shc.layers import CayleyTransform

# 4x4 orthogonal matrix from 6 parameters
cayley = CayleyTransform(n=4)  # 4*(4-1)/2 = 6 params
Q = cayley()

# Verify orthogonality
import torch
assert torch.allclose(Q.T @ Q, torch.eye(4), atol=1e-5)
assert torch.allclose(cayley.get_spectral_norm(), torch.tensor(1.0), atol=1e-4)

Sparse Orthogonal Mixtures

SHC parameterizes \(\mathbf{H}^{\text{res}}\) as a sparse mixture of \(k\) orthogonal matrices:

\[\mathbf{H}^{\text{res}} = \sum_{i=1}^{k} \alpha_i(\mathbf{x}) \cdot \mathbf{Q}_i\]

where:

  • \(\boldsymbol{\alpha}(\mathbf{x}) = \text{softmax}(\mathbf{W}_\alpha \mathbf{x}) \in \Delta^{k-1}\) are input-dependent mixing weights

  • Each \(\mathbf{Q}_i\) is an orthogonal matrix from the Cayley transform

Bounded Spectral Norm (Proposition 1)

For any convex combination of orthogonal matrices:

\[\rho\left(\mathbf{H}^{\text{res}}\right) \leq 1\]

Proof: By the triangle inequality and orthogonality:

\[\rho\left(\sum_i \alpha_i \mathbf{Q}_i\right) \leq \sum_i \alpha_i \rho(\mathbf{Q}_i) = \sum_i \alpha_i = 1\]

Why This Matters

This bound is:

  • Exact by construction (no iteration needed)

  • Stronger than mHC which achieves \(\rho \approx 1.6\) with 20 Sinkhorn iterations

  • Stable for arbitrarily deep networks

from shc.layers import SparseOrthogonalMixture

routing = SparseOrthogonalMixture(n=4, k=2, hidden_dim=768)
x = torch.randn(2, 768)

# Spectral norm is ALWAYS ≤ 1
norms = routing.get_spectral_norm(x)
assert (norms <= 1.0 + 1e-4).all()

Factorized Stream Representation

The Memory Problem

Multi-stream architectures expand hidden state from \(d\) to \(n \times d\), causing \(n\times\) KV cache expansion.

Low-Rank Observation

Empirically, the expanded representation is low-rank: PCA on \(\bar{\mathbf{x}}\) shows the first principal component captures 85% of variance.

Rank-r Factorization

We compress via learned low-rank projections:

\[\bar{\mathbf{x}} \approx \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top\]

where:

  • \(\mathbf{U} \in \mathbb{R}^{n \times r}\)

  • \(\boldsymbol{\Sigma} \in \mathbb{R}^{r \times r}\) (diagonal)

  • \(\mathbf{V} \in \mathbb{R}^{d \times r}\)

Result

  • Rank-1 (\(r=1\)): 99% reconstruction accuracy, 70% memory savings

  • Cache reduces from \(4\times\) to \(\sim 1.2\times\) baseline

Bounded Signal Propagation

Theorem (Multi-Layer Stability)

For any SHC network with \(L\) layers:

\[\rho\left(\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\right) \leq 1\]

Proof: Each \(\mathbf{H}^{\text{res}}_l\) has \(\rho \leq 1\) by Proposition 1. By submultiplicativity:

\[\rho\left(\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\right) \leq \prod_{l=1}^{L} \rho(\mathbf{H}^{\text{res}}_l) \leq 1\]

This guarantees stable gradients regardless of network depth.

Comparison with mHC

Property

mHC (Sinkhorn)

SHC (Cayley)

Spectral norm bound

\(\rho \leq 1.6\) (approximate)

\(\rho \leq 1\) (exact)

Routing computation

\(\mathcal{O}(20n^2)\) iterations

\(\mathcal{O}(kn^3)\) closed-form

KV cache

\(n \times d\)

\(\sim d\) (factorized)

Stability guarantee

Approximate (iteration-dependent)

Exact (by construction)

References

  1. He et al., “Deep Residual Learning for Image Recognition” (2016)

  2. Cayley, “Sur quelques propriétés des déterminants gauches” (1846)

  3. Birkhoff, “Three observations on linear algebra” (1946)

  4. Sinkhorn & Knopp, “Concerning nonnegative matrices” (1967)