# Theoretical Foundations

This section explains the mathematical foundations of Sparse Selective Hyper-Connections (SHC).

## The Stability Problem

### Standard Residual Connections

The standard residual connection for layer $l$ is:

$$\mathbf{x}_{l+1} = \mathbf{x}_l + f_l(\mathbf{x}_l)$$

This identity mapping ensures signal conservation: $\mathbf{x}_L = \mathbf{x}_0 + \sum_{l=0}^{L-1} f_l(\mathbf{x}_l)$.

### Hyper-Connections

Hyper-Connections generalize residuals by expanding to $n$ parallel streams with learnable mixing:

$$\bar{\mathbf{x}}_{l+1} = \mathbf{H}^{\text{res}}_l \bar{\mathbf{x}}_l + \mathbf{H}^{\text{post}}_l f_l(\mathbf{H}^{\text{pre}}_l \bar{\mathbf{x}}_l)$$

where:
- $\bar{\mathbf{x}}_l \in \mathbb{R}^{n \times d}$ is the multi-stream hidden state
- $\mathbf{H}^{\text{res}}, \mathbf{H}^{\text{pre}}, \mathbf{H}^{\text{post}} \in \mathbb{R}^{n \times n}$ are learnable mixing matrices

### The Problem: Spectral Norm Explosion

Without constraints, spectral norms compound across layers:

$$\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l$$

Unconstrained matrices yield gain magnitudes of ~3000 at 60 layers, causing gradient explosion and training collapse.

## The Cayley Transform

The **Cayley transform** provides a closed-form bijection between skew-symmetric matrices and orthogonal matrices:

$$\mathbf{Q}(\mathbf{A}) = (\mathbf{I} - \mathbf{A})(\mathbf{I} + \mathbf{A})^{-1}$$

where $\mathbf{A} = -\mathbf{A}^\top$ is a skew-symmetric matrix.

### Properties

For any $\mathbf{Q}$ generated by the Cayley transform:

1. **Orthogonality**: $\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$
2. **Unit Spectral Norm**: $\rho(\mathbf{Q}) = 1$ (exactly)
3. **Norm Preservation**: $\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$

### Parameterization

A skew-symmetric matrix $\mathbf{A}$ requires only $\frac{n(n-1)}{2}$ free parameters (the upper triangle), making it efficient to learn.

```python
from shc.layers import CayleyTransform

# 4x4 orthogonal matrix from 6 parameters
cayley = CayleyTransform(n=4)  # 4*(4-1)/2 = 6 params
Q = cayley()

# Verify orthogonality
import torch
assert torch.allclose(Q.T @ Q, torch.eye(4), atol=1e-5)
assert torch.allclose(cayley.get_spectral_norm(), torch.tensor(1.0), atol=1e-4)
```

## Sparse Orthogonal Mixtures

SHC parameterizes $\mathbf{H}^{\text{res}}$ as a **sparse mixture** of $k$ orthogonal matrices:

$$\mathbf{H}^{\text{res}} = \sum_{i=1}^{k} \alpha_i(\mathbf{x}) \cdot \mathbf{Q}_i$$

where:
- $\boldsymbol{\alpha}(\mathbf{x}) = \text{softmax}(\mathbf{W}_\alpha \mathbf{x}) \in \Delta^{k-1}$ are input-dependent mixing weights
- Each $\mathbf{Q}_i$ is an orthogonal matrix from the Cayley transform

### Bounded Spectral Norm (Proposition 1)

**For any convex combination of orthogonal matrices:**

$$\rho\left(\mathbf{H}^{\text{res}}\right) \leq 1$$

**Proof**: By the triangle inequality and orthogonality:

$$\rho\left(\sum_i \alpha_i \mathbf{Q}_i\right) \leq \sum_i \alpha_i \rho(\mathbf{Q}_i) = \sum_i \alpha_i = 1$$

### Why This Matters

This bound is:
- **Exact by construction** (no iteration needed)
- **Stronger than mHC** which achieves $\rho \approx 1.6$ with 20 Sinkhorn iterations
- **Stable for arbitrarily deep networks**

```python
from shc.layers import SparseOrthogonalMixture

routing = SparseOrthogonalMixture(n=4, k=2, hidden_dim=768)
x = torch.randn(2, 768)

# Spectral norm is ALWAYS ≤ 1
norms = routing.get_spectral_norm(x)
assert (norms <= 1.0 + 1e-4).all()
```

## Factorized Stream Representation

### The Memory Problem

Multi-stream architectures expand hidden state from $d$ to $n \times d$, causing $n\times$ KV cache expansion.

### Low-Rank Observation

Empirically, the expanded representation is low-rank: PCA on $\bar{\mathbf{x}}$ shows the first principal component captures 85% of variance.

### Rank-r Factorization

We compress via learned low-rank projections:

$$\bar{\mathbf{x}} \approx \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$$

where:
- $\mathbf{U} \in \mathbb{R}^{n \times r}$
- $\boldsymbol{\Sigma} \in \mathbb{R}^{r \times r}$ (diagonal)
- $\mathbf{V} \in \mathbb{R}^{d \times r}$

### Result

- **Rank-1** ($r=1$): 99% reconstruction accuracy, 70% memory savings
- Cache reduces from $4\times$ to $\sim 1.2\times$ baseline

## Bounded Signal Propagation

### Theorem (Multi-Layer Stability)

For any SHC network with $L$ layers:

$$\rho\left(\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\right) \leq 1$$

**Proof**: Each $\mathbf{H}^{\text{res}}_l$ has $\rho \leq 1$ by Proposition 1. By submultiplicativity:

$$\rho\left(\prod_{l=1}^{L} \mathbf{H}^{\text{res}}_l\right) \leq \prod_{l=1}^{L} \rho(\mathbf{H}^{\text{res}}_l) \leq 1$$

This guarantees stable gradients regardless of network depth.

## Comparison with mHC

| Property | mHC (Sinkhorn) | SHC (Cayley) |
|----------|----------------|--------------|
| Spectral norm bound | $\rho \leq 1.6$ (approximate) | $\rho \leq 1$ (exact) |
| Routing computation | $\mathcal{O}(20n^2)$ iterations | $\mathcal{O}(kn^3)$ closed-form |
| KV cache | $n \times d$ | $\sim d$ (factorized) |
| Stability guarantee | Approximate (iteration-dependent) | Exact (by construction) |

## References

1. He et al., "Deep Residual Learning for Image Recognition" (2016)
2. Cayley, "Sur quelques propriétés des déterminants gauches" (1846)
3. Birkhoff, "Three observations on linear algebra" (1946)
4. Sinkhorn & Knopp, "Concerning nonnegative matrices" (1967)