# Architecture

The SHC Transformer architecture extends the standard transformer with multi-stream residual connections using sparse orthogonal routing.

## Overview

```
┌─────────────────────────────────────────────────────────┐
│                    SHC Transformer                       │
├─────────────────────────────────────────────────────────┤
│                                                          │
│   Input: token_ids (batch, seq_len)                     │
│              ↓                                           │
│   ┌─────────────────────┐                               │
│   │   Token Embedding   │  vocab_size → hidden_dim      │
│   └─────────────────────┘                               │
│              ↓                                           │
│   ┌─────────────────────┐                               │
│   │    N × SHC Block    │  With orthogonal routing      │
│   └─────────────────────┘                               │
│              ↓                                           │
│   ┌─────────────────────┐                               │
│   │      RMS Norm       │                               │
│   └─────────────────────┘                               │
│              ↓                                           │
│   ┌─────────────────────┐                               │
│   │     LM Head         │  hidden_dim → vocab_size      │
│   └─────────────────────┘                               │
│              ↓                                           │
│   Output: logits (batch, seq_len, vocab_size)           │
│                                                          │
└─────────────────────────────────────────────────────────┘
```

## SHC Block

Each SHC Block implements Algorithm 1 from the paper:

```
SHC Block Forward Pass
──────────────────────
Input: x ∈ ℝ^d, layer index l

1. n_eff ← AdaptiveRank(x, l)         # Adaptive stream expansion

2. IF n_eff > 1:
   a. x̄ ← StreamExpand(x, n_eff)     # Expand to n streams
   b. α ← softmax(W_α · x̄)           # Compute mixing weights
   c. H^res ← Σ αᵢ · Q(Aᵢ)           # Cayley routing matrix
   
   d. x̄_out ← H^res · x̄ + H^post · f(H^pre · x̄)
              ↑           ↑         ↑
           residual    output    input
           routing     routing   routing
   
   e. x_out ← Compress(x̄_out, r=1)   # Factorized cache

3. ELSE:
   x_out ← x + f(x)                   # Standard residual

Return: x_out
```

## Model Configurations

| Size | Hidden Dim | Layers | Heads | FFN Dim | Parameters |
|------|------------|--------|-------|---------|------------|
| 500M | 1024 | 24 | 16 | 4096 | ~500M |
| 1B | 2048 | 24 | 16 | 8192 | ~1B |
| 3B | 2560 | 32 | 32 | 10240 | ~3B |
| 7B | 4096 | 32 | 32 | 11008 | ~7B |

```python
from shc.models import get_config, SHCTransformer

# Load predefined configuration
config = get_config('3b')
model = SHCTransformer(config)
```

## Core Components

### CayleyTransform

Generates orthogonal matrices with exactly $\rho = 1$:

```python
from shc.layers import CayleyTransform

cayley = CayleyTransform(n=4, init_scale=0.01)
Q = cayley()  # 4×4 orthogonal matrix
```

### SparseOrthogonalMixture

Input-dependent mixture of $k$ orthogonal matrices:

```python
from shc.layers import SparseOrthogonalMixture

routing = SparseOrthogonalMixture(
    n=4,           # Number of streams
    k=2,           # Number of orthogonal matrices
    hidden_dim=768 # Dimension for computing mixing weights
)

H_res = routing(x)  # (batch, n, n) routing matrix
```

### FactorizedKVCache

Low-rank compression for efficient caching:

```python
from shc.layers import FactorizedKVCache

cache = FactorizedKVCache(
    n=4,    # Number of streams
    d=768,  # Hidden dimension
    r=1     # Factorization rank
)
```

### AdaptiveRankSelector

Layer-wise and input-dependent effective rank:

```python
from shc.layers import AdaptiveRankSelector

selector = AdaptiveRankSelector(n=4, hidden_dim=768)
n_eff = selector(x)  # Effective number of streams
```

## Multi-Head Attention

Standard multi-head attention with RoPE positional encoding:

```python
from shc.blocks import MultiHeadAttention

attention = MultiHeadAttention(
    hidden_dim=768,
    n_heads=12,
    max_seq_len=4096,
    use_rope=True
)
```

## Feed-Forward Network

SwiGLU activation with learnable gating:

```python
from shc.blocks import FeedForward

ffn = FeedForward(
    hidden_dim=768,
    ffn_dim=3072  # Typically 4× hidden_dim
)
```

## Generation

Autoregressive generation with KV caching:

```python
output = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    do_sample=True
)
```

## SSM Distillation

For O(L) inference, distill into a State Space Model:

```python
from shc.models import SSMStudent
from shc.training import DistillationTrainer

# Create student matching teacher dimensions
student = SSMStudent.from_teacher_config(teacher.config)

# Distill
trainer = DistillationTrainer(teacher, student, config, data)
trainer.train()

# Student generates without KV cache
output = student.generate(input_ids, max_new_tokens=100)
```