We didn't go looking for a routing mechanism in dense transformer models; there isn't one in the architecture. But every token entering Mistral-7B, Mistral-Small-24B, and Qwen3-4B gets routed into one of two processing paths at layers 3–4, and the routing is consistent enough across all three models that we treat it as a real circuit rather than a statistical artifact.
The core finding
Every token entering the model gets routed into one of two paths:
- Mode A (93–95% of tokens): shallow processing, minimal transformation
- Mode B (5–7%): deep processing, massive representational shift
There's no mixture-of-experts routing. There's no explicit gating layer. The model develops this routing during training, on its own.
Cross-architecture confirmation
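| Model | Family | Parameters | Gate layers | Mode B share |
|---|---|---|---|---|
| Mistral-7B | Mistral | 7B | L3–L4 | ~50% of outlier tokens |
| Mistral-Small-24B | Mistral | 24B | L3–L4 | 47.6% of responses |
| Qwen3-4B | Qwen | 4B | L3–L4 | 7.27% of tokens |

The three shares use different denominators (outlier tokens, responses, all tokens), so they are not directly comparable. What replicates across models is the gate location.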
Statistical evidence
| Model | Evidence |
|---|---|
| Mistral-7B | L2-L3 avg distance +2,170%, std dev +17,750%, avg/median ratio 17.9× |
| Mistral-Small-24B | Cohen's d = 4.8, Silhouette = 0.83, AUC-ROC = 0.97, p < 0.001 |
| Qwen3-4B | Feature 374 correlation 0.96 across L3–L5, bimodal ratio 1.09 |
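
For readers unfamiliar with the effect size quoted for Mistral-Small-24B, here is a minimal sketch of Cohen's d with a pooled standard deviation. It assumes each token has already been reduced to a single score and assigned to a mode; the function name and the toy scores are hypothetical, not taken from the experiments.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_b minus group_a), pooled sample std."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)


# Toy per-token scores, invented purely for illustration.
mode_a_scores = [1.0, 2.0, 3.0]
mode_b_scores = [5.0, 6.0, 7.0]
print(cohens_d(mode_a_scores, mode_b_scores))  # 4.0
```

By the usual rule of thumb a d of 0.8 already counts as a large effect, so a value of 4.8 means the two score distributions barely overlap.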
Causal proof
Ablating the gate at Layer 3 in Qwen3-4B:
| Metric | Baseline | After ablation | Change |
|---|---|---|---|
| L6 Mode B mean | 308.6 | 20.5 | -93.4% |
| L6 extreme tokens (>p99.9) | 20 | 0 | -100% |
| L6 max | 11,475.4 | 147.3 | -98.7% |
Removing the gate removes the downstream behavior. That's the test we wanted — a functional circuit, not a correlation.
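
The Change column is a plain relative difference against baseline. This short check recomputes it from the table values; the helper name `percent_change` is ours, introduced only for illustration.

```python
def percent_change(baseline, after):
    """Relative change from baseline to after, in percent."""
    return (after - baseline) / baseline * 100.0


# (baseline, after ablation) pairs copied from the table above.
rows = {
    "L6 Mode B mean": (308.6, 20.5),
    "L6 extreme tokens (>p99.9)": (20, 0),
    "L6 max": (11475.4, 147.3),
}

for name, (baseline, after) in rows.items():
    print(f"{name}: {percent_change(baseline, after):.1f}%")
# L6 Mode B mean: -93.4%
# L6 extreme tokens (>p99.9): -100.0%
# L6 max: -98.7%
```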
Three-stage pipeline (Qwen3-4B, 200K tokens)
| Stage | Layers | Behavior | Mode B % |
|---|---|---|---|
| Shallow triage | L0–L5 | Bimodal gate, selective routing | 5–17% |
| Deep explosion | L6–L15 | Std dev explodes 4,289%, tokens leave vocabulary space | 0.1% |
| Final routing | L16–L35 | Steady divergence, second bimodal gate at L35 | 0.2–10.9% |
The L5 compression dip
Every L6-extreme token follows the same trajectory: scores build through L0–L4, drop at L5, then explode 300×+ at L6.
- Token ' unwanted': L3=43.4, L4=48.3, L5=31.2, L6=11,409.6
- Token ' tops': L3=38.5, L4=45.1, L5=25.7, L6=10,678.6
It's a two-stage funnel: 100% of L6-extreme tokens were L3 Mode B, but only 1.6% of L3 Mode B tokens become L6-extreme. The first gate qualifies; the second commits.
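
The funnel claim is two conditional fractions measured in opposite directions, and the L5-to-L6 jump is a simple ratio. Below is a rough sketch of both, assuming you already have one boolean flag per token for "L3 Mode B" and "L6-extreme". The function name and the toy masks are hypothetical; only the four scores come from the examples above.

```python
def funnel_rates(l3_mode_b, l6_extreme):
    """Return (share of L6-extreme tokens that were L3 Mode B,
    share of L3 Mode B tokens that became L6-extreme)."""
    both = sum(1 for a, b in zip(l3_mode_b, l6_extreme) if a and b)
    n_l3 = sum(l3_mode_b)
    n_l6 = sum(l6_extreme)
    if n_l3 == 0 or n_l6 == 0:
        raise ValueError("need at least one token in each group")
    return both / n_l6, both / n_l3


# Toy masks, invented for illustration: 4 tokens pass the first gate, 1 commits.
l3_flags = [True, True, True, True, False, False]
l6_flags = [True, False, False, False, False, False]
print(funnel_rates(l3_flags, l6_flags))  # (1.0, 0.25)

# L5 -> L6 jump for the two example tokens, as (L5 score, L6 score).
examples = {" unwanted": (31.2, 11409.6), " tops": (25.7, 10678.6)}
jumps = {token: l6 / l5 for token, (l5, l6) in examples.items()}
for token, ratio in jumps.items():
    print(f"{token!r}: {ratio:.0f}x")
```

Both example tokens clear the 300× mark when measured from their L5 dip.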
Where standard SAEs break
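Sparse autoencoders (SAEs), the dominant interpretability tool, fail on the deep computation layers: from L6 through L24 their explained variance is strongly negative, which means the reconstruction is worse than simply predicting the mean activation. A pipeline of SipIt invertibility + a pre-trained SAE + GLP diffusion prior recovers what they miss.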
| Layer | Role | SAE explained variance | Pipeline explained variance |
|---|---|---|---|
| L3 | Gate | 92.2% | 99.99% |
| L5 | Compression | 85.2% | 99.99% |
| L6 | Explosion | -905% | 100.0% |
| L8 | Post-explosion | -1,111% | 100.0% |
| L16 | Deep compute | -1,058% | 100.0% |
| L24 | Deepest | -3,059% | 100.0% |
| L35 | Final gate | 86.9% | 99.97% |
The takeaway we drew: deep computation layers use dense / distributed representations, not the sparse features SAEs are built to extract. That's a blind spot in the dominant interpretability approach, not a defect in our SAEs.
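
Negative explained variance can look like a typo, so here is a minimal sketch of how it arises. It assumes the common definition, 1 minus residual energy over total variance around the mean. The exact formula behind the table is not stated here, so treat that definition, the function name, and the toy vectors as assumptions.

```python
def explained_variance(originals, reconstructions):
    """1 - (squared reconstruction error) / (squared deviation from the mean)."""
    n = len(originals)
    dim = len(originals[0])
    # Per-dimension mean of the original activations.
    mean_vec = [sum(x[j] for x in originals) / n for j in range(dim)]
    residual = sum(
        (x[j] - r[j]) ** 2
        for x, r in zip(originals, reconstructions)
        for j in range(dim)
    )
    total = sum((x[j] - mean_vec[j]) ** 2 for x in originals for j in range(dim))
    return 1.0 - residual / total


# Toy activations, invented for illustration.
acts = [[0.0, 0.0], [2.0, 2.0]]
print(explained_variance(acts, acts))                        # 1.0: perfect
print(explained_variance(acts, [[1.0, 1.0], [1.0, 1.0]]))    # 0.0: mean only
print(explained_variance(acts, [[4.0, 4.0], [-2.0, -2.0]]))  # -15.0: worse than mean
```

Under this definition, a score of -905% corresponds to a reconstruction error roughly ten times the variance of the activations themselves.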
Thinking-mode experiment
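Qwen3's thinking-mode token does not widen the gate: the Mode B share stays flat or dips slightly. What it amplifies is depth, with maximum scores rising 46–87%.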
| Layer | Baseline Mode B % | Think Mode B % | Baseline max | Think max |
|---|---|---|---|---|
| L6 | 0.2% | 0.2% | 11,679 | 17,013 (+45.6%) |
| L35 | 8.9% | 8.1% | 3,896 | 7,272 (+86.7%) |
Emotion probe results
| Category | L6 mean | L6 max | vs. neutral |
|---|---|---|---|
| Emotion | 7,218 | 10,405 | +48% |
| High valence | 6,383 | 7,639 | +31% |
| Neutral | 4,874 | 5,438 | baseline |
The gate fires on the first token of the sentence. The model decides at onset whether deep processing is needed.