Tags: interpretability · sparse-autoencoders · gemma-4 · datasets

60 Layers of Interpretability: Publishing Gemma-4-31B SAE Features on HuggingFace

3,000 interpreted and verified sparse autoencoder features for every layer of Google's Gemma-4-31B.

By Adam Kruger

We trained sparse autoencoders on all 60 layers of Gemma-4-31B and are releasing the results — weights, labels, verification scores, and activation examples — as an open dataset on HuggingFace.

What We're Releasing

Dataset: Adam1010/gemma-4-31b-sae-features

For each of the 60 transformer layers:

  • Trained SAE weights (TopK-64 sparsity, 43,008 features, 8x expansion from d_model=5376)
  • 50 selected features per layer with full metadata
  • Dual auto-interpretation — each feature independently labeled by Claude Sonnet and GPT-4o
  • Cross-validation results — labels tested against held-out activation examples
  • Linear probe accuracy — per-feature classifier validation
  • Confidence scores — weighted composite of all verification signals
  • SIPIT scores — measuring how far each layer's activations are from the token embedding space

Total: 3,000 interpreted features with 104GB of trained SAE weights.
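The released checkpoints define the actual parameterization; as a rough orientation, here is a minimal NumPy sketch of the TopK mechanism named above, run at toy dimensions (the real SAEs use d_model=5376 → 43,008 features with k=64, and the weight names here are illustrative, not the checkpoint keys):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy dims for the sketch; the release uses d_model=5376 -> 43,008 features, k=64.
d_model, n_feat, k = 64, 512, 8

W_enc = rng.standard_normal((d_model, n_feat)).astype(np.float32) * 0.05
W_dec = rng.standard_normal((n_feat, d_model)).astype(np.float32) * 0.05
b_enc = np.zeros(n_feat, dtype=np.float32)
b_dec = np.zeros(d_model, dtype=np.float32)

def topk_sae(x):
    """Encode with TopK sparsity: keep only the k largest pre-activations."""
    pre = (x - b_dec) @ W_enc + b_enc     # (n_feat,) pre-activations
    keep = np.argsort(pre)[-k:]           # indices of the k largest
    z = np.zeros_like(pre)
    z[keep] = np.maximum(pre[keep], 0.0)  # ReLU on the survivors
    return z, z @ W_dec + b_dec           # sparse code, reconstruction

x = rng.standard_normal(d_model).astype(np.float32)
z, x_hat = topk_sae(x)
print(int((z != 0).sum()))  # at most k=8 active features
```

The TopK constraint is what makes the 8x-overcomplete dictionary usable: only a handful of the 43,008 features fire on any given activation, so each firing feature carries an interpretable signal.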

The Verification Pipeline

We didn't just train the SAEs and call it done. Every feature went through a 6-phase verification pipeline:

  1. SIPIT Scoring — measures the "invertibility" of each layer's activations relative to the token embedding space. This tells you which layers are closest to tokens (early and late) and which are in the deep abstract integration zone (around layer 27).

  2. Feature Selection — top 50 features per layer ranked by SIPIT divergence and activation statistics. Not random — the most informative features in each layer.

  3. Dual Auto-Interpretation — two independent LLMs (Claude Sonnet and GPT-4o) label each feature based on its top-activating examples. Agreement between the two serves as a reliability signal.

  4. Cross-Validation — labels tested on new activation examples the interpreters hadn't seen. 35.8% confirmation rate — honest, not inflated.

  5. Linear Probes — per-feature classifiers trained to predict the label from the activation. 99.3% accuracy for high-confidence features.

  6. Confidence Scoring — weighted composite: 20% agreement, 35% cross-validation, 35% probe accuracy, 10% activation health.
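The composite in step 6 can be written down directly from the weights above. How each signal is normalized to [0, 1] is our assumption here, not something the pipeline spec pins down:

```python
def confidence_score(label_agreement, cross_val, probe_acc, activation_health):
    """Weighted composite from the pipeline: 20% interpreter agreement,
    35% cross-validation, 35% probe accuracy, 10% activation health.
    All inputs are assumed to be normalized to [0, 1]."""
    return (0.20 * label_agreement
            + 0.35 * cross_val
            + 0.35 * probe_acc
            + 0.10 * activation_health)

print(confidence_score(1.0, 1.0, 1.0, 1.0))  # 1.0
```

Because cross-validation and probe accuracy carry 70% of the weight between them, a feature can only land in the high-confidence tier if its label actually predicts held-out behavior, not just because the two interpreters agreed with each other.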

The SIPIT Profile

The SIPIT scores reveal Gemma-4's three processing phases:

  • Layers 0-22: Gradual descent from token space (cosine similarity drops from 0.56 to 0.25)
  • Layers 23-28: The integration zone — deepest abstraction, cosine drops to 0.178 at layer 27
  • Layers 57-59: Token rendering — cosine jumps back to 0.87-0.92

This three-phase pattern (early processing → deep integration → token rendering) appears across architectures. We've seen the same profile in Qwen-3-4B (integration at 44% depth) and Gemma-3-1B (integration at 42% depth). An integration depth around 40-45% has held for every attention-based transformer we've profiled so far.
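SIPIT's exact definition isn't spelled out here; assuming it tracks the cosine similarity between a layer's activations and the nearest token embedding (which matches the numbers quoted above), the core measurement is a one-liner:

```python
import numpy as np

def max_cosine_to_embedding(h, E):
    """Best cosine match between a hidden state h, shape (d,), and the rows
    of the token-embedding matrix E, shape (vocab, d). High values mean the
    activation still lives near token space; low values mean deep abstraction."""
    h_n = h / np.linalg.norm(h)
    E_n = E / np.linalg.norm(E, axis=1, keepdims=True)
    return float((E_n @ h_n).max())

E = np.eye(4)  # toy 4-token "vocabulary" of unit basis vectors
print(max_cosine_to_embedding(np.array([3.0, 0.0, 0.0, 0.0]), E))  # 1.0
```

Sweeping this over every layer's residual stream is what produces the U-shaped profile: near 1.0 at the embedding and unembedding ends, bottoming out in the integration zone.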

How to Use It

import json

# Load all features
with open("gemma4_sae_features.json") as f:
    features = json.load(f)

# Find high-confidence features at the integration layer
integration_features = [
    f for f in features
    if f["layer"] == 27 and f["confidence_tier"] == "high"
]

# See what the model is "thinking about" at its deepest abstraction
for f in integration_features[:5]:
    print(f"Feature {f['feature_id']}: {f['claude_label'][:80]}")

Why This Matters

Mechanistic interpretability needs data. Training SAEs is expensive — the full pipeline for 60 layers took significant compute. By releasing the weights AND the verification data, we're making it possible for researchers to:

  • Study how features evolve across layers without retraining
  • Build on verified interpretations instead of starting from scratch
  • Compare feature behavior across the early-processing, deep-integration, and token-rendering phases
  • Use the SIPIT profile to identify which layers matter for their research

The dataset is Apache 2.0 licensed. Use it, build on it, cite it.

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.