Blog

Research notes, infrastructure war stories, and updates from the workshop.

#memory #inference #interpretability #architecture

What If a Model Could Remember What It Learned?

We’re building activation-level memory for AI inference. The goal is a model that thinks differently because of what it has experienced before, not just one with more text in its prompt. This post covers early results and a real selectivity number; a provisional patent is on the way.

#interpretability #topology #mechanistic #sparse-autoencoders

Topology of Thought — Three-Instrument Convergence

Three independent measurement instruments (persistent homology, SipIt invertibility, and SAE decomposition) converge on a universal three-phase structure inside transformer residual streams. Confirmed across 4 attention-based architectures; falsified in Mamba.

#interpretability #topology #state-space-models #falsification

The Mamba Counterexample

If topological integration were universal, state space models should show it too. They don’t. Mamba-370m maintains fragmented representations end-to-end.

#interpretability #mechanistic #sparse-autoencoders #transformers

Emergent Computational Gating in Dense Transformers

Dense transformers develop bimodal processing gates at layers 3–4 that nobody designed, a pattern confirmed across three model families. On deep layers, standard SAEs fail badly (-3,059%); a SipIt + SAE + GLP pipeline recovers those layers.

#interpretability #safety #infrastructure #sparse-autoencoders

MRI for AI — Activation-Level Detection

What if you could read what a model is actually computing while it generates an answer? 182 runs across 29 models, 0.96 calibration accuracy. Activation-based detection is about 10× more reliable than text-only analysis.

#safety #deception #benchmark #evaluation

Agent Deception Benchmark

We measured self-assessment calibration across 29 frontier models. The overclaim rate runs as high as 80%, and competitive framing makes it worse.

#evaluation #alignment #goodhart #math

The Goodhart Gap

Four of five frontier models correctly explained a discount calculation but produced the wrong final answer. The gap shrinks with scale but persists. 61,678 math problems evaluated.
