Why I’m Building Frontier Interpretability on a Homelab — and Betting on Mojo to Get There

Serious AI research is supposed to need a data center. I’m wagering it doesn’t — a quiet home Blackwell cluster, custom Mojo kernels nobody else is writing, and upstream contributions to MAX. Here’s the why behind the push.

There's an assumption baked into modern AI research, and almost nobody questions it: that serious work requires a data center. Racks of H100s, a cloud bill with a comma in it, someone else's hardware metered by the hour. For a lot of research, that's simply true. But for the work I actually care about — looking inside models, not just running them — I've come to believe it's a habit, not a law.

The economics nobody questions

Mechanistic interpretability is unusually demanding in a specific way. You need the full-precision weights of a frontier-scale model, and you need to read its activations — the intermediate state, across every layer, for thousands of tokens. A commercial API won't give you that; you can't see inside a black box you're only allowed to query. And a single consumer GPU can't hold a 400-billion-parameter model in the first place. So the field quietly sorts itself into people with hyperscale budgets and everyone else.
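To put a rough number on "activations across every layer, for thousands of tokens", here is a back-of-the-envelope sketch. The layer count, hidden size, and token count are hypothetical, picked only for illustration, and the estimate assumes one 16-bit hidden-state vector per layer per token.

```python
def activation_bytes(num_layers: int, num_tokens: int, hidden_size: int,
                     bytes_per_value: int = 2) -> int:
    """Bytes needed to keep one hidden-state vector per layer per token.

    Assumes 16-bit values by default (bytes_per_value=2).
    """
    return num_layers * num_tokens * hidden_size * bytes_per_value


# Hypothetical dimensions, not taken from any specific model.
footprint = activation_bytes(num_layers=120, num_tokens=8192, hidden_size=16384)
print(f"{footprint / 1e9:.1f} GB")  # prints 32.2 GB
```

That covers a single prompt's hidden states only. Capturing attention patterns or MLP intermediates as well would push the figure higher.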

I didn't love being in the second group. So I designed my way toward the first — on a homelab budget.

The bet: a basement Blackwell cluster

NVIDIA's DGX Spark, built on the Grace Blackwell GB10, does something interesting: it fuses CPU and GPU into one coherent, unified memory space. The official guidance caps you at two "stacked" nodes. But that cap is a software-support boundary, not a hardware wall. On paper, a ring of ten of them yields roughly 1.28 TB of aggregate Blackwell memory (10 × 128 GB, unified within each node rather than across the ring). That is enough to load a 400B+ model with unquantized 16-bit weights, about 800 GB, with hundreds of gigabytes to spare for activations and context.
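The memory arithmetic can be sketched in a few lines. This assumes 128 GB of unified memory per node and 2 bytes per parameter (16-bit weights). The function name and the returned keys are illustrative, not part of any tool.

```python
NODE_MEMORY_GB = 128  # assumption: unified memory per DGX Spark node


def cluster_budget(num_nodes: int, params_billions: int,
                   bytes_per_param: int = 2) -> dict:
    """Aggregate memory, weight footprint, and headroom, all in GB.

    One billion parameters at N bytes each is N GB, so the weight
    footprint is simply params_billions * bytes_per_param.
    """
    total_gb = num_nodes * NODE_MEMORY_GB
    weights_gb = params_billions * bytes_per_param
    return {
        "total_gb": total_gb,
        "weights_gb": weights_gb,
        "headroom_gb": total_gb - weights_gb,
    }


print(cluster_budget(num_nodes=10, params_billions=400))
# prints {'total_gb': 1280, 'weights_gb': 800, 'headroom_gb': 480}
```

The headroom is an upper bound: each node's operating system and runtime take their own share, and at 4 bytes per parameter the same model would need 1,600 GB and no longer fit.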

The catch is the interconnect. The Spark's embedded ConnectX-7 doesn't expose a single fat 400 Gbps lane the way a server card does; its real, sustained per-port throughput is far lower. A naive design reaches for a high-speed switch, but a 400GbE switch costs more than several of the nodes, idles at jet-engine volume, and burns power around the clock. None of that survives contact with a quiet, low-carbon home. So the design goes the other way: a software-defined ring topology over RoCE v2, no switch, with the workload arranged as pipeline parallelism, where each node only hands activations to its next neighbor, rather than the all-to-all chatter of tensor parallelism. Nodes 3 through 10 sleep until a "research mode" wakes them. Bought in phases, the whole thing lands near $55k, for capacity that normally implies a million-dollar cluster.
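As a minimal sketch of why pipeline parallelism suits a switchless ring: each node owns a contiguous slice of layers and only ever sends to the next node around the ring. The function names and the 126-layer figure are hypothetical, and nodes are numbered from 0 here rather than from 1.

```python
def assign_stages(num_layers: int, num_nodes: int) -> list[range]:
    """Split layers into contiguous pipeline stages, one per node.

    When the split is uneven, the earliest nodes take one extra layer.
    """
    base, extra = divmod(num_layers, num_nodes)
    stages = []
    start = 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def next_hop(node: int, num_nodes: int) -> int:
    """Ring neighbor that receives this node's output activations."""
    return (node + 1) % num_nodes


stages = assign_stages(num_layers=126, num_nodes=10)
for node, layers in enumerate(stages):
    print(f"node {node}: layers {layers.start}-{layers.stop - 1} "
          f"-> sends to node {next_hop(node, len(stages))}")
```

Every link carries one stage's output activations in one direction, so no node needs a path to more than its two neighbors. The wrap-around hop from the last node back to node 0 is what closes the ring.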

The "unsupported" scalability of the Spark isn't a hardware impossibility. It's a software-orchestration problem — and software is the part I can change.

Where it actually gets interesting: the kernels

Here's the part I want to be honest about. Stacking DGX Sparks isn't novel — plenty of people have bought a handful and wired them together. The hardware is available; the llama.cpp crowd has already pushed it around. What I haven't seen anyone else doing is writing custom kernels to exploit this specific architecture as hard as it can be exploited — treating the GB10's unified memory and the ring's pipeline as a first-class target rather than something to run a generic container on top of.

That's the difference between assembling hardware and building an instrument. And it's where Mojo comes in.

Why I bet on Mojo

Mojo is the first language I've used that is genuinely CPU- and GPU-first, with zero-cost abstractions that let you write something Pythonic and have it compile down to the metal. It gives you real SIMD and the ability to actually reach the hardware instead of negotiating with three layers of framework to get there. And it's explicit — which, if you've read anything else here, you know is the quality I trust most in a tool. It doesn't hide the machine from you.

The ecosystem is small. It's growing more slowly than Rust did, and I'm fine with that, partly because the things I want to do don't depend on a crowd, and partly because being early is the only time your contributions actually shape where something goes.

Proof, on the table

Talk is cheap, so here's the receipt. Over a focused stretch, I contributed consumer-Blackwell support upstream to modular/modular — the kind of plumbing that has to exist before any of the cluster work is possible:

- #6593: an arch-specific sm_121a target for the DGX Spark / GB10.
- #6599: sm_121 / sm_121a platform-detection predicates, split out so they could land cleanly on their own.
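For readers unfamiliar with the naming, here is a hypothetical Python sketch of what a target name like sm_121a encodes and what a detection predicate checks. It is not the code from either PR, and none of these names come from the Mojo sources. It assumes the usual NVIDIA convention: the last digit is the minor version, the digits before it are the major version, and a trailing "a" marks an architecture-specific target.

```python
import re
from typing import NamedTuple


class SmTarget(NamedTuple):
    major: int
    minor: int
    arch_specific: bool


def parse_sm_target(name: str) -> SmTarget:
    """Parse names like 'sm_121' or 'sm_121a' into their components."""
    match = re.fullmatch(r"sm_(\d+)(\d)(a?)", name)
    if match is None:
        raise ValueError(f"not an sm target name: {name!r}")
    return SmTarget(int(match.group(1)), int(match.group(2)),
                    match.group(3) == "a")


def is_sm_121(target: SmTarget) -> bool:
    """True for compute capability 12.1, with or without the 'a' suffix."""
    return (target.major, target.minor) == (12, 1)


def is_sm_121a(target: SmTarget) -> bool:
    """True only for the architecture-specific 12.1 target."""
    return is_sm_121(target) and target.arch_specific


print(parse_sm_target("sm_121a"))
# prints SmTarget(major=12, minor=1, arch_specific=True)
```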

Both were reviewed by Modular's Brad Larson and merged into the upstream Mojo sources. Small patches, in the scheme of things — but they're the foundation the rest of this is built on, and they're in the toolchain now, for anyone running on consumer Blackwell.

Why I push so hard for this

It would be easy to read all of this as a grant pitch, and yes — contributing to the ecosystem is part of how the Modular grant program works. But that's not really why I do it. I do it because I want the kind of AI research that matters to be possible outside the hyperscale economy — so that understanding what these systems are doing isn't a privilege reserved for whoever owns the most GPUs. And selfishly, I do it because being early to a small, growing ecosystem is how you find your people. I'd rather help build the thing I want to belong to than wait to be let into someone else's.

I got curious about what's actually happening inside these models. That curiosity turned into a research program, which turned into a hardware problem, which turned into writing kernels for a language most people haven't tried yet. Every step followed from the one before it. None of it was a plan. That's usually how the good work starts.