← Blog
#max-engine#modular#dgx-spark#infrastructure

Running Gemma-4-31B on DGX Spark with MAX Engine — What Broke and How We Fixed It

Getting Modular's MAX Engine to serve a 31B parameter model on NVIDIA's smallest Grace Blackwell system.

If you have a DGX Spark and tried to run MAX Engine, you probably hit a wall. We did too. Here's what we found, what we fixed, and what's still broken — so you don't have to spend a weekend debugging it.

The Setup

  • NVIDIA DGX Spark (GB10, ARM64 Grace CPU, 128GB unified memory)
  • MAX Engine 26.4.0 nightly
  • Gemma-4-31B-IT

We wanted to serve Gemma-4 locally for research. MAX Engine claims day-zero Gemma-4 support and DGX Spark compatibility. The reality was more nuanced.

Issue 1: Memory Estimation on Unified Memory

MAX's memory estimator reads CUDA-reported GPU memory (~34GB) and rejects any model larger than that. On the DGX Spark, the GPU and CPU share a 128GB unified memory pool — but CUDA only reports the GPU's nominal allocation as "GPU memory."

The fix: Detect unified memory systems (check /sys/class/dmi/id/product_name for "DGX_Spark" or "AI TOP ATOM") and read system available memory from /proc/meminfo instead of CUDA's free_memory stat.

This is a one-function patch to MemoryEstimator.free_memory() in MAX's pipeline configuration.

Issue 2: Pydantic Type Resolution

MAX's PipelineConfig and MAXModelConfig use deferred type annotations that reference PyTorch tensor types. If you don't import torch before the config objects are created, Pydantic throws a PydanticUserError about models not being "fully defined."

The fix: Import torch and call model_rebuild() on both config classes before running the CLI.

Issue 3: CUDA Memory Overallocation

Without constraints, CUDA allocates the entire unified memory pool during model loading and graph compilation. On a desktop GPU with separate VRAM, this is fine — the OS has its own memory. On unified memory, CUDA eating everything kills the OS.

The fix: We use a CUDA memory limiter shim — a shared library loaded via LD_PRELOAD that intercepts cuMemAlloc at the driver level and enforces a hard ceiling. Set it to 90-96GB to leave headroom for the OS. (We've open-sourced this tool separately — link at bottom.)

Issue 4: The Sampling Kernel (Still Open)

With all three patches applied, MAX Engine successfully:

  • Compiles the Gemma-4 language model graph (285 seconds)
  • Compiles the vision model (5 seconds)
  • Loads weights into unified memory
  • Captures device graphs
  • Starts the OpenAI-compatible API server

It even responds to inference requests — with greedy sampling (temperature=0).

But the top-K/top-P sampling kernel crashes with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES. The kernel at topk_fi.mojo:1419 requests more registers or threads per block than the GB10's SM 12.1 architecture supports.

This is a kernel tuning issue inside MAX's compiled core — we can't patch it ourselves. We've filed a detailed bug report with full reproduction steps (modular/modular#6488).

Workaround: Use temperature: 0 for greedy decoding. It bypasses the broken sampling kernel entirely and produces correct output.

Results

With all patches applied:

What Status
Model compilation Works (290s total)
Weight loading Works
Server startup Works
Greedy sampling Works
Top-K/Top-P sampling Broken (kernel resource limit)

Gemma-4-31B serves inference on a DGX Spark through MAX Engine. Not perfectly — but it works.

For DGX Spark Owners

If you want to try this yourself:

  1. Install the CUDA memory limiter shim before anything else
  2. Apply the unified memory patch to MAX's memory estimator
  3. Use --max-batch-size 4 to keep graph capture memory reasonable
  4. Set temperature: 0 until Modular fixes the sampling kernel

We're happy to share our wrapper script and patches. Reach out or check the GitHub issue for details.