Blackwell Shenanigans 003: Gemma 4 MTP on a B300

The brief was wonderfully irresponsible in the useful way: Google had released Gemma 4 assistant drafters, SGLang had a fresh Gemma 4 MTP path, and we had a single Yotta B300 with 275,040 MiB of VRAM (about 269 GiB) sitting there looking too expensive to leave idle.

So we did the obvious thing: put google/gemma-4-26B-A4B-it on the B300, pair it with google/gemma-4-26B-A4B-it-assistant, serve it through SGLang, and make a little race dashboard because watching throughput tick upward is more emotionally honest than pretending this is normal lab behavior.

The useful result:

Gemma 4 MTP averaged 1.628x the decode throughput of target-only SGLang across the nine-cell test matrix.

No multi-GPU heroics. No vendor endpoint. One NVIDIA B300 SXM6 AC, one SGLang server at a time, baseline first, MTP second.

What Changed Upstream

SGLang PR #24436 added Gemma 4 MTP support and was merged on 2026-05-07.

The important bit is that Gemma 4's assistant checkpoints are not ordinary EAGLE3-style draft models. The SGLang PR introduces a FROZEN_KV_MTP speculative path where the assistant uses the target model's KV cache instead of owning a separate KV cache.

In practice, the launcher can still pass:

--speculative-algorithm NEXTN
--speculative-draft-model-path google/gemma-4-26B-A4B-it-assistant
--speculative-num-steps 5
--speculative-num-draft-tokens 6
--speculative-eagle-topk 1

SGLang recognizes the Gemma 4 assistant and promotes the path internally to the Gemma-specific Frozen-KV MTP implementation.
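For readers who have not stared at speculative decoding before, here is a toy sketch of one greedy draft-then-verify round with top-k 1. It is not SGLang's Frozen-KV implementation and says nothing about how the KV cache is shared; draft_next, target_next, and speculative_step are hypothetical names invented for illustration. The point it shows is that the emitted tokens are always the ones the target alone would have produced, and that a round yields between 1 and num_steps + 1 tokens.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(
    prefix: List[Token],
    draft_next: NextFn,   # hypothetical: cheap drafter, greedy next token
    target_next: NextFn,  # hypothetical: full target model, greedy next token
    num_steps: int = 5,
) -> List[Token]:
    """Return the tokens emitted by one greedy draft-then-verify round."""
    # 1. The drafter proposes num_steps tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(num_steps):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target checks each proposal against its own greedy choice.
    #    A real server verifies all proposals in one batched forward pass;
    #    the per-token loop here is only for readability.
    emitted: List[Token] = []
    for tok in proposed:
        expected = target_next(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # the target's token replaces the first miss
            return emitted
        emitted.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    emitted.append(target_next(prefix + emitted))
    return emitted


if __name__ == "__main__":
    # Toy "models": the target counts upward, the drafter agrees only early on.
    target = lambda ctx: ctx[-1] + 1
    shaky_draft = lambda ctx: target(ctx) if len(ctx) < 3 else -1
    print(speculative_step([0], target, target))       # all five drafts accepted
    print(speculative_step([0], shaky_draft, target))  # two accepted, then a correction
```

Under this toy accounting, five draft steps produce at most six tokens per verify pass. That happens to line up with --speculative-num-steps 5 and --speculative-num-draft-tokens 6, but treat that reading as an assumption rather than a statement about SGLang internals.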

Hardware And Stack

The test box was a Yotta B300 instance:

GPU: NVIDIA B300 SXM6 AC
VRAM: 275040 MiB
Driver: 580.126.09
OS: Ubuntu 22.04.5
Python: 3.10.12

The serving stack was pinned instead of trusting whatever the day had lying around:

SGLang:       0.5.12.dev155+gbcf8d100d
SGLang commit: bcf8d100d06a70009f5b591652464f1e7ff86116
Transformers: 5.8.0.dev0
Transformers commit: 2c7d385621c80fee70c1472f3a622fcba2c93fb9
Torch:        2.11.0+cu130
CUDA home:    /usr/local/cuda-13.0

B300 reports sm_103, so the CUDA 12.8 toolchain was not enough for the JIT path. The final box needed CUDA 13 NVCC plus the CUDA 13 cuRAND development package:

apt-get install -y cuda-nvcc-13-0 libcurand-dev-13-0 protobuf-compiler

That was not glamorous. It was, however, the difference between "Blackwell benchmark" and "compiler error wearing a lab coat."

The Actual SGLang Config

Baseline used the same target model without speculative flags:

sglang serve \
  --model-path google/gemma-4-26B-A4B-it \
  --served-model-name google/gemma-4-26B-A4B-it \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 1 \
  --trust-remote-code \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --mem-fraction-static 0.85 \
  --context-length 262144 \
  --max-running-requests 48 \
  --enable-metrics

The MTP run added:

--speculative-algorithm NEXTN
--speculative-draft-model-path google/gemma-4-26B-A4B-it-assistant
--speculative-num-steps 5
--speculative-num-draft-tokens 6
--speculative-eagle-topk 1
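To make the "same server, five extra flags" relationship explicit, the two launch commands can be sketched as one argument list plus an optional suffix. This is a minimal illustration, not the launcher that produced the run; build_command is a hypothetical helper, and every flag value is copied from the commands above.

```python
from typing import List

# Flags shared by both runs, copied from the baseline command.
BASELINE_ARGS: List[str] = [
    "sglang", "serve",
    "--model-path", "google/gemma-4-26B-A4B-it",
    "--served-model-name", "google/gemma-4-26B-A4B-it",
    "--host", "0.0.0.0",
    "--port", "30000",
    "--tp", "1",
    "--trust-remote-code",
    "--reasoning-parser", "gemma4",
    "--tool-call-parser", "gemma4",
    "--mem-fraction-static", "0.85",
    "--context-length", "262144",
    "--max-running-requests", "48",
    "--enable-metrics",
]

# The only difference between the two runs.
MTP_EXTRA_ARGS: List[str] = [
    "--speculative-algorithm", "NEXTN",
    "--speculative-draft-model-path", "google/gemma-4-26B-A4B-it-assistant",
    "--speculative-num-steps", "5",
    "--speculative-num-draft-tokens", "6",
    "--speculative-eagle-topk", "1",
]


def build_command(use_mtp: bool) -> List[str]:
    """Return the argv for the baseline or the MTP server (hypothetical helper)."""
    return BASELINE_ARGS + (MTP_EXTRA_ARGS if use_mtp else [])


if __name__ == "__main__":
    print(" ".join(build_command(use_mtp=False)))
    print(" ".join(build_command(use_mtp=True)))
```

Keeping the shared flags in one list is the cheap way to guarantee the comparison stays apples to apples: if the memory fraction or context length changes, it changes for both runs. The resulting list can be handed to subprocess.Popen as-is.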

The launch environment also mattered:

CUDA_HOME=/usr/local/cuda-13.0
FLASHINFER_DISABLE_VERSION_CHECK=1
SAFETENSORS_FAST_GPU=1
SGLANG_DISABLE_DEEP_GEMM=1
SGLANG_ENABLE_DEEP_GEMM=0
SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
OMP_NUM_THREADS=8
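Since a wrong CUDA_HOME was the difference between a benchmark and a compiler error, a small preflight is worth sketching. The snippet below overlays the variables listed above onto an existing environment and checks that an nvcc binary exists under CUDA_HOME. launch_env and nvcc_path are hypothetical helpers, and the bin/nvcc layout is the conventional toolkit layout, assumed here rather than taken from the run.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Values copied verbatim from the launch environment above.
LAUNCH_ENV: Dict[str, str] = {
    "CUDA_HOME": "/usr/local/cuda-13.0",
    "FLASHINFER_DISABLE_VERSION_CHECK": "1",
    "SAFETENSORS_FAST_GPU": "1",
    "SGLANG_DISABLE_DEEP_GEMM": "1",
    "SGLANG_ENABLE_DEEP_GEMM": "0",
    "SGLANG_JIT_DEEPGEMM_PRECOMPILE": "0",
    "OMP_NUM_THREADS": "8",
}


def launch_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the base environment (default: os.environ) and overlay LAUNCH_ENV."""
    env = dict(os.environ if base is None else base)
    env.update(LAUNCH_ENV)
    return env


def nvcc_path(env: Mapping[str, str]) -> Optional[Path]:
    """Return CUDA_HOME/bin/nvcc if that file exists, otherwise None."""
    candidate = Path(env["CUDA_HOME"]) / "bin" / "nvcc"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    env = launch_env()
    found = nvcc_path(env)
    print(f"nvcc under CUDA_HOME: {found if found else 'MISSING'}")
```

Failing fast on a missing nvcc is cheaper than discovering it partway through a JIT compile on a machine billed by the hour.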

The actual launcher that produced the run is here:

The Benchmark Command

This was the benchmark shape for the result below:

VARIANT=26B-A4B \
MODEL=google/gemma-4-26B-A4B-it \
TP_SIZE=1 \
HOST=127.0.0.1 \
PORT=30000 \
CONCURRENCY=1,4,8 \
CONTEXTS=0,8192,32768 \
DURATION=30 \
MAX_TOKENS=8192 \
IGNORE_EOS=1 \
MIN_TOKENS=8192 \
WAIT_TIMEOUT=3600 \
MEM_FRACTION_STATIC=0.85 \
MAX_RUNNING_REQUESTS=48 \
scripts/bench-gemma4-mtp-sglang.sh
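The two comma-separated axes expand into a grid of decode cells. The sketch below assumes the script runs every context/concurrency pairing, which matches the nine rows in the results table but is not taken from the script itself; parse_axis and nominal_decode_s are hypothetical names.

```python
from itertools import product
from typing import List, Tuple


def parse_axis(value: str) -> List[int]:
    """Turn a comma-separated setting such as '1,4,8' into a list of ints."""
    return [int(item) for item in value.split(",")]


contexts = parse_axis("0,8192,32768")  # CONTEXTS
concurrency = parse_axis("1,4,8")      # CONCURRENCY
duration_s = 30                        # DURATION

# Assumed: one decode cell per (context, concurrency) pairing.
cells: List[Tuple[int, int]] = list(product(contexts, concurrency))

# Nominal decode time per variant, ignoring server startup, warmup, and prefill.
nominal_decode_s = len(cells) * duration_s

if __name__ == "__main__":
    for ctx, conc in cells:
        print(f"context={ctx:>6}  concurrency={conc}")
    print(f"{len(cells)} cells, about {nominal_decode_s}s of decode per variant")
```

With these settings that is 9 cells and 270 seconds of nominal decode for each of the two variants, before any server startup or prefill time.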

The dashboard watched:

results/gemma4-mtp-sglang/baseline.live.json
results/gemma4-mtp-sglang/mtp_topk1.live.json
results/gemma4-mtp-sglang/logs/baseline.log
results/gemma4-mtp-sglang/logs/mtp_topk1.log

The final public artifacts are here:

Why Forced Decode Was Necessary

The first pass used MAX_TOKENS=2048 and let the model stop naturally. That produced a tempting but broken comparison: several long-context decode cells ended after roughly one second because the model simply answered and stopped.

That is not a throughput benchmark. That is Gemma politely leaving the room.

The final pass set:

IGNORE_EOS=1
MIN_TOKENS=8192
MAX_TOKENS=8192

That forced the decode cells to run for the configured window, except for one fast 0-context, concurrency=1 MTP cell that hit the 8192-token cap at about 23.2s (8192 tokens at 352.938 tok/s). Every other comparison cell ran at roughly 30s.

Results

Forced-output aggregate decode throughput:

Context tokens	Concurrency	Baseline tok/s	MTP tok/s	Speedup	MTP accept rate
0	1	188.961	352.938	1.868x	1.0000
0	4	592.781	1013.181	1.709x	0.9600
0	8	1035.166	1785.913	1.725x	0.8431
8192	1	171.862	288.548	1.679x	0.9700
8192	4	560.446	792.961	1.415x	0.8875
8192	8	974.841	1784.425	1.830x	0.9300
32768	1	135.122	189.541	1.403x	0.9950
32768	4	475.501	744.750	1.566x	1.0000
32768	8	881.758	1284.233	1.456x	0.9012

Average speedup across the matrix: 1.628x.

Minimum speedup: 1.403x.

Maximum speedup: 1.868x.

That is a real result. The interesting part is not just the zero-context speedup, either. The 32k context cells still improved:

32k / C1: 1.403x
32k / C4: 1.566x
32k / C8: 1.456x

The earlier long-context cliff disappeared once the benchmark forced sustained decode.
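The summary figures can be recomputed from the table in a few lines. The throughput values below are copied from the results table; the average is the plain arithmetic mean of the nine per-cell ratios, which is an assumption about how the headline number was derived, though it reproduces it.

```python
from statistics import mean
from typing import Dict, Tuple

# (context tokens, concurrency, baseline tok/s, MTP tok/s), from the results table.
ROWS = [
    (0, 1, 188.961, 352.938),
    (0, 4, 592.781, 1013.181),
    (0, 8, 1035.166, 1785.913),
    (8192, 1, 171.862, 288.548),
    (8192, 4, 560.446, 792.961),
    (8192, 8, 974.841, 1784.425),
    (32768, 1, 135.122, 189.541),
    (32768, 4, 475.501, 744.750),
    (32768, 8, 881.758, 1284.233),
]

# Per-cell speedup is MTP throughput divided by baseline throughput.
speedups: Dict[Tuple[int, int], float] = {
    (ctx, conc): mtp / base for ctx, conc, base, mtp in ROWS
}

summary = {
    "mean": round(mean(list(speedups.values())), 3),
    "min": round(min(speedups.values()), 3),
    "max": round(max(speedups.values()), 3),
}

if __name__ == "__main__":
    for (ctx, conc), ratio in speedups.items():
        print(f"{ctx:>6} / C{conc}: {ratio:.3f}x")
    print(summary)  # {'mean': 1.628, 'min': 1.403, 'max': 1.868}
```

A mean of ratios weights every cell equally, so the slow 32k/C1 cell counts as much as the fast 0/C8 cell. That suits a "how much does the knob help per scenario" question better than dividing summed throughputs would.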

Why There Are Zero-Context Rows

The 0 context rows are not the target workload. They are the control lane.

They answer: how much does Gemma 4 MTP help pure decode before cached-prefix attention, KV-cache pressure, and long-context memory traffic enter the story?

There are three zero-context rows because the second axis is concurrency:

context axis:      0, 8k, 32k
concurrency axis:  1, 4, 8

So 0/C1, 0/C4, and 0/C8 are three different serving situations, not three copies of the same test. That matters because speculative decoding is batch-sensitive. The control lane tells us whether MTP is helping the decode loop itself, and the 8k/32k lanes tell us how much of that survives once context is involved.

What Happened At 100k?

We did not run a 100k or 128k decode cell in the final comparison matrix.

The benchmark did run prefill/TTFT checks up to 128k before decode:

Context tokens	Baseline TTFT	Baseline prefill tok/s	MTP TTFT	MTP prefill tok/s
8192	0.137s	73,160	0.131s	76,424
16384	0.319s	55,739	0.315s	56,335
32768	1.027s	32,692	1.018s	32,949
65536	3.839s	17,185	3.831s	17,215
131072	14.717s	8,921	14.671s	8,949

That proves the long-prefix ingest path worked on this B300. It does not prove decode throughput at 100k+ context under concurrency.

The next long-context race card should be 64k and 128k decode with lower concurrency first:

CONTEXTS=0,32768,65536,131072
CONCURRENCY=1,2,4
DURATION=45
MAX_TOKENS=8192
IGNORE_EOS=1
MIN_TOKENS=8192

I would not jump straight to 128k/C8 until the token budget is checked. concurrency * (context + max_tokens) gets impolite very quickly: at 128k/C8 it comes to 8 * (131072 + 8192) = 1,114,112 tokens if the requests share no prefix.
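The budget arithmetic is easy to tabulate for the proposed race card. This is a worst-case sketch that assumes no prefix sharing between concurrent requests, so every request pays for its full context plus its full output. token_budget and fits are hypothetical helpers, and kv_pool_tokens is a placeholder for whatever KV capacity the server actually reports; no real capacity figure is assumed here.

```python
from itertools import product


def token_budget(concurrency: int, context: int, max_tokens: int) -> int:
    """Worst-case resident tokens: every request holds context plus full output."""
    return concurrency * (context + max_tokens)


def fits(concurrency: int, context: int, max_tokens: int, kv_pool_tokens: int) -> bool:
    """True if the worst-case budget fits in a pool of kv_pool_tokens (placeholder)."""
    return token_budget(concurrency, context, max_tokens) <= kv_pool_tokens


if __name__ == "__main__":
    contexts = [0, 32768, 65536, 131072]  # proposed CONTEXTS
    concurrency = [1, 2, 4]               # proposed CONCURRENCY
    max_tokens = 8192                     # proposed MAX_TOKENS
    for ctx, conc in product(contexts, concurrency):
        print(f"context={ctx:>6} C{conc}: {token_budget(conc, ctx, max_tokens):>9,} tokens")
    # The cell the text warns about, for comparison:
    print(f"context=131072 C8: {token_budget(8, 131072, max_tokens):>9,} tokens")
```

For scale, the largest cell in the finished matrix (32k/C8) needs 327,680 tokens under the same accounting, the proposed 128k/C4 cell needs 557,056, and 128k/C8 needs 1,114,112. Requests that do share a cached prefix would need less than this worst case.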

What I Believe

This is worth treating as a practical serving knob, not just a fun PR stunt.

The B300 had enough memory to serve the 26B-A4B target plus assistant on one GPU with TP_SIZE=1. The MTP path loaded cleanly, captured the Frozen-KV draft CUDA graphs, passed smoke tests, and beat baseline decode throughput in all nine measured cells.

The acceptance rates were also strong. High throughput with terrible acceptance is just a benchmark wearing fake glasses. In this run the accept rate ranged from 0.8431 to 1.0000 across the forced-output matrix.

The next sensible experiments:

Repeat the forced-output run with topk=3 and topk=5.
Add 131072 context cells after increasing the decode token budget.
Try the dense 31B Gemma 4 target on the same B300.
Compare against vLLM's Gemma 4 MTP recipe once the exact same hardware is free.

For now, the answer to "does Gemma 4 MTP do real work on a single rented Blackwell B300?" is yes.

Annoyingly, satisfyingly, measurably yes.