Blackwell Shenanigans 003: Gemma 4 MTP on a B300
Google shipped Gemma 4 assistant drafters, SGLang merged Frozen-KV MTP support, and a single B300 turned it into a real 1.6x decode-speed win.
The brief was wonderfully irresponsible in the useful way: Google had released Gemma 4 assistant drafters, SGLang had a fresh Gemma 4 MTP path, and we had a single Yotta B300 with 275,040 MiB (roughly 269 GiB) of VRAM sitting there looking too expensive to leave idle.
So we did the obvious thing: put google/gemma-4-26B-A4B-it on the B300, pair it with google/gemma-4-26B-A4B-it-assistant, serve it through SGLang, and make a little race dashboard because watching throughput tick upward is more emotionally honest than pretending this is normal lab behavior.
The useful result:
Gemma 4 MTP averaged 1.628x faster decode throughput than target-only SGLang across the tested matrix.
No multi-GPU heroics. No vendor endpoint. One NVIDIA B300 SXM6 AC, one SGLang server at a time, baseline first, MTP second.
What Changed Upstream
SGLang PR #24436 added Gemma 4 MTP support and was merged on 2026-05-07.
The important bit is that Gemma 4's assistant checkpoints are not ordinary EAGLE3-style draft models. The SGLang PR introduces a FROZEN_KV_MTP speculative path where the assistant uses the target model's KV cache instead of owning a separate KV cache.
In practice, the launcher can still pass:
--speculative-algorithm NEXTN
--speculative-draft-model-path google/gemma-4-26B-A4B-it-assistant
--speculative-num-steps 5
--speculative-num-draft-tokens 6
--speculative-eagle-topk 1
SGLang recognizes the Gemma 4 assistant and promotes the path internally to the Gemma-specific Frozen-KV MTP implementation.
Hardware And Stack
The test box was a Yotta B300 instance:
GPU: NVIDIA B300 SXM6 AC
VRAM: 275040 MiB
Driver: 580.126.09
OS: Ubuntu 22.04.5
Python: 3.10.12
The serving stack was pinned instead of trusting whatever the day had lying around:
SGLang: 0.5.12.dev155+gbcf8d100d
SGLang commit: bcf8d100d06a70009f5b591652464f1e7ff86116
Transformers: 5.8.0.dev0
Transformers commit: 2c7d385621c80fee70c1472f3a622fcba2c93fb9
Torch: 2.11.0+cu130
CUDA home: /usr/local/cuda-13.0
B300 reports sm_103, so the CUDA 12.8 toolchain was not enough for the JIT path. The final box needed CUDA 13 NVCC plus the CUDA 13 cuRAND development package:
apt-get install -y cuda-nvcc-13-0 libcurand-dev-13-0 protobuf-compiler
That was not glamorous. It was, however, the difference between "Blackwell benchmark" and "compiler error wearing a lab coat."
The Actual SGLang Config
Baseline used the same target model without speculative flags:
sglang serve \
--model-path google/gemma-4-26B-A4B-it \
--served-model-name google/gemma-4-26B-A4B-it \
--host 0.0.0.0 \
--port 30000 \
--tp 1 \
--trust-remote-code \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--mem-fraction-static 0.85 \
--context-length 262144 \
--max-running-requests 48 \
--enable-metrics
The MTP run added:
--speculative-algorithm NEXTN
--speculative-draft-model-path google/gemma-4-26B-A4B-it-assistant
--speculative-num-steps 5
--speculative-num-draft-tokens 6
--speculative-eagle-topk 1
The launch environment also mattered:
CUDA_HOME=/usr/local/cuda-13.0
FLASHINFER_DISABLE_VERSION_CHECK=1
SAFETENSORS_FAST_GPU=1
SGLANG_DISABLE_DEEP_GEMM=1
SGLANG_ENABLE_DEEP_GEMM=0
SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
OMP_NUM_THREADS=8
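To see how the baseline and MTP launches differ only by the five speculative flags, here is a minimal Python sketch that assembles both command lines and the environment. The flags and variables are copied from the lists above; the helper names `build_serve_argv` and `build_env` are hypothetical and are not taken from the real launcher script.

```python
import os

TARGET = "google/gemma-4-26B-A4B-it"
DRAFT = "google/gemma-4-26B-A4B-it-assistant"

# Environment variables listed in the post, as plain strings.
LAUNCH_ENV = {
    "CUDA_HOME": "/usr/local/cuda-13.0",
    "FLASHINFER_DISABLE_VERSION_CHECK": "1",
    "SAFETENSORS_FAST_GPU": "1",
    "SGLANG_DISABLE_DEEP_GEMM": "1",
    "SGLANG_ENABLE_DEEP_GEMM": "0",
    "SGLANG_JIT_DEEPGEMM_PRECOMPILE": "0",
    "OMP_NUM_THREADS": "8",
}


def build_serve_argv(mtp: bool) -> list[str]:
    """Return the serve command; MTP only appends the speculative flags."""
    argv = [
        "sglang", "serve",
        "--model-path", TARGET,
        "--served-model-name", TARGET,
        "--host", "0.0.0.0",
        "--port", "30000",
        "--tp", "1",
        "--trust-remote-code",
        "--reasoning-parser", "gemma4",
        "--tool-call-parser", "gemma4",
        "--mem-fraction-static", "0.85",
        "--context-length", "262144",
        "--max-running-requests", "48",
        "--enable-metrics",
    ]
    if mtp:
        argv += [
            "--speculative-algorithm", "NEXTN",
            "--speculative-draft-model-path", DRAFT,
            "--speculative-num-steps", "5",
            "--speculative-num-draft-tokens", "6",
            "--speculative-eagle-topk", "1",
        ]
    return argv


def build_env(base=None) -> dict[str, str]:
    """Overlay the launch variables on a base environment (os.environ by default)."""
    env = dict(os.environ if base is None else base)
    env.update(LAUNCH_ENV)
    return env
```

The result could be handed to `subprocess.Popen(build_serve_argv(mtp=True), env=build_env())`, one server at a time, which mirrors the baseline-first, MTP-second order of the run.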
The actual launcher and helper scripts that produced the run are here:
- serve-gemma4-sglang.sh
- bench-gemma4-mtp-sglang.sh
- benchmark_sglang.py
- install-gemma4-mtp-sglang.sh
- smoke_test_gemma4_openai.py
- gemma4_derby_dashboard.py
The Benchmark Command
This was the benchmark shape for the result below:
VARIANT=26B-A4B \
MODEL=google/gemma-4-26B-A4B-it \
TP_SIZE=1 \
HOST=127.0.0.1 \
PORT=30000 \
CONCURRENCY=1,4,8 \
CONTEXTS=0,8192,32768 \
DURATION=30 \
MAX_TOKENS=8192 \
IGNORE_EOS=1 \
MIN_TOKENS=8192 \
WAIT_TIMEOUT=3600 \
MEM_FRACTION_STATIC=0.85 \
MAX_RUNNING_REQUESTS=48 \
scripts/bench-gemma4-mtp-sglang.sh
The dashboard watched:
results/gemma4-mtp-sglang/baseline.live.json
results/gemma4-mtp-sglang/mtp_topk1.live.json
results/gemma4-mtp-sglang/logs/baseline.log
results/gemma4-mtp-sglang/logs/mtp_topk1.log
The final public artifacts are here:
Why Forced Decode Was Necessary
The first pass used MAX_TOKENS=2048 and let the model stop naturally. That produced a tempting but broken comparison: several long-context decode cells ended after roughly one second because the model simply answered and stopped.
That is not a throughput benchmark. That is Gemma politely leaving the room.
The final pass set:
IGNORE_EOS=1
MIN_TOKENS=8192
MAX_TOKENS=8192
That forced the decode cells to run for the configured window, except for one fast 0-context, concurrency=1 MTP cell that hit the 8192-token cap at about 25.2s. Every other comparison cell ran for roughly 30s.
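As a rough illustration of what those three settings mean at the request level, the sketch below builds a request body for SGLang's native `/generate` endpoint. It assumes the sampling parameter names `max_new_tokens`, `min_new_tokens`, and `ignore_eos`; the benchmark script may well use the OpenAI-compatible route with different field names, and `forced_decode_payload` is a hypothetical helper.

```python
def forced_decode_payload(prompt: str, n_tokens: int = 8192) -> dict:
    """Build a request body that pins the output length to exactly n_tokens.

    Assumption: SGLang native /generate request with a sampling_params dict.
    """
    return {
        "text": prompt,
        "sampling_params": {
            # Equal min and max means the request cannot finish early.
            "max_new_tokens": n_tokens,
            "min_new_tokens": n_tokens,
            # Keep decoding even if the model emits its end-of-sequence token.
            "ignore_eos": True,
        },
    }
```

With the minimum and maximum equal and EOS ignored, a cell can only end by hitting the token cap or the benchmark's time window, never by the model deciding it is done.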
Results
Forced-output aggregate decode throughput:
| Context tokens | Concurrency | Baseline tok/s | MTP tok/s | Speedup | MTP accept rate |
|---|---|---|---|---|---|
| 0 | 1 | 188.961 | 352.938 | 1.868x | 1.0000 |
| 0 | 4 | 592.781 | 1013.181 | 1.709x | 0.9600 |
| 0 | 8 | 1035.166 | 1785.913 | 1.725x | 0.8431 |
| 8192 | 1 | 171.862 | 288.548 | 1.679x | 0.9700 |
| 8192 | 4 | 560.446 | 792.961 | 1.415x | 0.8875 |
| 8192 | 8 | 974.841 | 1784.425 | 1.830x | 0.9300 |
| 32768 | 1 | 135.122 | 189.541 | 1.403x | 0.9950 |
| 32768 | 4 | 475.501 | 744.750 | 1.566x | 1.0000 |
| 32768 | 8 | 881.758 | 1284.233 | 1.456x | 0.9012 |
Average speedup across the matrix: 1.628x.
Minimum speedup: 1.403x.
Maximum speedup: 1.868x.
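The summary figures can be rechecked directly from the table. This short sketch recomputes each speedup as MTP tok/s divided by baseline tok/s, then takes the unweighted mean, minimum, and maximum across the nine cells. The numbers are copied from the table; the names `ROWS` and `speedups` are only for illustration.

```python
from statistics import mean

# (context_tokens, concurrency, baseline_tok_s, mtp_tok_s) from the results table
ROWS = [
    (0, 1, 188.961, 352.938),
    (0, 4, 592.781, 1013.181),
    (0, 8, 1035.166, 1785.913),
    (8192, 1, 171.862, 288.548),
    (8192, 4, 560.446, 792.961),
    (8192, 8, 974.841, 1784.425),
    (32768, 1, 135.122, 189.541),
    (32768, 4, 475.501, 744.750),
    (32768, 8, 881.758, 1284.233),
]


def speedups(rows):
    """Per-cell speedup: MTP throughput over baseline throughput."""
    return [mtp / base for _ctx, _conc, base, mtp in rows]


ratios = speedups(ROWS)
summary = {
    "mean": round(mean(ratios), 3),  # unweighted average over cells
    "min": round(min(ratios), 3),
    "max": round(max(ratios), 3),
}
print(summary)
```

Note that the average is a plain mean of per-cell ratios, so every cell counts equally regardless of how many tokens it produced.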
That is a real result. The interesting part is not just the zero-context speedup, either. The 32k context cells still improved:
32k / C1: 1.403x
32k / C4: 1.566x
32k / C8: 1.456x
The earlier long-context cliff disappeared once the benchmark forced sustained decode.
Why There Are Zero-Context Rows
The 0 context rows are not the target workload. They are the control lane.
They answer: how much does Gemma 4 MTP help pure decode before cached-prefix attention, KV-cache pressure, and long-context memory traffic enter the story?
There are three zero-context rows because the second axis is concurrency:
context axis: 0, 8k, 32k
concurrency axis: 1, 4, 8
So 0/C1, 0/C4, and 0/C8 are three different serving situations, not three copies of the same test. That matters because speculative decoding is batch-sensitive. The control lane tells us whether MTP is helping the decode loop itself, and the 8k/32k lanes tell us how much of that survives once context is involved.
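The two axes expand into a nine-cell grid. A small sketch of that expansion, assuming the benchmark simply takes the cross product of `CONTEXTS` and `CONCURRENCY` (the real script's iteration order is not shown here, and `cells` and `label` are hypothetical helpers):

```python
from itertools import product

CONTEXTS = [0, 8192, 32768]
CONCURRENCY = [1, 4, 8]


def cells(contexts, concurrency):
    """Every (context, concurrency) pair in the benchmark matrix."""
    return list(product(contexts, concurrency))


def label(context: int, conc: int) -> str:
    """Short cell name in the style used in this post, e.g. '32k/C8'."""
    ctx = "0" if context == 0 else f"{context // 1024}k"
    return f"{ctx}/C{conc}"


matrix = cells(CONTEXTS, CONCURRENCY)
control_lane = [label(ctx, c) for ctx, c in matrix if ctx == 0]
print(len(matrix), control_lane)
```

Filtering on `ctx == 0` picks out the control lane, which is why it contains one row per concurrency level rather than a single row.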
What Happened At 100k?
We did not run a 100k or 128k decode cell in the final comparison matrix.
The benchmark did run prefill/TTFT checks up to 128k before decode:
| Context tokens | Baseline TTFT | Baseline prefill tok/s | MTP TTFT | MTP prefill tok/s |
|---|---|---|---|---|
| 8192 | 0.137s | 73,160 | 0.131s | 76,424 |
| 16384 | 0.319s | 55,739 | 0.315s | 56,335 |
| 32768 | 1.027s | 32,692 | 1.018s | 32,949 |
| 65536 | 3.839s | 17,185 | 3.831s | 17,215 |
| 131072 | 14.717s | 8,921 | 14.671s | 8,949 |
That proves the long-prefix ingest path worked on this B300. It does not prove decode throughput at 100k+ context under concurrency.
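One quick way to read the table is to compare MTP TTFT against baseline TTFT at each context length. The sketch below does only that, using the TTFT values from the table; `ttft_ratios` is an illustrative name.

```python
# (context_tokens, baseline_ttft_s, mtp_ttft_s) from the prefill table
PREFILL = [
    (8192, 0.137, 0.131),
    (16384, 0.319, 0.315),
    (32768, 1.027, 1.018),
    (65536, 3.839, 3.831),
    (131072, 14.717, 14.671),
]


def ttft_ratios(rows):
    """MTP TTFT divided by baseline TTFT, keyed by context length."""
    return {ctx: mtp / base for ctx, base, mtp in rows}


for ctx, ratio in ttft_ratios(PREFILL).items():
    print(f"{ctx:>6} tokens: MTP TTFT is {ratio:.3f}x baseline")
```

Every ratio lands between 0.95 and 1.00, which suggests the assistant leaves prefill essentially unchanged in these single measurements; the decode path is where the two configurations diverge.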
The next long-context race card should be 64k and 128k decode with lower concurrency first:
CONTEXTS=0,32768,65536,131072
CONCURRENCY=1,2,4
DURATION=45
MAX_TOKENS=8192
IGNORE_EOS=1
MIN_TOKENS=8192
I would not jump straight to 128k/C8 until the token budget is checked. concurrency * (context + max_tokens) gets impolite very quickly.
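A back-of-the-envelope version of that check, as a sketch. It only evaluates the formula above for a few cells; it does not know the server's real KV-cache capacity, which would have to come from the SGLang startup logs, and `token_budget` is a hypothetical helper.

```python
def token_budget(concurrency: int, context: int, max_tokens: int) -> int:
    """Worst-case resident tokens: concurrency * (context + max_tokens)."""
    return concurrency * (context + max_tokens)


MAX_TOKENS = 8192

measured_worst = token_budget(8, 32768, MAX_TOKENS)    # 32k / C8, already run
proposed_worst = token_budget(4, 131072, MAX_TOKENS)   # 128k / C4, next race card
tempting_cell = token_budget(8, 131072, MAX_TOKENS)    # 128k / C8, not yet

print(f"32k/C8:  {measured_worst:,} tokens")
print(f"128k/C4: {proposed_worst:,} tokens")
print(f"128k/C8: {tempting_cell:,} tokens")
```

The 128k/C8 cell needs 1,114,112 resident tokens, about 3.4 times the 327,680 of the largest cell measured so far, which is the argument for stepping up through C1, C2, and C4 first.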
What I Believe
This is worth treating as a practical serving knob, not just a fun PR stunt.
The B300 had enough memory to serve the 26B-A4B target plus assistant on one GPU with TP_SIZE=1. The MTP path loaded cleanly, captured the Frozen-KV draft CUDA graphs, passed smoke tests, and produced better throughput across every measured cell.
The acceptance rates were also strong. High throughput with terrible acceptance is just a benchmark wearing fake glasses. This run had 0.84-1.00 accept rate across the forced-output matrix.
The next sensible experiments:
- Repeat the forced-output run with topk=3 and topk=5.
- Add 131072 context cells after increasing the decode token budget.
- Try the dense 31B Gemma 4 target on the same B300.
- Compare against vLLM's Gemma 4 MTP recipe once the exact same hardware is free.
For now, the answer to "does Gemma 4 MTP do real work on a single rented Blackwell B300?" is yes.
Annoyingly, satisfyingly, measurably yes.