GuideProduction

GLM-5.1 FP8 on Modal

Overview & Architecture

Model specs, hardware choices, and design decisions.

Production GLM-5.1-FP8 on Modal with SGLang, 8× NVIDIA B200, EAGLE speculative decoding, and BF16 KV. Traffic lands on SGLang's native OpenAI-compatible HTTP surface: no extra proxy.

01

At a glance

Specs, API surface, and how replicas scale on Modal.

Model

GLM-5.1-FP8

MoE: 754B total, ~40B active per token. 256 routed experts, 8 active. MIT license.

Context

200K in

Long context with MLA-compressed KV and DSA-style sparse attention paths in the base model. This guide caps prefill and totals lower for safety.

Engine

SGLang 0.5.10+

RadixAttention, OpenAI-compatible HTTP, EAGLE v2. Use a build with correct FP8 dequant kernels (naive load can be wrong).

Hardware

8× B200

192 GB HBM3e each, 1.5 TB VRAM total. Tensor parallel TP=8. Blackwell SM100 (kernels are architecture-locked).

Precision

FP8 E4M3 + BF16 KV

Weights FP8 for bandwidth and tensor-core throughput. KV stays BF16 for stability with EAGLE and known FP8-KV issues upstream.

API

/v1/chat/completions

Streaming, tools, reasoning_content. HTTPS to Modal, no extra proxy.

Scaling

min=0 · max=3 · down=900s

Scale-to-zero when idle, bounded replicas, 15 min scaledown window.

Why this stack exists: model lineage, precision, and what we optimize for in production.

02

Model and precision context

GLM-5.1 on Blackwell: MoE, MLA/DSA, and why FP8 weights with BF16 KV is the default recipe here.

Why this model on this stack

GLM-5.1 is a next-generation autoregressive language model from Z.ai (roots in Tsinghua KEG research). The FP8 release targets Hopper and Blackwell tensor cores: E4M3 weights cut memory traffic versus BF16 and unlock higher FP8 math throughput, with typical benchmark deltas under about 1% when kernels and scales are correct.

Multi-latent attention (MLA) shrinks KV footprint; DeepSeek-style sparse attention (DSA) trims attention cost on long sequences. Together they make a 200K context window practical on finite VRAM, even though serving still needs careful flags, BF16 KV for this recipe, and precompiled GEMM artifacts for fast cold starts.

GLM-5.1 is also aimed at long-horizon agent workflows (planning, tools, iteration). This guide focuses on production inference on Modal, not benchmark leaderboards.

FP8 E4M3

8-bit float: 1 sign, 4 exponent, 3 mantissa bits. Roughly 2× memory savings vs BF16 weights on Blackwell, native tensor cores, no emulation.

Serving requirement

Load the checkpoint only through frameworks that implement the right FP8 dequant path (here: SGLang 0.5.10+). vLLM 0.19.0+ is another option in the ecosystem; wrong loaders silently misbehave.

Mixture-of-Experts (MoE)

Total params: 754B

Active per token: ~40B

Activation ratio: ~5.3%

Only a fraction of weights fire per token. This gives massive model capacity at manageable compute cost per inference step.

Why FP8 over BF16/INT8?

Throughput: 2× on Blackwell

Memory: 50% of BF16

Accuracy delta:<1%

The E4M3 format preserves dynamic range for activation spikes while compressing weights, making it optimal for large MoE models where memory bandwidth is the primary bottleneck.

RadixAttention (SGLang)

TTFT reduction: 40–60%

Best for: Multi-turn, agents

Efficiently shares prefix KV caches across requests. For agents or system prompts that repeat, this dramatically reduces time-to-first-token.

Reasoning & Tools

Reasoning parser: glm45

Tool parser: glm47

Built-in chain-of-thought routing separates reasoning from final answers. Deterministic tool-calling via GLM's native templates, exposed through the OpenAI-compatible API.

FP8 E4M3 Format Deep Dive

Structure: 1 sign bit, 4 exponent bits, 3 mantissa bits. The 4 exponent bits provide sufficient dynamic range (±448) to handle activation spikes in transformer layers.
Hardware support:Blackwell's 5th-gen Tensor Cores have native FP8 matrix-multiply-accumulate (MMA) instructions. No software emulation, no quant/dequant overhead during compute.
Memory bandwidth: LLM inference is memory-bound. Half the bytes per weight means double the effective bandwidth, which directly translates to higher tokens/second.
03

Infrastructure topology

Modal volumes, GPU class, and SGLang as the single HTTP front, from weights to clients.

Two Modal volumes isolate slow I/O from the GPU path: weights under /model-cache, DeepGEMM cache under /dg-cache. The container reloads both on boot for consistent volume metadata.

Modal

  • 8× NVIDIA B200 (Blackwell)
  • 1.5 TB VRAM total
  • Scale-to-zero when idle
glm51-model-weights

~700 GB FP8 safetensors at /model-cache (HF, Xet).

glm51-deepgemm-cache

Precompiled SM100 FP8 GEMM at /dg-cache (B200-specific).

SGLang TP=8

  • EAGLE v2 speculative decode (MTP head in checkpoint)
  • TRT-LLM NSA backends for Blackwell prefill and decode
  • BF16 KV (avoids EAGLE + FP8 KV failure modes)
  • glm45 reasoning parser, glm47 tool parser

Crash monitor thread to os._exit(1) if child dies

Log stream thread to Modal dashboard

@modal.web_server to port 8000 (no sidecar)

Clients

OpenAI SDK, curl, agents

HTTPS to /v1/chat/completions

04

Serving and optimization stack

SGLang, EAGLE, DeepGEMM, and BF16 KV: how the pieces fit together.

SGLang v0.5.10+

Primary Engine

RadixAttention shares prefix KV across requests (40–60% TTFT reduction for agents and fixed system prompts). 29% higher throughput than vLLM on MoE architectures in independent benchmarks.

  • • Native OpenAI-compatible HTTP — no proxy needed
  • • Built-in EAGLE v2 speculative decoding
  • • Modal recommends for online workloads

EAGLE v2 Speculative Decoding

2.6× Speedup

Lightweight Multi-Token Prediction (MTP) head trained alongside the base model drafts 3–4 tokens ahead. The base model verifies drafts in parallel.

  • • TPOT: ~20 ms → ~7.7 ms
  • • Accept length: ~3.5 tokens average
  • • No separate draft model required
  • • Trade-off: TTFT inflates above ~50 concurrent requests

DeepGEMM JIT Compilation

SM100 Kernels

CUDA kernel library optimized for FP8 General Matrix Multiplication on Blackwell. Compiles kernels at runtime for specific SM architecture (sm100) and matrix shapes.

  • • Pre-compiled into Modal Volume — no 10–15 min JIT on cold start
  • • Architecture lock: B200 kernels ≠ H200, A100, T4
  • • Compilation must match serving hardware exactly

BF16 KV Cache (Not FP8)

Stability Choice

Despite FP8 weights, KV cache is explicitly BF16. This is a deliberate mitigation for known upstream issues, not an oversight.

  • • #22359: EAGLE + FP8 KV crashes on Blackwell
  • • #21291: flashmla_kv accuracy degradation on B200
  • • #17526: FP8 KV 10% slower due to quant overhead

TRT-LLM NSA/DSA Backends

NVIDIA Sparse Attention (NSA): Optimized for long-context and sparse patterns, reducing memory bandwidth pressure. The --attention-backend nsa flag enables this on Blackwell.
TensorRT-LLM kernels: For both decode (--nsa-decode-backend trtllm) and prefill (--nsa-prefill-backend trtllm), which are mathematically stable on B200 and fix accuracy drops observed in other backends.
05

Scale-to-zero profile

Autoscaler knobs you will see in deploy.py; confirm values for your workload.

Modal autoscaler knobs for this profile. Values are illustrative of the guide deployment; confirm in your own deploy.py. Cost and cold-start implications are covered in Tune & Operate.

min_containers
0
Scale to zero when idle. Lowest baseline cost, longest cold path.
max_containers
3
Caps replicas. With 48 concurrent slots per replica, up to 144 concurrent requests across three replicas.
scaledown_window
900
Seconds idle before scale-in (15 min). Smooths bursty traffic without immediate teardown.
max_inputs
20
Per-container queue depth at the Modal layer to avoid sudden spikes starving memory.
timeout
86400
24h container lifetime to mitigate slow leaks or fragmentation on very long runs.
06

Performance snapshot

Order-of-magnitude decode and latency numbers; validate on your traffic mix.

Order-of-magnitude numbers from internal runs and SGLang cookbook-style measurements. Validate on your own mix of concurrency, context length, and tools. For tuning detail and cold-start trade-offs, see Tune & Operate.

TPOT (approx.)

Baseline ~20 ms
EAGLE ~7.7 ms

EAGLE verify path.

Decode throughput (aggregate)

Baseline ~1,750 tok/s
EAGLE 4,600+ tok/s

TTFT (warm, low concurrency)

~246 ms

EAGLE can inflate TTFT when concurrency per replica is very high; cap max running requests accordingly.

EAGLE accept length

~3.5 tokens

Typical draft acceptance.

07

Design choices

Engine, GPU, KV, speculative decode, and images: the decisions this guide encodes.

Engine

SGLang

Strong MoE throughput in many setups versus vLLM, native EAGLE and RadixAttention, and a Modal-friendly single HTTP process.

GPU

8×B200

192 GB per GPU headroom for BF16 KV and long context. TRT-LLM DSA and NSA paths target Blackwell SM100.

KV cache

BF16

Avoids FP8 KV plus EAGLE crashes (#22359), extra overhead (#17526), and some decode accuracy issues (#21291).

Spec decode

EAGLE v2

Large decode speedup with an MTP head shipped in the checkpoint: no separate draft model to deploy.

Proxy

None

SGLang speaks OpenAI-compatible JSON; @modal.web_server routes straight to port 8000.

Download image

debian-slim + HF Hub

CPU-only weight sync avoids pulling a multi-gigabyte CUDA image for I/O-only tasks.

Serve image

lmsysorg/sglang:latest

Bundles CUDA, SGLang, and DeepGEMM. Do not overwrite huggingface-hub in the image (import compatibility with sync jobs).