GuideProduction

GLM-5.1 FP8 on Modal

Overview & Architecture

Model specs, hardware choices, and design decisions.

Production GLM-5.1-FP8 on Modal with SGLang, 8× NVIDIA B200, EAGLE speculative decoding, and BF16 KV. Traffic lands on SGLang's native OpenAI-compatible HTTP surface: no extra proxy.

At a glance

Specs, API surface, and how replicas scale on Modal.

Model

GLM-5.1-FP8

MoE: 754B total, ~40B active per token. 256 routed experts, 8 active. MIT license.

Context

200K in

Long context with MLA-compressed KV and DSA-style sparse attention paths in the base model. This guide caps prefill and totals lower for safety.

Engine

SGLang 0.5.10+

RadixAttention, OpenAI-compatible HTTP, EAGLE v2. Use a build with correct FP8 dequant kernels (naive load can be wrong).

Hardware

8× B200

192 GB HBM3e each, 1.5 TB VRAM total. Tensor parallel TP=8. Blackwell SM100 (kernels are architecture-locked).

Precision

FP8 E4M3 + BF16 KV

Weights FP8 for bandwidth and tensor-core throughput. KV stays BF16 for stability with EAGLE and known FP8-KV issues upstream.

API

/v1/chat/completions

Streaming, tools, reasoning_content. HTTPS to Modal, no extra proxy.

Scaling

min=0 · max=3 · down=900s

Scale-to-zero when idle, bounded replicas, 15 min scaledown window.

Why this stack exists: model lineage, precision, and what we optimize for in production.

Model and precision context

GLM-5.1 on Blackwell: MoE, MLA/DSA, and why FP8 weights with BF16 KV is the default recipe here.

Why this model on this stack

GLM-5.1 is a next-generation autoregressive language model from Z.ai (roots in Tsinghua KEG research). The FP8 release targets Hopper and Blackwell tensor cores: E4M3 weights cut memory traffic versus BF16 and unlock higher FP8 math throughput, with typical benchmark deltas under about 1% when kernels and scales are correct.

Multi-latent attention (MLA) shrinks KV footprint; DeepSeek-style sparse attention (DSA) trims attention cost on long sequences. Together they make a 200K context window practical on finite VRAM, even though serving still needs careful flags, BF16 KV for this recipe, and precompiled GEMM artifacts for fast cold starts.

GLM-5.1 is also aimed at long-horizon agent workflows (planning, tools, iteration). This guide focuses on production inference on Modal, not benchmark leaderboards.

FP8 E4M3

8-bit float: 1 sign, 4 exponent, 3 mantissa bits. Roughly 2× memory savings vs BF16 weights on Blackwell, native tensor cores, no emulation.

Serving requirement

Load the checkpoint only through frameworks that implement the right FP8 dequant path (here: SGLang 0.5.10+). vLLM 0.19.0+ is another option in the ecosystem; wrong loaders silently misbehave.

Mixture-of-Experts (MoE)

Total params: 754B

Active per token: ~40B

Activation ratio: ~5.3%

Only a fraction of weights fire per token. This gives massive model capacity at manageable compute cost per inference step.

Why FP8 over BF16/INT8?

Throughput: 2× on Blackwell

Memory: 50% of BF16

Accuracy delta:<1%

The E4M3 format preserves dynamic range for activation spikes while compressing weights, making it optimal for large MoE models where memory bandwidth is the primary bottleneck.

RadixAttention (SGLang)

TTFT reduction: 40–60%

Best for: Multi-turn, agents

Efficiently shares prefix KV caches across requests. For agents or system prompts that repeat, this dramatically reduces time-to-first-token.

Reasoning & Tools

Reasoning parser: glm45

Tool parser: glm47

Built-in chain-of-thought routing separates reasoning from final answers. Deterministic tool-calling via GLM's native templates, exposed through the OpenAI-compatible API.

FP8 E4M3 Format Deep Dive

Structure: 1 sign bit, 4 exponent bits, 3 mantissa bits. The 4 exponent bits provide sufficient dynamic range (±448) to handle activation spikes in transformer layers.

Hardware support:Blackwell's 5th-gen Tensor Cores have native FP8 matrix-multiply-accumulate (MMA) instructions. No software emulation, no quant/dequant overhead during compute.

Memory bandwidth: LLM inference is memory-bound. Half the bytes per weight means double the effective bandwidth, which directly translates to higher tokens/second.

Infrastructure topology

Modal volumes, GPU class, and SGLang as the single HTTP front, from weights to clients.

Two Modal volumes isolate slow I/O from the GPU path: weights under /model-cache, DeepGEMM cache under /dg-cache. The container reloads both on boot for consistent volume metadata.

Modal

▸8× NVIDIA B200 (Blackwell)
▸1.5 TB VRAM total
▸Scale-to-zero when idle

glm51-model-weights

~700 GB FP8 safetensors at /model-cache (HF, Xet).

glm51-deepgemm-cache

Precompiled SM100 FP8 GEMM at /dg-cache (B200-specific).

SGLang TP=8

●EAGLE v2 speculative decode (MTP head in checkpoint)
●TRT-LLM NSA backends for Blackwell prefill and decode
●BF16 KV (avoids EAGLE + FP8 KV failure modes)
●glm45 reasoning parser, glm47 tool parser

Crash monitor thread to os._exit(1) if child dies

Log stream thread to Modal dashboard

@modal.web_server to port 8000 (no sidecar)

Clients

OpenAI SDK, curl, agents

HTTPS to /v1/chat/completions

Serving and optimization stack

SGLang, EAGLE, DeepGEMM, and BF16 KV: how the pieces fit together.

SGLang v0.5.10+

Primary Engine

RadixAttention shares prefix KV across requests (40–60% TTFT reduction for agents and fixed system prompts). 29% higher throughput than vLLM on MoE architectures in independent benchmarks.

• Native OpenAI-compatible HTTP — no proxy needed
• Built-in EAGLE v2 speculative decoding
• Modal recommends for online workloads

EAGLE v2 Speculative Decoding

2.6× Speedup

Lightweight Multi-Token Prediction (MTP) head trained alongside the base model drafts 3–4 tokens ahead. The base model verifies drafts in parallel.

• TPOT: ~20 ms → ~7.7 ms
• Accept length: ~3.5 tokens average
• No separate draft model required
• Trade-off: TTFT inflates above ~50 concurrent requests

DeepGEMM JIT Compilation

SM100 Kernels

CUDA kernel library optimized for FP8 General Matrix Multiplication on Blackwell. Compiles kernels at runtime for specific SM architecture (sm100) and matrix shapes.

• Pre-compiled into Modal Volume — no 10–15 min JIT on cold start
• Architecture lock: B200 kernels ≠ H200, A100, T4
• Compilation must match serving hardware exactly

BF16 KV Cache (Not FP8)

Stability Choice

Despite FP8 weights, KV cache is explicitly BF16. This is a deliberate mitigation for known upstream issues, not an oversight.

• #22359: EAGLE + FP8 KV crashes on Blackwell
• #21291: flashmla_kv accuracy degradation on B200
• #17526: FP8 KV 10% slower due to quant overhead

TRT-LLM NSA/DSA Backends

NVIDIA Sparse Attention (NSA): Optimized for long-context and sparse patterns, reducing memory bandwidth pressure. The --attention-backend nsa flag enables this on Blackwell.

TensorRT-LLM kernels: For both decode (--nsa-decode-backend trtllm) and prefill (--nsa-prefill-backend trtllm), which are mathematically stable on B200 and fix accuracy drops observed in other backends.

Scale-to-zero profile

Autoscaler knobs you will see in deploy.py; confirm values for your workload.

Modal autoscaler knobs for this profile. Values are illustrative of the guide deployment; confirm in your own deploy.py. Cost and cold-start implications are covered in Tune & Operate.

min_containers: 0; Scale to zero when idle. Lowest baseline cost, longest cold path.
max_containers: 3; Caps replicas. With 48 concurrent slots per replica, up to 144 concurrent requests across three replicas.
scaledown_window: 900; Seconds idle before scale-in (15 min). Smooths bursty traffic without immediate teardown.
max_inputs: 20; Per-container queue depth at the Modal layer to avoid sudden spikes starving memory.
timeout: 86400; 24h container lifetime to mitigate slow leaks or fragmentation on very long runs.

Performance snapshot

Order-of-magnitude decode and latency numbers; validate on your traffic mix.

Order-of-magnitude numbers from internal runs and SGLang cookbook-style measurements. Validate on your own mix of concurrency, context length, and tools. For tuning detail and cold-start trade-offs, see Tune & Operate.

TPOT (approx.)

Baseline ~20 ms

EAGLE ~7.7 ms

EAGLE verify path.

Decode throughput (aggregate)

Baseline ~1,750 tok/s

EAGLE 4,600+ tok/s

TTFT (warm, low concurrency)

~246 ms

EAGLE can inflate TTFT when concurrency per replica is very high; cap max running requests accordingly.

EAGLE accept length

~3.5 tokens

Typical draft acceptance.

Design choices

Engine, GPU, KV, speculative decode, and images: the decisions this guide encodes.

Engine

SGLang

Strong MoE throughput in many setups versus vLLM, native EAGLE and RadixAttention, and a Modal-friendly single HTTP process.

GPU

8×B200

192 GB per GPU headroom for BF16 KV and long context. TRT-LLM DSA and NSA paths target Blackwell SM100.

KV cache

BF16

Avoids FP8 KV plus EAGLE crashes (#22359), extra overhead (#17526), and some decode accuracy issues (#21291).

Spec decode

EAGLE v2

Large decode speedup with an MTP head shipped in the checkpoint: no separate draft model to deploy.

Proxy

None

SGLang speaks OpenAI-compatible JSON; @modal.web_server routes straight to port 8000.

Download image

debian-slim + HF Hub

CPU-only weight sync avoids pulling a multi-gigabyte CUDA image for I/O-only tasks.

Serve image

lmsysorg/sglang:latest

Bundles CUDA, SGLang, and DeepGEMM. Do not overwrite huggingface-hub in the image (import compatibility with sync jobs).

Continue the guide

Cold starts, upstream mitigations, diagnostics, and full flag tables live in later sections.

Deployment Pipeline

One-time setup tasks, volumes, and verification steps.

Open section

Configuration & Flags

SGLang flags, API secrets, and Modal decorators.

Open section

Tune & Operate

Performance tuning, cold starts, diagnostics, and warning triage.

Open section

Code Walkthrough

Annotated deploy.py: lifecycle, monitoring, and graceful shutdown.

Open section

External references