Production Guide

GLM-5.1 FP8 on Modal

Overview & Architecture

Model specs, hardware choices, and design decisions.

Production GLM-5.1-FP8 on Modal with SGLang, 8× NVIDIA B200, EAGLE speculative decoding, and BF16 KV. Traffic lands on SGLang's native OpenAI-compatible HTTP surface: no extra proxy.

01

At a glance

Specs, API surface, and how replicas scale on Modal.

Model

GLM-5.1-FP8

MoE: 754B total, ~40B active per token. 256 routed experts, 8 active. MIT license.

Context

200K in

Long context with MLA-compressed KV and DSA-style sparse attention paths in the base model. This guide caps prefill and total context lengths lower for safety.

Engine

SGLang 0.5.10+

RadixAttention, OpenAI-compatible HTTP, EAGLE v2. Use a build with correct FP8 dequant kernels; a naive loader can silently produce wrong outputs.

Hardware

8× B200

192 GB HBM3e each, 1.5 TB VRAM total. Tensor parallel TP=8. Blackwell SM100 (kernels are architecture-locked).

Precision

FP8 E4M3 + BF16 KV

Weights FP8 for bandwidth and tensor-core throughput. KV stays BF16 for stability with EAGLE and known FP8-KV issues upstream.

API

/v1/chat/completions

Streaming, tools, reasoning_content. HTTPS to Modal, no extra proxy.
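A minimal request body for this endpoint can be sketched as follows. The base URL and model name are placeholders for your own Modal deployment; the fields follow the standard OpenAI chat-completions schema the server exposes.

```python
import json

# Hypothetical deployment URL -- substitute your own Modal endpoint.
BASE_URL = "https://your-workspace--glm51.modal.run"

def build_chat_request(prompt: str, stream: bool = True) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": "glm-5.1",  # model name as exposed by the server (placeholder)
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,    # token-by-token streaming
        "max_tokens": 1024,
    }

payload = build_chat_request("Summarize the deployment topology.")
body = json.dumps(payload)  # POST this to f"{BASE_URL}/v1/chat/completions"
```

Any OpenAI SDK pointed at `BASE_URL` with this payload shape will work; no proxy translation is needed.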

Scaling

min=0 · max=3 · down=900s

Scale-to-zero when idle, bounded replicas, 15 min scaledown window.

Why this stack exists: model lineage, precision, and what we optimize for in production.

02

Model and precision context

GLM-5.1 on Blackwell: MoE, MLA/DSA, and why FP8 weights with BF16 KV is the default recipe here.

Why this model on this stack

GLM-5.1 is a large MoE autoregressive model from Z.ai (roots in Tsinghua KEG research). The FP8 release targets Hopper and Blackwell tensor cores: E4M3 weights cut memory traffic versus BF16 and unlock higher FP8 math throughput, with typical benchmark deltas under about 1% when kernels and scales are correct.

Multi-latent attention (MLA) shrinks KV footprint; DeepSeek-style sparse attention (DSA) trims attention cost on long sequences. Together they make a 200K context window practical on finite VRAM, even though serving still needs careful flags, BF16 KV for this recipe, and precompiled GEMM artifacts for fast cold starts.

GLM-5.1 is also aimed at long-horizon agent workflows (planning, tools, iteration). This guide focuses on production inference on Modal, not benchmark leaderboards.

FP8 E4M3

8-bit float: 1 sign, 4 exponent, 3 mantissa bits. Roughly 2× memory savings vs BF16 weights on Blackwell, native tensor cores, no emulation.
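The bit layout above can be checked with a tiny decoder. This sketch follows the OCP FP8 E4M3 convention (bias 7, no infinities, only the all-ones exponent-and-mantissa pattern reserved for NaN):

```python
def decode_e4m3(byte: int) -> float:
    """Decode one OCP FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF      # 4-bit exponent, bias 7
    man = byte & 0x7             # 3-bit mantissa
    if exp == 0xF and man == 0x7:
        return float("nan")      # E4M3 reserves only this pattern for NaN
    if exp == 0:                 # subnormal: no implicit leading 1
        return sign * (man / 8) * 2.0 ** -6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

# 0b0_0111_000 -> unbiased exponent 0, mantissa 0 -> 1.0
assert decode_e4m3(0b00111000) == 1.0
# 0b0_1111_110 -> the E4M3 maximum, 448.0
assert decode_e4m3(0b01111110) == 448.0
```

The narrow dynamic range (max 448) is why per-tensor or per-block scales matter: a loader that skips them produces plausible-looking but wrong weights.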

Serving requirement

Load the checkpoint only through frameworks that implement the right FP8 dequant path (here: SGLang 0.5.10+). vLLM 0.19.0+ is another option in the ecosystem; wrong loaders silently misbehave.

03

Infrastructure topology

Modal volumes, GPU class, and SGLang as the single HTTP front, from weights to clients.

Two Modal volumes isolate slow I/O from the GPU path: weights under /model-cache, the DeepGEMM cache under /dg-cache. The container reloads both volumes on boot so metadata stays consistent.

Modal

  • 8× NVIDIA B200 (Blackwell)
  • 1.5 TB VRAM total
  • Scale-to-zero when idle
glm51-model-weights

~700 GB FP8 safetensors at /model-cache (HF, Xet).

glm51-deepgemm-cache

Precompiled SM100 FP8 GEMM at /dg-cache (B200-specific).

SGLang TP=8

  • EAGLE v2 speculative decode (MTP head in checkpoint)
  • TRT-LLM NSA backends for Blackwell prefill and decode
  • BF16 KV (avoids EAGLE + FP8 KV failure modes)
  • glm45 reasoning parser, glm47 tool parser

Crash-monitor thread calls os._exit(1) if the child process dies

Log-streaming thread forwards server output to the Modal dashboard

@modal.web_server exposes port 8000 directly (no sidecar)

Clients

OpenAI SDK, curl, agents

HTTPS to /v1/chat/completions
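The topology above can be sketched as a single Modal function: volumes mounted, SGLang launched as a child process, a watchdog thread that kills the container if the server dies, and `@modal.web_server` exposing port 8000. Names, GPU string, and SGLang flags are illustrative, not the guide's exact deploy.py; check Modal and SGLang docs for your versions.

```python
import os
import subprocess
import threading

import modal

app = modal.App("glm51-serve")  # app and volume names are illustrative
weights = modal.Volume.from_name("glm51-model-weights")
dg_cache = modal.Volume.from_name("glm51-deepgemm-cache")

@app.function(
    image=modal.Image.from_registry("lmsysorg/sglang:latest"),
    gpu="B200:8",
    volumes={"/model-cache": weights, "/dg-cache": dg_cache},
    timeout=86400,
)
@modal.web_server(port=8000)
def serve():
    # Launch SGLang; flag set is a sketch (TP=8, OpenAI-compatible HTTP).
    proc = subprocess.Popen(
        [
            "python", "-m", "sglang.launch_server",
            "--model-path", "/model-cache",
            "--tp", "8",
            "--port", "8000",
        ]
    )

    def _watch(p: subprocess.Popen) -> None:
        p.wait()
        os._exit(1)  # crash monitor: tear down the container if SGLang dies

    threading.Thread(target=_watch, args=(proc,), daemon=True).start()
```

Because SGLang itself speaks OpenAI-compatible JSON, the web server decorator routes straight to the engine with no sidecar process.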

04

Serving and optimization stack

SGLang, EAGLE, DeepGEMM, and BF16 KV: how the pieces fit together.

SGLang

Primary server: RadixAttention shares prefix KV across requests (strong for agents and fixed system prompts). OpenAI-compatible HTTP so @modal.web_server can attach directly, no proxy hop.

EAGLE v2

Lightweight multi-token prediction head drafts several tokens; the main model verifies in parallel. Large decode speedup without shipping a second draft model. Watch TTFT under very high concurrency on a single replica.

DeepGEMM

FP8 GEMM kernels JIT-compiled for SM100 shapes. We bake precompiled artifacts into a Modal volume so cold start is not blocked by a 10-to-15-minute JIT compile. Binaries are SM-specific: B200 builds are not portable to H100 or T4.

BF16 KV

We keep KV in BF16 even with FP8 weights: upstream issues around FP8 KV plus EAGLE, accuracy on some decode paths, and extra quant overhead. BF16 is the stable default for this stack and long context.
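A back-of-envelope shows what the BF16-vs-FP8 KV choice costs. The layer count and per-token KV width below are placeholders, not GLM-5.1's actual MLA geometry; the point is only the 2 bytes/element (BF16) versus 1 byte/element (FP8) trade against 1.5 TB of VRAM.

```python
def kv_bytes(layers: int, kv_dim: int, tokens: int, bytes_per_elem: int) -> int:
    """KV cache size: layers x tokens x kv_dim elements (K and V folded into kv_dim)."""
    return layers * tokens * kv_dim * bytes_per_elem

# Illustrative numbers only -- not GLM-5.1's real MLA dimensions.
LAYERS, KV_DIM, TOKENS = 60, 1024, 200_000

bf16 = kv_bytes(LAYERS, KV_DIM, TOKENS, 2)
fp8 = kv_bytes(LAYERS, KV_DIM, TOKENS, 1)
print(f"BF16 KV: {bf16 / 2**30:.1f} GiB, FP8 KV: {fp8 / 2**30:.1f} GiB")
```

Even at these made-up dimensions, a full 200K-token BF16 KV is tens of GiB per sequence, which 8×192 GB can absorb; halving it with FP8 KV is not worth the EAGLE and accuracy issues cited above.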

05

Scale-to-zero profile

Autoscaler knobs you will see in deploy.py; confirm values for your workload.

Modal autoscaler knobs for this profile. Values are illustrative of the guide deployment; confirm in your own deploy.py. Cost and cold-start implications are covered in Tune & Operate.

min_containers
0
Scale to zero when idle. Lowest baseline cost, longest cold path.
max_containers
3
Caps replicas. With 48 concurrent slots per replica, up to 144 concurrent requests across three replicas.
scaledown_window
900
Seconds idle before scale-in (15 min). Smooths bursty traffic without immediate teardown.
max_inputs
20
Per-container queue depth at the Modal layer, so sudden request spikes don't starve GPU memory.
timeout
86400
24h container lifetime to mitigate slow leaks or fragmentation on very long runs.
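In deploy.py these knobs attach to the serving function roughly as below. This is a sketch: the GPU string and the placement of the concurrency cap (here via `@modal.concurrent`) are assumptions to verify against current Modal docs and your own deploy.py.

```python
import modal

app = modal.App("glm51-serve")  # name is illustrative

@app.function(
    gpu="B200:8",            # GPU string is illustrative
    min_containers=0,        # scale to zero when idle
    max_containers=3,        # cap replicas (3 x 48 slots = 144 requests)
    scaledown_window=900,    # 15 min idle before scale-in
    timeout=86400,           # 24 h container lifetime
)
@modal.concurrent(max_inputs=20)  # per-container queue depth at the Modal layer
def serve():
    ...  # launch SGLang, attach @modal.web_server, etc.
```

Raising `max_containers` trades cost for burst headroom; lengthening `scaledown_window` trades idle cost for fewer cold starts.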
06

Performance snapshot

Order-of-magnitude decode and latency numbers; validate on your traffic mix.

Order-of-magnitude numbers from internal runs and SGLang cookbook-style measurements. Validate on your own mix of concurrency, context length, and tools. For tuning detail and cold-start trade-offs, see Tune & Operate.

TPOT (approx.)

Baseline ~20 ms
EAGLE ~7.7 ms

EAGLE verify path.

Decode throughput (aggregate)

Baseline ~1,750 tok/s
EAGLE 4,600+ tok/s

TTFT (warm, low concurrency)

~246 ms

EAGLE can inflate TTFT when concurrency per replica is very high; cap max running requests accordingly.

EAGLE accept length

~3.5 tokens

Typical draft acceptance.
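The snapshot numbers are roughly self-consistent, which is worth checking on your own runs. A hedged back-of-envelope using the figures above (all values copied from this section, none measured here):

```python
baseline_tpot_ms = 20.0   # per-token decode latency without speculation
observed_tpot_ms = 7.7    # with EAGLE
accept_len = 3.5          # avg tokens accepted per EAGLE verify step

# Ideal case: one verify step costs about one baseline step but yields
# accept_len tokens, so TPOT divides by accept_len. Draft-head overhead and
# rejected tokens land the observed number above this floor.
ideal_tpot_ms = baseline_tpot_ms / accept_len          # ~5.7 ms floor
speedup = baseline_tpot_ms / observed_tpot_ms          # ~2.6x realized

# The throughput ratio should roughly match the TPOT speedup.
throughput_ratio = 4600 / 1750                         # ~2.6x, consistent
```

If your measured speedup drifts far below the accept-length floor, suspect verify-path overhead or concurrency pressure rather than the draft head itself.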

07

Design choices

Engine, GPU, KV, speculative decode, and images: the decisions this guide encodes.

Engine

SGLang

Strong MoE throughput in many setups versus vLLM, native EAGLE and RadixAttention, and a Modal-friendly single HTTP process.

GPU

8×B200

192 GB per GPU headroom for BF16 KV and long context. TRT-LLM DSA and NSA paths target Blackwell SM100.

KV cache

BF16

Avoids FP8 KV plus EAGLE crashes (#22359), extra overhead (#17526), and some decode accuracy issues (#21291).

Spec decode

EAGLE v2

Large decode speedup with an MTP head shipped in the checkpoint: no separate draft model to deploy.

Proxy

None

SGLang speaks OpenAI-compatible JSON; @modal.web_server routes straight to port 8000.

Download image

debian-slim + HF Hub

CPU-only weight sync avoids pulling a multi-gigabyte CUDA image for I/O-only tasks.

Serve image

lmsysorg/sglang:latest

Bundles CUDA, SGLang, and DeepGEMM. Do not overwrite huggingface-hub in the image; that keeps import compatibility with the sync jobs.
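The download-image pattern above can be sketched as a CPU-only Modal function that syncs weights into the volume with `huggingface_hub.snapshot_download`. The repo id is a placeholder; volume and app names are illustrative.

```python
import modal

app = modal.App("glm51-download")  # names are illustrative
weights = modal.Volume.from_name("glm51-model-weights", create_if_missing=True)

# CPU-only image: no CUDA needed just to move bytes.
image = modal.Image.debian_slim().pip_install("huggingface_hub")

@app.function(image=image, volumes={"/model-cache": weights}, timeout=4 * 3600)
def download():
    from huggingface_hub import snapshot_download

    # Repo id is a placeholder -- substitute the actual GLM-5.1-FP8 repo.
    snapshot_download(repo_id="<hf-org>/GLM-5.1-FP8", local_dir="/model-cache")
    weights.commit()  # persist the ~700 GB of safetensors to the volume
```

Keeping this on debian-slim means weight syncs never pull the multi-gigabyte CUDA serve image, and the serve path never blocks on Hub downloads.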