Overview & Architecture
Model specs, hardware choices, and design decisions.
Production GLM-5.1-FP8 on Modal with SGLang, 8× NVIDIA B200, EAGLE speculative decoding, and BF16 KV. Traffic lands on SGLang's native OpenAI-compatible HTTP surface: no extra proxy.
Specs, API surface, and how replicas scale on Modal.
Model
GLM-5.1-FP8
MoE: 754B total, ~40B active per token. 256 routed experts, 8 active. MIT license.
Context
200K in
Long context via MLA-compressed KV and DSA-style sparse attention paths in the base model. This guide caps prefill and total context below the model maximum as a safety margin.
Engine
SGLang 0.5.10+
RadixAttention, OpenAI-compatible HTTP, EAGLE v2. Use a build with correct FP8 dequant kernels (naive load can be wrong).
Hardware
8× B200
192 GB HBM3e each, 1.5 TB VRAM total. Tensor parallel TP=8. Blackwell SM100 (kernels are architecture-locked).
Precision
FP8 E4M3 + BF16 KV
Weights FP8 for bandwidth and tensor-core throughput. KV stays BF16 for stability with EAGLE and known FP8-KV issues upstream.
API
/v1/chat/completions
Streaming, tools, reasoning_content. HTTPS to Modal, no extra proxy.
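A minimal client sketch against that surface, using only the standard library. The base URL and model id are placeholders for your own deployment; the payload shape is standard OpenAI-compatible chat-completions JSON.

```python
import json
import urllib.request

# Hypothetical Modal endpoint; substitute your deployment's hostname.
BASE_URL = "https://your-workspace--glm-server.modal.run"

def build_chat_request(messages, stream=False):
    """Build a POST to /v1/chat/completions (OpenAI-compatible JSON)."""
    payload = {
        "model": "glm-5.1-fp8",  # illustrative; match your server's served model id
        "messages": messages,
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "Hello"}])
```

`urllib.request.urlopen(req)` would send it; the OpenAI SDK works the same way with `base_url` pointed at the Modal hostname.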
Scaling
min=0 · max=3 · down=900s
Scale-to-zero when idle, bounded replicas, 15 min scaledown window.
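A sketch of how those knobs map onto a Modal function definition. Parameter names follow Modal's current SDK and should be verified against your version (older releases used `container_idle_timeout` and `keep_warm`); this is a config fragment, not the guide's deploy.py.

```python
import modal

app = modal.App("glm-sglang")

@app.function(
    gpu="B200:8",          # 8x B200, matching TP=8
    min_containers=0,      # scale to zero when idle
    max_containers=3,      # bound replica count (and cost)
    scaledown_window=900,  # 15 min idle before a replica is reclaimed
)
@modal.web_server(8000)
def serve():
    ...
```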
Why this stack exists: model lineage, precision, and what we optimize for in production.
GLM-5.1 on Blackwell: MoE, MLA/DSA, and why FP8 weights with BF16 KV is the default recipe here.
GLM-5.1 is a large MoE autoregressive model from Z.ai (roots in Tsinghua KEG research). The FP8 release targets Hopper and Blackwell tensor cores: E4M3 weights cut memory traffic versus BF16 and unlock higher FP8 math throughput, with typical benchmark deltas under about 1% when kernels and scales are correct.
Multi-head latent attention (MLA) shrinks the KV-cache footprint; DeepSeek-style sparse attention (DSA) trims attention cost on long sequences. Together they make a 200K context window practical on finite VRAM, though serving still needs careful flags, BF16 KV (for this recipe), and precompiled GEMM artifacts for fast cold starts.
GLM-5.1 is also aimed at long-horizon agent workflows (planning, tools, iteration). This guide focuses on production inference on Modal, not benchmark leaderboards.
8-bit float: 1 sign, 4 exponent, 3 mantissa bits. Roughly 2× memory savings versus BF16 weights; Blackwell runs it on native tensor cores, no emulation.
Load the checkpoint only through frameworks that implement the right FP8 dequant path (here: SGLang 0.5.10+). vLLM 0.19.0+ is another option in the ecosystem; wrong loaders silently misbehave.
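To make the E4M3 bit layout concrete, a minimal decoder for a single FP8 byte (the OCP "OFP8" E4M3 variant: exponent bias 7, no infinities, one NaN mantissa pattern). This illustrates the number format only, not the dequant kernel path.

```python
def decode_e4m3(bits: int) -> float:
    """Decode one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")          # single NaN pattern per sign
    if exp == 0:
        return sign * (mant / 8) * 2.0 ** -6   # subnormal
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)

# Largest finite value: S=0, E=1111, M=110 -> 1.75 * 2^8 = 448
print(decode_e4m3(0b0_1111_110))  # 448.0
```

The narrow range (max 448, min subnormal 2^-9) is why per-tensor or per-block scales matter: a loader with wrong scales produces plausible-looking but silently degraded weights.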
Modal volumes, GPU class, and SGLang as the single HTTP front, from weights to clients.
Two Modal volumes isolate slow I/O from the GPU path: weights under /model-cache, the DeepGEMM cache under /dg-cache. The container calls reload() on both at boot so volume metadata stays consistent.
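A sketch of that volume wiring, assuming Modal's `Volume.from_name` API; the volume names mirror the mount points above and are otherwise arbitrary. This is a config fragment, not the guide's deploy.py.

```python
import modal

model_cache = modal.Volume.from_name("model-cache", create_if_missing=True)
dg_cache = modal.Volume.from_name("dg-cache", create_if_missing=True)

# Mounted into the serving function; reload() at boot picks up writes
# committed by the download job after the replica's view was created.
volumes = {"/model-cache": model_cache, "/dg-cache": dg_cache}

def on_boot():
    model_cache.reload()
    dg_cache.reload()
```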
Modal
~700 GB FP8 safetensors at /model-cache (HF, Xet).
Precompiled SM100 FP8 GEMM at /dg-cache (B200-specific).
SGLang TP=8
Crash monitor thread to os._exit(1) if child dies
Log stream thread to Modal dashboard
@modal.web_server to port 8000 (no sidecar)
Clients
OpenAI SDK, curl, agents
HTTPS to /v1/chat/completions
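The crash-monitor and log-stream threads above can be sketched as follows. This is a minimal illustration, not the guide's deploy.py; the `_exit` parameter is a hypothetical test seam, and in production it stays `os._exit` so a dead SGLang child takes the whole container down instead of leaving a replica serving a dead port.

```python
import os
import subprocess
import sys
import threading

def watch(proc: subprocess.Popen, _exit=os._exit) -> None:
    """Block until the SGLang child exits, then exit the container with it."""
    code = proc.wait()
    _exit(code if code != 0 else 1)

def stream_logs(proc: subprocess.Popen) -> None:
    """Forward child stdout line by line so it lands in the Modal dashboard."""
    for line in proc.stdout:
        sys.stdout.write(line)

def supervise(cmd: list) -> subprocess.Popen:
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    threading.Thread(target=watch, args=(proc,), daemon=True).start()
    threading.Thread(target=stream_logs, args=(proc,), daemon=True).start()
    return proc
```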
SGLang, EAGLE, DeepGEMM, and BF16 KV: how the pieces fit together.
Primary server: RadixAttention shares prefix KV across requests (strong for agents and fixed system prompts). OpenAI-compatible HTTP so @modal.web_server can attach directly, no proxy hop.
Lightweight multi-token prediction head drafts several tokens; the main model verifies in parallel. Large decode speedup without shipping a second draft model. Watch TTFT under very high concurrency on a single replica.
FP8 GEMM kernels JIT-compile for SM100 shapes. We bake precompiled artifacts into a Modal volume so cold start is not blocked by a 10-to-15-minute JIT pass. Binaries are SM-specific: B200 builds are not portable to H100 or T4.
We keep KV in BF16 even with FP8 weights: upstream issues around FP8 KV plus EAGLE, accuracy on some decode paths, and extra quant overhead. BF16 is the stable default for this stack and long context.
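Put together, the pieces above imply a launcher invocation roughly like the following. Flag names follow SGLang's CLI but should be checked against your installed version; the model subdirectory and the speculative-decoding numbers are placeholders, not tuned values.

```python
# Illustrative SGLang launch command for this recipe (verify flags
# against your SGLang version before relying on them).
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/model-cache/GLM-5.1-FP8",  # subdir name illustrative
    "--tp", "8",                          # tensor parallel across the 8 B200s
    "--host", "0.0.0.0", "--port", "8000",
    "--kv-cache-dtype", "auto",           # BF16 KV: do NOT request an fp8 KV dtype
    "--speculative-algorithm", "EAGLE",   # MTP head shipped in the checkpoint
    "--speculative-num-steps", "3",       # placeholder, tune per workload
    "--speculative-num-draft-tokens", "4",
]
# subprocess.Popen(cmd) starts the server; see the supervisor pattern
# in the architecture section for crash handling and log streaming.
```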
Autoscaler knobs you will see in deploy.py; confirm values for your workload.
Modal autoscaler knobs for this profile. Values are illustrative of the guide deployment; confirm in your own deploy.py. Cost and cold-start implications are covered in Tune & Operate.
Order-of-magnitude decode and latency numbers; validate on your traffic mix.
Order-of-magnitude numbers from internal runs and SGLang cookbook-style measurements. Validate on your own mix of concurrency, context length, and tools. For tuning detail and cold-start trade-offs, see Tune & Operate.
TPOT (approx.)
EAGLE verify path.
Decode throughput (aggregate)
TTFT (warm, low concurrency)
EAGLE can inflate TTFT when concurrency per replica is very high; cap max running requests accordingly.
EAGLE accept length
Typical draft acceptance.
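A back-of-envelope model (ours, not SGLang's internal accounting) for how accept length maps to decode speedup: each verify step emits the accepted draft tokens plus one bonus token from the target model, while the draft head adds a fractional per-step cost.

```python
def eagle_decode_speedup(mean_accepted: float, draft_overhead: float = 0.15) -> float:
    """Rough speedup vs. plain decode. mean_accepted is the average number
    of draft tokens accepted per verify step; draft_overhead is an assumed
    fractional per-step drafting cost (0.15 is illustrative, not measured)."""
    tokens_per_step = mean_accepted + 1.0   # accepted drafts + one bonus token
    return tokens_per_step / (1.0 + draft_overhead)

print(round(eagle_decode_speedup(2.5), 2))
```

This is why the accept-length metric above is the one to watch: speedup is roughly linear in it, and a workload that tanks acceptance (unusual formats, heavy tool JSON) tanks the decode win with it.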
Engine, GPU, KV, speculative decode, and images: the decisions this guide encodes.
Engine
SGLang
Strong MoE throughput in many setups versus vLLM, native EAGLE and RadixAttention, and a Modal-friendly single HTTP process.
GPU
8×B200
192 GB per GPU leaves headroom for BF16 KV and long context. TRT-LLM DSA and NSA paths target Blackwell SM100.
KV cache
BF16
Avoids FP8 KV plus EAGLE crashes (#22359), extra overhead (#17526), and some decode accuracy issues (#21291).
Spec decode
EAGLE v2
Large decode speedup with an MTP head shipped in the checkpoint: no separate draft model to deploy.
Proxy
None
SGLang speaks OpenAI-compatible JSON; @modal.web_server routes straight to port 8000.
Download image
debian-slim + HF Hub
CPU-only weight sync avoids pulling a multi-gigabyte CUDA image for I/O-only tasks.
Serve image
lmsysorg/sglang:latest
Bundles CUDA, SGLang, and DeepGEMM. Do not overwrite huggingface-hub in the image (import compatibility with sync jobs).
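A sketch of the two-image split, assuming Modal's `Image` builders; a config fragment rather than the guide's exact definitions.

```python
import modal

# CPU-only image for the weight-sync job: no CUDA layers to pull.
download_image = modal.Image.debian_slim().pip_install("huggingface_hub")

# Serving image: the upstream SGLang container bundles CUDA, SGLang,
# and DeepGEMM. Avoid reinstalling huggingface-hub over the bundled
# copy, so imports stay compatible with the sync job.
serve_image = modal.Image.from_registry("lmsysorg/sglang:latest")
```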