GLM-5.1 FP8 on Modal
Configuration & Flags
SGLang flags, API secrets, and Modal decorators.
Every flag below is wired through _build_sglang_cmd() and Modal decorators in deploy.py. Values mirror the production v4 script: Blackwell backends, EAGLE speculative decoding, BF16 KV cache (no FP8 KV dtype), and extended watchdog for multi-minute weight loads.
Rationale for tuning (EAGLE trade-offs, BF16 KV, cold start) lives in Tune & Operate.
Top-level Python constants
Tune these before deploying, especially the keep-alive cron flags if you use the optional /health ping.
| Constant | Value | Note |
|---|---|---|
| MODEL_REPO | zai-org/GLM-5.1-FP8 | Hugging Face model id |
| MODEL_NAME | glm-5.1 | Served model id in API |
| GPU_TYPE / GPU_COUNT | B200 / 8 | Must match DeepGEMM compile hardware |
| ENABLE_KEEPALIVE_CRON | False | Set True + DEPLOYED_URL for business-hours ping |
| DEPLOYED_URL | "" | Base URL for keep_warm cron /health |
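As module-level constants in deploy.py, the table above reduces to a few lines; a minimal sketch (names follow the table, the keep-alive pair left at their safe defaults):

```python
# Top-level constants in deploy.py (sketch; values mirror the table above).
MODEL_REPO = "zai-org/GLM-5.1-FP8"   # Hugging Face model id to download
MODEL_NAME = "glm-5.1"               # model id exposed on the OpenAI-compatible API
GPU_TYPE = "B200"                    # must match the DeepGEMM compile hardware
GPU_COUNT = 8
ENABLE_KEEPALIVE_CRON = False        # set True (plus DEPLOYED_URL) for business-hours pings
DEPLOYED_URL = ""                    # base URL the keep_warm cron hits when enabled
```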
SGLang serving flags
Passed to python3 -m sglang.launch_server, grouped the same way they read in the launch command.
Model loading & parallelism
| Flag | Value | Rationale |
|---|---|---|
| --model-path | /model-cache/GLM-5.1-FP8 | Modal volume mount path where snapshot_download wrote weights. |
| --served-model-name | glm-5.1 | OpenAI-compatible model id for /v1/models and chat requests. |
| --tp | 8 | Tensor parallel degree - shards the MoE across all B200s; EP stays 1. |
| --trust-remote-code | enabled | Required for newer GLM architecture code in the HF repo. |
| --ep | 1 | Expert parallelism left at 1; TP alone fits the full model in VRAM. |
| --watchdog-timeout | 1200 | 20 minutes - default watchdog can kill during 5–7 min weight load from volume. |
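Assembled as a flat argv list, the loading and parallelism group looks roughly like this — a sketch of what _build_sglang_cmd() in deploy.py produces for this section (the helper name comes from the source; the list layout and the standalone function below are assumptions):

```python
def build_loading_flags(model_path: str = "/model-cache/GLM-5.1-FP8",
                        served_name: str = "glm-5.1") -> list[str]:
    """Sketch of the model-loading & parallelism portion of the launch command."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--served-model-name", served_name,
        "--tp", "8",                   # shard the MoE across all eight B200s
        "--ep", "1",                   # expert parallelism off; TP fits the model
        "--trust-remote-code",         # GLM architecture code lives in the HF repo
        "--watchdog-timeout", "1200",  # survive the 5-7 min weight load from volume
    ]
```

The remaining flag groups below are appended to the same list before the subprocess launch.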
Inference backends (Blackwell)
| Flag | Value | Rationale |
|---|---|---|
| --attention-backend | nsa | NVIDIA sparse attention path tuned for long context on Blackwell. |
| --nsa-decode-backend / --nsa-prefill-backend | trtllm | Consistent TRT-LLM kernels across prefill/decode - avoids flashmla accuracy issues on B200. |
| --moe-runner-backend | flashinfer_trtllm | MoE expert routing through FlashInfer + TRT-LLM integration. |
| --enable-flashinfer-allreduce-fusion | True | Fuses tensor-parallel all-reduce with attention to cut comm overhead. |
Speculative decoding (EAGLE v2)
| Flag | Value | Rationale |
|---|---|---|
| --speculative-algorithm | EAGLE | Uses built-in MTP head for draft tokens. |
| --speculative-num-steps | 3 | Draft depth - balances acceptance vs verify cost. |
| --speculative-eagle-topk | 1 | Top-1 draft per step for high acceptance rate. |
| --speculative-num-draft-tokens | 4 | Draft window size k - main model verifies in one forward where possible. |
Reasoning & tool parsers
| Flag | Value | Rationale |
|---|---|---|
| --reasoning-parser | glm45 | Splits chain-of-thought into reasoning_content in OpenAI-compatible payloads. |
| --tool-call-parser | glm47 | Deterministic tool-call formatting for GLM-5.1 templates. |
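With --reasoning-parser glm45, chat responses carry the chain-of-thought in a separate reasoning_content field alongside content. A sketch of reading both from a response dict (the payload here is illustrative, not a captured response):

```python
def split_reasoning(choice: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from one OpenAI-compatible chat choice."""
    msg = choice["message"]
    return msg.get("reasoning_content", ""), msg.get("content", "")

# Illustrative payload shaped like SGLang's parsed output.
choice = {"message": {"role": "assistant",
                      "reasoning_content": "User asked for 2+2; trivial arithmetic.",
                      "content": "4"}}
reasoning, answer = split_reasoning(choice)
```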
KV cache & throughput
| Flag | Value | Rationale |
|---|---|---|
| --mem-fraction-static | 0.88 | Aggressive KV reservation after weights - safe on 192 GB HBM3e per GPU. |
| --max-running-requests | 48 | Beyond this, EAGLE verifier queues grow and TTFT spikes under load. |
| --max-prefill-tokens | 32768 | Caps a single prefill batch to reduce prefill storms / OOM risk. |
| --max-total-tokens | 65536 | Per-request token ceiling so one client cannot monopolize KV. |
| --kv-cache-dtype fp8 | (omitted) | Intentionally not set - default BF16 KV mitigates #22359, #17526, #21291 class issues. |
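To see what --mem-fraction-static 0.88 buys in absolute terms: each B200 carries 192 GB of HBM3e, and the static pool holds weights plus KV. A back-of-envelope calculation (the per-GPU weight share is estimated from the ~700 GB checkpoint pull mentioned below, not measured):

```python
HBM_PER_GPU_GB = 192          # B200 HBM3e capacity
MEM_FRACTION_STATIC = 0.88    # --mem-fraction-static
WEIGHTS_TOTAL_GB = 700        # rough checkpoint size (matches the 700 GB pull)
GPUS = 8

static_pool_gb = MEM_FRACTION_STATIC * HBM_PER_GPU_GB   # ~169 GB reserved per GPU
weights_per_gpu_gb = WEIGHTS_TOTAL_GB / GPUS            # ~87.5 GB of weights per GPU
kv_budget_gb = static_pool_gb - weights_per_gpu_gb      # ~81 GB left for KV cache
```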
Stability & observability
| Flag | Value | Rationale |
|---|---|---|
| --enable-metrics | True | Surfaces Prometheus metrics for latency, KV usage, and engine health. |
| --api-key | from API_KEY secret (optional in cmd) | When set, SGLang requires Authorization on inference routes. |
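When the API_KEY secret is wired into --api-key, inference routes require a Bearer token. A minimal client sketch using only the standard library (the endpoint URL and key are placeholders):

```python
import json
import urllib.request

BASE_URL = "https://example.modal.run"   # placeholder; use your deployed URL
API_KEY = "sk-placeholder"               # value of the glm51-api-key secret

payload = {"model": "glm-5.1",
           "messages": [{"role": "user", "content": "Hello"}]}
req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment against a live deployment
```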
Modal infrastructure
Compute, timeouts, secrets, and image environment variables.
Compute & scaling (@app.cls)
| Parameter | Value | Rationale |
|---|---|---|
| gpu | B200:8 | Must match DeepGEMM compile job - SM100-specific binaries. |
| min_containers | 0 | $0 at idle; first request pays cold start. |
| max_containers | 3 | Hard cap on spend and parallelism; at most 3 × 48 = 144 running requests fleet-wide. |
| scaledown_window | 900 | 15 min keep-warm after last request for bursty traffic. |
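In deploy.py these land on the class decorator. A sketch of the shape (parameter names follow current Modal conventions; image, volumes, and secrets are elided):

```python
import modal

app = modal.App("glm51-sglang")

@app.cls(
    gpu="B200:8",        # must match the DeepGEMM compile job (SM100 binaries)
    min_containers=0,    # scale to zero; the first request pays the cold start
    max_containers=3,    # hard spend cap: 3 x 48 running requests max
    scaledown_window=900,  # keep warm 15 min after the last request
    timeout=86400,       # recycle containers daily
)
class Server:
    ...
```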
Timeouts
| Scope | Value | Rationale |
|---|---|---|
| Server @app.cls timeout | 86400 | 24h max container life - limits fragmentation / leaks. |
| @modal.web_server startup_timeout | 900 | 15 min for health to pass after process start. |
| download_model timeout | 7200 | 2h ceiling for 700 GB pull. |
| compile_deepgemm timeout | 3600 | 1h ceiling for compile + streaming logs. |
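The CPU-side functions carry their own ceilings. A sketch of the download function (body abbreviated; snapshot_download and the secret name come from this page, while the app, volume object, and function body layout are assumptions about deploy.py):

```python
@app.function(
    timeout=7200,  # 2 h ceiling for the ~700 GB pull
    secrets=[modal.Secret.from_name("huggingface-secret")],  # supplies HF_TOKEN
    volumes={"/model-cache": model_volume},  # volume object defined elsewhere
)
def download_model():
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_REPO, local_dir="/model-cache/GLM-5.1-FP8")
```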
Secrets
| Secret name | Keys | Used by |
|---|---|---|
| huggingface-secret | HF_TOKEN | download_model (CPU function) |
| glm51-api-key | API_KEY | Server class - passed into SGLang --api-key when set |
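Each secret surfaces as an environment variable inside its container (attached via modal.Secret.from_name). The optional --api-key flag can then be derived like this (a sketch; the helper name is ours):

```python
import os

def maybe_api_key_flags() -> list[str]:
    """Return ["--api-key", <key>] only when the glm51-api-key secret set API_KEY."""
    key = os.environ.get("API_KEY")
    return ["--api-key", key] if key else []
```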
Image environment variables
| Variable | Value | Where | Why |
|---|---|---|---|
| HF_XET_HIGH_PERFORMANCE | 1 | Download + SGLang images | Faster chunked transfers for huge HF artifacts. |
| SGLANG_ENABLE_SPEC_V2 | 1 | SGLang image | Enables latest speculative decoding pipeline (EAGLE v2). |
| SGLANG_DG_CACHE_DIR | /dg-cache | SGLang image | DeepGEMM writes compiled kernels into the mounted volume. |
| HF_HUB_OFFLINE | 1 | Subprocess env in serve() | Serving should read weights only from volume - no runtime hub fetch. |
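On the image side these are set with Modal's .env(); a sketch (base image choice and pip installs elided):

```python
import modal

sglang_image = (
    modal.Image.debian_slim()
    .env({
        "HF_XET_HIGH_PERFORMANCE": "1",      # faster chunked HF transfers
        "SGLANG_ENABLE_SPEC_V2": "1",        # EAGLE v2 speculative pipeline
        "SGLANG_DG_CACHE_DIR": "/dg-cache",  # DeepGEMM kernels persist to the volume
    })
)

# HF_HUB_OFFLINE=1 is set on the serve() subprocess env instead, so the
# server only ever reads weights from the mounted volume at runtime.
```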
Known issues & mitigations
Cross-check with upstream issues before changing KV dtype or NSA backends.
| Issue | Symptom | Our fix |
|---|---|---|
| #22359 | EAGLE + FP8 KV: `Unsupported h_q` error / crash on Blackwell | Do not pass an FP8 KV dtype - keep the BF16 KV default. |
| #21291 | flashmla_kv decode accuracy regression on B200 | Force TRT-LLM backends for NSA decode and prefill. |
| #17526 | FP8 KV slower than BF16 due to quant overhead | BF16 KV default. |
| DeepGEMM JIT | 10–15 min blocked compile on cold start if not pre-cached | Run compile_deepgemm once on B200; persist to glm51-deepgemm-cache. |
| Watchdog default | Process killed mid weight load | --watchdog-timeout 1200 in launch command. |
Next steps
- Tune & Operate covers EAGLE trade-offs, BF16 KV rationale, warmup & keep-alive, and diagnostics.
- Code Walkthrough shows how flags are assembled and how subprocess lifecycle is managed.