GLM-5.1 FP8 on Modal
Configuration & Flags
SGLang flags, API secrets, and Modal decorators.
Every flag below is wired through _build_sglang_cmd() and Modal decorators in deploy.py. Values mirror the production v4 script: Blackwell backends, EAGLE speculative decoding, BF16 KV cache (no FP8 KV dtype), and extended watchdog for multi-minute weight loads.
Rationale for tuning (EAGLE trade-offs, BF16 KV, cold start) lives in Tune & Operate.
Top-level Python constants
Tune these before deploying, especially the keep-alive cron flags if you use the optional /health ping.
| Constant | Value | Note |
|---|---|---|
| MODEL_REPO | zai-org/GLM-5.1-FP8 | Hugging Face model id |
| MODEL_NAME | glm-5.1 | Served model id in API |
| GPU_TYPE / GPU_COUNT | B200 / 8 | Must match DeepGEMM compile hardware |
| ENABLE_KEEPALIVE_CRON | False | Set True + DEPLOYED_URL for business-hours ping |
| DEPLOYED_URL | "" | Base URL for keep_warm cron /health |
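As module-level constants in deploy.py, the table above reduces to a few lines; a minimal sketch (names follow the table, the keep-alive pair left at their safe defaults):

```python
# Top-level constants in deploy.py (sketch; values mirror the table above).
MODEL_REPO = "zai-org/GLM-5.1-FP8"   # Hugging Face model id to download
MODEL_NAME = "glm-5.1"               # model id exposed on the OpenAI-compatible API
GPU_TYPE = "B200"                    # must match the DeepGEMM compile hardware
GPU_COUNT = 8
ENABLE_KEEPALIVE_CRON = False        # set True (plus DEPLOYED_URL) for business-hours pings
DEPLOYED_URL = ""                    # base URL the keep_warm cron hits when enabled
```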
SGLang serving flags
Passed to python3 -m sglang.launch_server, grouped the same way they read in the launch command.
Model loading & parallelism
| Flag | Value | Rationale |
|---|---|---|
| --model-path | /model-cache/GLM-5.1-FP8 | Modal volume mount path where snapshot_download wrote weights. |
| --served-model-name | glm-5.1 | OpenAI-compatible model id for /v1/models and chat requests. |
| --tp | 8 | Tensor parallel degree - shards the MoE across all B200s; EP stays 1. |
| --trust-remote-code | enabled | Required for newer GLM architecture code in the HF repo. |
| --ep | 1 | Expert parallelism left at 1; TP alone fits the full model in VRAM. |
| --watchdog-timeout | 1200 | 20 minutes - default watchdog can kill during 5–7 min weight load from volume. |
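Assembled as a flat argv list, the loading and parallelism group looks roughly like this — a sketch of what _build_sglang_cmd() in deploy.py produces for this section (the helper name comes from the source; the list layout and the standalone function below are assumptions):

```python
def build_loading_flags(model_path: str = "/model-cache/GLM-5.1-FP8",
                        served_name: str = "glm-5.1") -> list[str]:
    """Sketch of the model-loading & parallelism portion of the launch command."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--served-model-name", served_name,
        "--tp", "8",                   # shard the MoE across all eight B200s
        "--ep", "1",                   # expert parallelism off; TP fits the model
        "--trust-remote-code",         # GLM architecture code lives in the HF repo
        "--watchdog-timeout", "1200",  # survive the 5-7 min weight load from volume
    ]
```

The remaining flag groups below are appended to the same list before the subprocess launch.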
Inference backends (Blackwell)
| Flag | Value | Rationale |
|---|---|---|
| --attention-backend | nsa | NVIDIA sparse attention path tuned for long context on Blackwell. |
| --nsa-decode-backend / --nsa-prefill-backend | trtllm | Consistent TRT-LLM kernels across prefill/decode - avoids flashmla accuracy issues on B200. |
| --moe-runner-backend | flashinfer_trtllm | MoE expert routing through FlashInfer + TRT-LLM integration. |
| --enable-flashinfer-allreduce-fusion | True | Fuses tensor-parallel all-reduce with attention to cut comm overhead. |
Speculative decoding (EAGLE v2)
| Flag | Value | Rationale |
|---|---|---|
| --speculative-algorithm | EAGLE | Uses built-in MTP head for draft tokens. |
| --speculative-num-steps | 3 | Draft depth - balances acceptance vs verify cost. |
| --speculative-eagle-topk | 1 | Top-1 draft per step for high acceptance rate. |
| --speculative-num-draft-tokens | 4 | Draft window size k - main model verifies in one forward where possible. |
Reasoning & tool parsers
| Flag | Value | Rationale |
|---|---|---|
| --reasoning-parser | glm45 | Splits chain-of-thought into reasoning_content in OpenAI-compatible payloads. |
| --tool-call-parser | glm47 | Deterministic tool-call formatting for GLM-5.1 templates. |
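With --reasoning-parser glm45, chat responses carry the chain-of-thought in a separate reasoning_content field alongside content. A sketch of reading both from a response dict (the payload here is illustrative, not a captured response):

```python
def split_reasoning(choice: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from one OpenAI-compatible chat choice."""
    msg = choice["message"]
    return msg.get("reasoning_content", ""), msg.get("content", "")

# Illustrative payload shaped like SGLang's parsed output.
choice = {"message": {"role": "assistant",
                      "reasoning_content": "User asked for 2+2; trivial arithmetic.",
                      "content": "4"}}
reasoning, answer = split_reasoning(choice)
```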
KV cache & throughput
| Flag | Value | Rationale |
|---|---|---|
| --mem-fraction-static | 0.88 | Aggressive KV reservation after weights - safe on 192 GB HBM3e per GPU. |
| --max-running-requests | 48 | Beyond this, EAGLE verifier queues grow and TTFT spikes under load. |
| --max-prefill-tokens | 32768 | Caps a single prefill batch to reduce prefill storms / OOM risk. |
| --max-total-tokens | 65536 | Per-request token ceiling so one client cannot monopolize KV. |
| --kv-cache-dtype fp8 | (omitted) | Intentionally not set - default BF16 KV mitigates #22359, #17526, #21291 class issues. |
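To see what --mem-fraction-static 0.88 buys in absolute terms: each B200 carries 192 GB of HBM3e, and the static pool holds weights plus KV. A back-of-envelope calculation (the per-GPU weight share is estimated from the ~700 GB checkpoint pull mentioned below, not measured):

```python
HBM_PER_GPU_GB = 192          # B200 HBM3e capacity
MEM_FRACTION_STATIC = 0.88    # --mem-fraction-static
WEIGHTS_TOTAL_GB = 700        # rough checkpoint size (matches the 700 GB pull)
GPUS = 8

static_pool_gb = MEM_FRACTION_STATIC * HBM_PER_GPU_GB   # ~169 GB reserved per GPU
weights_per_gpu_gb = WEIGHTS_TOTAL_GB / GPUS            # ~87.5 GB of weights per GPU
kv_budget_gb = static_pool_gb - weights_per_gpu_gb      # ~81 GB left for KV cache
```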
Stability & observability
| Flag | Value | Rationale |
|---|---|---|
| --enable-metrics | True | Surfaces Prometheus metrics for latency, KV usage, and engine health. |
| --api-key | from API_KEY secret (optional in cmd) | When set, SGLang requires Authorization on inference routes. |
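When the API_KEY secret is wired into --api-key, inference routes require a Bearer token. A minimal client sketch using only the standard library (the endpoint URL and key are placeholders):

```python
import json
import urllib.request

BASE_URL = "https://example.modal.run"   # placeholder; use your deployed URL
API_KEY = "sk-placeholder"               # value of the glm51-api-key secret

payload = {"model": "glm-5.1",
           "messages": [{"role": "user", "content": "Hello"}]}
req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment against a live deployment
```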
Modal infrastructure
Compute, timeouts, secrets, and image environment variables.
Compute & scaling (@app.cls)
| Parameter | Value | Rationale |
|---|---|---|
| gpu | B200:8 | Must match DeepGEMM compile job - SM100-specific binaries. |
| min_containers | 0 | $0 at idle; first request pays cold start. |
| max_containers | 3 | Hard cap on spend and parallelism; at most 3 × 48 = 144 running requests fleet-wide. |
| scaledown_window | 900 | 15 min keep-warm after last request for bursty traffic. |
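In deploy.py these land on the class decorator. A sketch of the shape (parameter names follow current Modal conventions; image, volumes, and secrets are elided):

```python
import modal

app = modal.App("glm51-sglang")

@app.cls(
    gpu="B200:8",        # must match the DeepGEMM compile job (SM100 binaries)
    min_containers=0,    # scale to zero; the first request pays the cold start
    max_containers=3,    # hard spend cap: 3 x 48 running requests max
    scaledown_window=900,  # keep warm 15 min after the last request
    timeout=86400,       # recycle containers daily
)
class Server:
    ...
```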
Timeouts
| Scope | Value | Rationale |
|---|---|---|
| Server @app.cls timeout | 86400 | 24h max container life - limits fragmentation / leaks. |
| @modal.web_server startup_timeout | 900 | 15 min for health to pass after process start. |
| download_model timeout | 7200 | 2h ceiling for 700 GB pull. |
| compile_deepgemm timeout | 3600 | 1h ceiling for compile + streaming logs. |
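The CPU-side functions carry their own ceilings. A sketch of the download function (body abbreviated; snapshot_download and the secret name come from this page, while the app, volume object, and function body layout are assumptions about deploy.py):

```python
@app.function(
    timeout=7200,  # 2 h ceiling for the ~700 GB pull
    secrets=[modal.Secret.from_name("huggingface-secret")],  # supplies HF_TOKEN
    volumes={"/model-cache": model_volume},  # volume object defined elsewhere
)
def download_model():
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_REPO, local_dir="/model-cache/GLM-5.1-FP8")
```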
Secrets
| Secret name | Keys | Used by |
|---|---|---|
| huggingface-secret | HF_TOKEN | download_model (CPU function) |
| glm51-api-key | API_KEY | Server class - passed into SGLang --api-key when set |
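Each secret surfaces as an environment variable inside its container (attached via modal.Secret.from_name). The optional --api-key flag can then be derived like this (a sketch; the helper name is ours):

```python
import os

def maybe_api_key_flags() -> list[str]:
    """Return ["--api-key", <key>] only when the glm51-api-key secret set API_KEY."""
    key = os.environ.get("API_KEY")
    return ["--api-key", key] if key else []
```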
Image environment variables
| Variable | Value | Where | Why |
|---|---|---|---|
| HF_XET_HIGH_PERFORMANCE | 1 | Download + SGLang images | Faster chunked transfers for huge HF artifacts. |
| SGLANG_ENABLE_SPEC_V2 | 1 | SGLang image | Enables latest speculative decoding pipeline (EAGLE v2). |
| SGLANG_DG_CACHE_DIR | /dg-cache | SGLang image | DeepGEMM writes compiled kernels into the mounted volume. |
| HF_HUB_OFFLINE | 1 | Subprocess env in serve() | Serving should read weights only from volume - no runtime hub fetch. |
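On the image side these are set with Modal's .env(); a sketch (base image choice and pip installs elided):

```python
import modal

sglang_image = (
    modal.Image.debian_slim()
    .env({
        "HF_XET_HIGH_PERFORMANCE": "1",      # faster chunked HF transfers
        "SGLANG_ENABLE_SPEC_V2": "1",        # EAGLE v2 speculative pipeline
        "SGLANG_DG_CACHE_DIR": "/dg-cache",  # DeepGEMM kernels persist to the volume
    })
)

# HF_HUB_OFFLINE=1 is set on the serve() subprocess env instead, so the
# server only ever reads weights from the mounted volume at runtime.
```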
Known issues & mitigations
Cross-check with upstream issues before changing KV dtype or NSA backends.
| Issue | Symptom | Our fix |
|---|---|---|
| #22359 | EAGLE + FP8 KV: `Unsupported h_q` error / crash on Blackwell | Do not pass an FP8 KV dtype - keep the BF16 KV default. |
| #21291 | flashmla_kv decode accuracy regression on B200 | Force TRT-LLM backends for NSA decode and prefill. |
| #17526 | FP8 KV slower than BF16 due to quant overhead | BF16 KV default. |
| DeepGEMM JIT | 10–15 min blocked compile on cold start if not pre-cached | Run compile_deepgemm once on B200; persist to glm51-deepgemm-cache. |
| Watchdog default | Process killed mid weight load | --watchdog-timeout 1200 in launch command. |
Next steps
- Tune & Operate covers EAGLE trade-offs, BF16 KV rationale, warmup & keep-alive, and diagnostics.
- Code Walkthrough shows how flags are assembled and how subprocess lifecycle is managed.