
GLM-5.1 FP8 on Modal

Code Walkthrough

Annotated deploy.py: lifecycle, monitoring, and graceful shutdown.


Annotated excerpts from deploy.py: volumes, launch command, warmup, and the Modal Server lifecycle. Pair with Configuration & Flags for the full matrix and Tune & Operate for operational context.

01

Three lifecycle phases

Splitting cheap CPU work, one-time GPU compilation, and long-running GPU serving into separate phases lets cold starts skip redundant network and JIT costs.

| Phase | Unit | Environment | Purpose |
|---|---|---|---|
| download_model() | CPU debian-slim | modal.Volume /model-cache | Idempotent HF snapshot_download (~700 GB) |
| compile_deepgemm() | 8×B200 + SGLang image | Both volumes mounted | sglang.compile_deep_gemm JIT → cache in /dg-cache |
| Server | 8×B200 + SGLang image | Web server, port 8000 | Subprocess launch, health, warmup, crash monitor, graceful shutdown |
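
The "Unit" column implies two container images. A plausible sketch of their definitions follows; the image names, tags, and pinned packages here are assumptions for illustration, not taken from deploy.py:

```python
import modal

# Hypothetical image definitions matching the phase table above.
# Slim CPU image for the download phase: only needs huggingface_hub.
_download_image = modal.Image.debian_slim().pip_install("huggingface_hub")

# GPU image for compile + serve phases, built from an SGLang registry
# image (tag is a placeholder; pin a real release in practice).
_sglang_image = modal.Image.from_registry("lmsysorg/sglang:latest")
```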
02

Volumes & mounts

Fixed mount points so MODEL_PATH and DeepGEMM cache align with SGLang. Always reload() both volumes in GPU entrypoints.

deploy.py

```python
model_volume = modal.Volume.from_name("glm51-model-weights", create_if_missing=True)
dg_volume = modal.Volume.from_name("glm51-deepgemm-cache", create_if_missing=True)

VOLUME_MOUNTS = {
    "/model-cache": model_volume,
    "/dg-cache": dg_volume,
}
```
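
The excerpts on this page also reference a handful of module-level constants. A sketch of plausible values, inferred from the tables in this walkthrough (WEIGHT_SENTINEL and MODEL_REPO are hypothetical stand-ins; the real deploy.py may differ):

```python
# Module-level constants assumed by the excerpts in this walkthrough.
MODEL_REPO = "zai-org/GLM-5.1-FP8"       # hypothetical HF repo id
MODEL_PATH = "/model-cache/GLM-5.1-FP8"  # weights inside the mounted volume
MODEL_NAME = "glm-5.1"                   # name exposed via the OpenAI API
GPU_TYPE = "B200"
GPU_COUNT = 8                            # --tp must match this
PORT = 8000
WEIGHT_SENTINEL = ".download_complete"   # hypothetical sentinel filename
```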
03

Building the SGLang command

Centralizes bug mitigations (TRT-LLM NSA backends for #21291; BF16 KV cache, by omitting --kv-cache-dtype fp8, for #22359) and performance defaults.

deploy.py

```python
def _build_sglang_cmd(api_key: str = "") -> list[str]:
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--served-model-name", MODEL_NAME,
        "--tp", str(GPU_COUNT),
        "--host", "0.0.0.0",
        "--port", str(PORT),
        "--trust-remote-code",
        "--ep", "1",
        "--attention-backend", "nsa",
        "--nsa-decode-backend", "trtllm",
        "--nsa-prefill-backend", "trtllm",
        "--moe-runner-backend", "flashinfer_trtllm",
        "--enable-flashinfer-allreduce-fusion",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
        "--reasoning-parser", "glm45",
        "--tool-call-parser", "glm47",
        "--mem-fraction-static", "0.88",
        "--max-running-requests", "48",
        "--max-prefill-tokens", "32768",
        "--max-total-tokens", "65536",
        "--watchdog-timeout", "1200",
        "--enable-metrics",
    ]
    if api_key:
        cmd.extend(["--api-key", api_key])
    return cmd
```
| Argument cluster | Purpose |
|---|---|
| --model-path /model-cache/GLM-5.1-FP8 | Volume-mounted weights |
| --served-model-name glm-5.1 | OpenAI model field |
| --tp 8 | Tensor parallel = GPU count |
| --trust-remote-code | Load custom model code from the repo |
| --attention-backend nsa | Blackwell NSA path |
| --nsa-decode-backend / --nsa-prefill-backend trtllm | Stable decode on B200 (#21291) |
| --moe-runner-backend flashinfer_trtllm | MoE routing on the TRT-LLM path |
| --enable-flashinfer-allreduce-fusion | Fuse TP all-reduce into the attention kernel |
| --speculative-algorithm EAGLE (+ steps/topk/draft tokens) | Speculative decoding |
| --reasoning-parser glm45 / --tool-call-parser glm47 | GLM chat templates |
| No --kv-cache-dtype fp8 | Default BF16 KV cache mitigates #22359, #17526 |
04

Warmup & CUDA graphs

Warmup hits the chat, no-thinking, long-context, and tool-calling paths so the first customer request does not pay CUDA graph capture latency on its own.

deploy.py

```python
def _warmup(port: int, api_key: str = ""):
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"http://127.0.0.1:{port}/v1/chat/completions"

    payloads = [
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32},
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Explain quicksort in one sentence."}],
         "max_tokens": 64, "chat_template_kwargs": {"enable_thinking": False}},
        # ... long-context and tool-calling payloads ...
    ]
    for i, payload in enumerate(payloads):
        r = requests.post(url, headers=headers, json=payload, timeout=300)
        # validates 200 + non-empty assistant message / tool_calls / reasoning_content
```
05

download_model()

A CPU-only Modal function keeps the hourly cost tiny while pulling ~700 GB of weights. A sentinel guard makes the job safe to re-run.

deploy.py

```python
@app.function(
    image=_download_image,
    volumes={"/model-cache": model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=7200,
    cpu=4,
    memory=32768,
)
def download_model():
    sentinel = os.path.join(MODEL_PATH, WEIGHT_SENTINEL)
    if os.path.exists(sentinel):
        print("[download] Already cached. Skipping.")
        return
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_REPO, local_dir=MODEL_PATH, max_workers=8)
    open(sentinel, "w").close()  # write the sentinel so re-runs can skip
    model_volume.commit()
```
06

compile_deepgemm()

Runs python3 -m sglang.compile_deep_gemm with the same TP as serving. Streams stdout line by line into the Modal logs.

deploy.py

```python
process = subprocess.Popen(
    ["python3", "-m", "sglang.compile_deep_gemm",
     "--model", MODEL_PATH, "--tp", str(GPU_COUNT)],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    text=True, bufsize=1,
)
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line)
process.wait()
if process.returncode != 0:
    raise RuntimeError(f"compile_deep_gemm exited with code {process.returncode}")
```
07

Server class: setup, serve, shutdown

@modal.web_server binds traffic to port 8000. serve() streams logs, waits for health, runs warmup, and starts the crash monitor only after startup succeeds.

deploy.py

```python
@app.cls(
    gpu=f"{GPU_TYPE}:{GPU_COUNT}",
    volumes=VOLUME_MOUNTS,
    secrets=[modal.Secret.from_name("glm51-api-key")],
    timeout=86400,
    min_containers=0,
    max_containers=3,
    scaledown_window=900,
)
@modal.concurrent(max_inputs=20)
class Server:
    process: subprocess.Popen | None = None
    _startup_complete: bool = False

    @modal.enter()
    def setup(self):
        model_volume.reload()
        dg_volume.reload()
        # sentinel + optional DeepGEMM marker checks ...

    @modal.web_server(port=PORT, startup_timeout=900)
    def serve(self):
        api_key = os.environ.get("API_KEY", "")
        self.process = subprocess.Popen(
            _build_sglang_cmd(api_key),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
            env={**os.environ, "HF_HUB_OFFLINE": "1"},
        )
        threading.Thread(target=_stream_subprocess_logs, args=(self.process.stdout,), daemon=True).start()
        if not _wait_for_health(PORT, timeout=900):
            raise TimeoutError("SGLang startup timeout")
        _warmup(PORT, api_key)
        self._startup_complete = True
        threading.Thread(target=self._monitor_process, daemon=True).start()

    @modal.exit()
    def shutdown(self):
        if self.process and self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=30)
            except subprocess.TimeoutExpired:
                self.process.kill()
```
08

Crash monitor

If SGLang dies after startup, os._exit(1) forces Modal to recycle the container instead of serving 502s.

deploy.py

```python
def _monitor_process(self):
    while True:
        time.sleep(10)
        if not self._startup_complete:
            continue
        if self.process is not None and self.process.poll() is not None:
            print(f"[monitor] FATAL: SGLang died (code {self.process.returncode})")
            time.sleep(1)  # give the log line a moment to flush
            os._exit(1)  # force Modal to recycle the container
```
09

Local entrypoint

Running modal run deploy.py without a target chains download → compile → verify, which is handy when your laptop should not pull hundreds of gigabytes locally.

deploy.py

```python
@app.local_entrypoint()
def main():
    download_model.remote()
    compile_deepgemm.remote()
    ok = verify_setup.remote()
    if not ok:
        raise SystemExit("verify_setup reported problems")
```
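
Once deployed, the Server endpoint speaks the OpenAI chat API. A hypothetical client helper; the modal.run URL and API key are placeholders you substitute with the values printed by modal deploy:

```python
import requests


def chat(base_url: str, api_key: str, prompt: str, max_tokens: int = 64) -> str:
    """Send one chat completion to the deployed SGLang endpoint."""
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        json={"model": "glm-5.1",
              "messages": [{"role": "user", "content": prompt}],
              "max_tokens": max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


# chat("https://<workspace>--<app>-server-serve.modal.run", "<API_KEY>", "Hello")
```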