GLM-5.1 FP8 on Modal
Code walkthrough
Annotated excerpts from deploy.py: volumes, launch command, warmup, and the Modal Server lifecycle. Pair with Configuration & Flags for the full matrix and Tune & Operate for operational context.
Three lifecycle phases
The deployment splits into cheap CPU work, one-time GPU compilation, and long-running GPU serving, so cold starts never repeat download or JIT costs.
| Phase | Hardware & image | Mounts / ports | Purpose |
|---|---|---|---|
| download_model() | CPU, debian-slim | modal.Volume at /model-cache | Idempotent HF snapshot_download (~700 GB) |
| compile_deepgemm() | 8×B200, SGLang image | Both volumes mounted | sglang.compile_deep_gemm JIT → cache in /dg-cache |
| Server | 8×B200, SGLang image | Web server on port 8000 | Subprocess launch, health wait, warmup, crash monitor, graceful shutdown |
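All three phases share a handful of module-level constants. MODEL_PATH, MODEL_NAME, GPU_COUNT, GPU_TYPE, and PORT appear in the excerpts below; the repo id and sentinel filename are not shown in deploy.py, so the values here are illustrative guesses:

```python
# Constants assumed by the excerpts in this walkthrough.
# MODEL_REPO and WEIGHT_SENTINEL values are assumptions, not taken from deploy.py.
MODEL_REPO = "zai-org/GLM-5.1-FP8"                 # hypothetical HF repo id
MODEL_PATH = "/model-cache/GLM-5.1-FP8"            # matches the flag table below
MODEL_NAME = "glm-5.1"                             # --served-model-name
GPU_TYPE = "B200"
GPU_COUNT = 8                                      # tensor parallel degree
PORT = 8000
WEIGHT_SENTINEL = "model.safetensors.index.json"   # assumed "download complete" marker
```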
Volumes & mounts
Fixed mount points so MODEL_PATH and DeepGEMM cache align with SGLang. Always reload() both volumes in GPU entrypoints.
```python
model_volume = modal.Volume.from_name("glm51-model-weights", create_if_missing=True)
dg_volume = modal.Volume.from_name("glm51-deepgemm-cache", create_if_missing=True)

VOLUME_MOUNTS = {
    "/model-cache": model_volume,
    "/dg-cache": dg_volume,
}
```
Building the SGLang command
Centralizes the bug mitigations (#21291: TRT-LLM NSA backends; #22359: BF16 KV cache by omitting the FP8 flag) and the performance defaults in one place.
```python
def _build_sglang_cmd(api_key: str = "") -> list[str]:
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--served-model-name", MODEL_NAME,
        "--tp", str(GPU_COUNT),
        "--host", "0.0.0.0",
        "--port", str(PORT),
        "--trust-remote-code",
        "--ep", "1",
        "--attention-backend", "nsa",
        "--nsa-decode-backend", "trtllm",
        "--nsa-prefill-backend", "trtllm",
        "--moe-runner-backend", "flashinfer_trtllm",
        "--enable-flashinfer-allreduce-fusion",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
        "--reasoning-parser", "glm45",
        "--tool-call-parser", "glm47",
        "--mem-fraction-static", "0.88",
        "--max-running-requests", "48",
        "--max-prefill-tokens", "32768",
        "--max-total-tokens", "65536",
        "--watchdog-timeout", "1200",
        "--enable-metrics",
    ]
    if api_key:
        cmd.extend(["--api-key", api_key])
    return cmd
```
| Argument cluster | Purpose |
|---|---|
| --model-path /model-cache/GLM-5.1-FP8 | Volume-mounted weights |
| --served-model-name glm-5.1 | OpenAI model field |
| --tp 8 | Tensor parallel = GPU count |
| --trust-remote-code | Load custom model code from repo |
| --attention-backend nsa | Blackwell NSA path |
| --nsa-decode-backend / --nsa-prefill-backend trtllm | Stable decode on B200 (#21291) |
| --moe-runner-backend flashinfer_trtllm | MoE routing on TRT-LLM path |
| --enable-flashinfer-allreduce-fusion | Fuse TP all-reduce into attention kernel |
| --speculative-algorithm EAGLE (+ steps/topk/draft tokens) | Speculative decoding |
| --reasoning-parser glm45 / --tool-call-parser glm47 | GLM chat templates |
| No --kv-cache-dtype fp8 | Default BF16 KV cache; mitigates #22359, #17526 |
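With these flags, the server speaks the standard OpenAI chat-completions protocol under the --served-model-name from the table. A minimal client sketch, with a placeholder base URL (Modal assigns the real one at deploy time) and the network call left commented out:

```python
def build_chat_request(base_url: str, api_key: str, prompt: str):
    """Assemble an OpenAI-style chat-completions request for the deployed server."""
    url = f"{base_url}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # only needed if --api-key was set
    }
    body = {
        "model": "glm-5.1",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return url, headers, body

# Placeholder URL for illustration; the actual send is omitted in this sketch:
url, headers, body = build_chat_request("https://example.modal.run", "sk-...", "Hello")
# requests.post(url, headers=headers, json=body, timeout=300)
```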
Warmup & CUDA graphs
Exercises the chat, no-thinking, long-context, and tool-calling paths so CUDA graph capture happens before the first customer request, not during it.
```python
def _warmup(port: int, api_key: str = ""):
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"http://127.0.0.1:{port}/v1/chat/completions"

    payloads = [
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32},
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Explain quicksort in one sentence."}],
         "max_tokens": 64, "chat_template_kwargs": {"enable_thinking": False}},
        # ... long-context and tool-calling payloads ...
    ]
    for i, payload in enumerate(payloads):
        r = requests.post(url, headers=headers, json=payload, timeout=300)
        # validates 200 + non-empty assistant message / tool_calls / reasoning_content
```
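The long-context and tool-calling payloads are elided above. For illustration only, a tool-calling payload following the same OpenAI schema might look like this; the weather tool is invented here and is not from deploy.py:

```python
MODEL_NAME = "glm-5.1"  # matches --served-model-name

# Hypothetical tool-calling warmup payload (the real ones in deploy.py are elided).
# Its purpose is to exercise the --tool-call-parser glm47 path before real traffic.
tool_payload = {
    "model": MODEL_NAME,
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "max_tokens": 64,
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # invented tool for illustration
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
```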
download_model()
CPU-only Modal function keeps $/hr tiny while pulling ~700 GB. Sentinel guard makes the job safe to re-run.
```python
@app.function(
    image=_download_image,
    volumes={"/model-cache": model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=7200,
    cpu=4,
    memory=32768,
)
def download_model():
    sentinel = os.path.join(MODEL_PATH, WEIGHT_SENTINEL)
    if os.path.exists(sentinel):
        print("[download] Already cached. Skipping.")
        return
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_REPO, local_dir=MODEL_PATH, max_workers=8)
    model_volume.commit()
```
compile_deepgemm()
Runs python3 -m sglang.compile_deep_gemm with TP identical to serving. Streams stdout line-by-line for Modal logs.
```python
process = subprocess.Popen(
    ["python3", "-m", "sglang.compile_deep_gemm",
     "--model", MODEL_PATH, "--tp", str(GPU_COUNT)],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    text=True, bufsize=1,
)
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line)
```
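The same stream-then-wait pattern in a self-contained form, with the exit-code check that a compile step would want; the stand-in command here replaces the actual compile_deep_gemm invocation:

```python
import subprocess
import sys

def run_streaming(cmd: list[str]) -> int:
    """Launch cmd, mirror its merged stdout/stderr line-by-line, return the exit code."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,  # merge streams for one log
        text=True, bufsize=1,                              # line-buffered text mode
    )
    for line in iter(process.stdout.readline, ""):
        sys.stdout.write(line)  # shows up in Modal logs in near real time
    return process.wait()       # surface compile failures instead of swallowing them

# Stand-in command for illustration; deploy.py would pass the compile_deep_gemm args.
code = run_streaming([sys.executable, "-c", "print('compiling...')"])
```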
Server class: setup, serve, shutdown
@modal.web_server binds traffic to port 8000. Log stream, health wait, warmup, then crash monitor only after startup succeeds.
```python
@app.cls(
    gpu=f"{GPU_TYPE}:{GPU_COUNT}",
    volumes=VOLUME_MOUNTS,
    secrets=[modal.Secret.from_name("glm51-api-key")],
    timeout=86400,
    min_containers=0,
    max_containers=3,
    scaledown_window=900,
)
@modal.concurrent(max_inputs=20)
class Server:
    process: subprocess.Popen | None = None
    _startup_complete: bool = False

    @modal.enter()
    def setup(self):
        model_volume.reload()
        dg_volume.reload()
        # sentinel + optional DeepGEMM marker checks ...

    @modal.web_server(port=PORT, startup_timeout=900)
    def serve(self):
        api_key = os.environ.get("API_KEY", "")
        self.process = subprocess.Popen(
            _build_sglang_cmd(api_key),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
            env={**os.environ, "HF_HUB_OFFLINE": "1"},
        )
        threading.Thread(target=_stream_subprocess_logs, args=(self.process.stdout,), daemon=True).start()
        if not _wait_for_health(PORT, timeout=900):
            raise TimeoutError("SGLang startup timeout")
        _warmup(PORT, api_key)
        self._startup_complete = True
        threading.Thread(target=self._monitor_process, daemon=True).start()

    @modal.exit()
    def shutdown(self):
        if self.process and self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=30)
            except subprocess.TimeoutExpired:
                self.process.kill()
```
Crash monitor
If SGLang dies after startup, os._exit(1) forces Modal to recycle the container instead of serving 502s.
```python
def _monitor_process(self):
    while True:
        time.sleep(10)
        if not self._startup_complete:
            continue
        if self.process is not None and self.process.poll() is not None:
            print(f"[monitor] FATAL: SGLang died (code {self.process.returncode})")
            time.sleep(1)
            os._exit(1)  # force Modal to recycle the container
```
Local entrypoint
modal run deploy.py without a target chains download → compile → verify entirely on Modal, which is handy because your laptop never has to pull hundreds of GB.
```python
@app.local_entrypoint()
def main():
    download_model.remote()
    compile_deepgemm.remote()
    ok = verify_setup.remote()
    if not ok:
        raise SystemExit("verify_setup failed; see remote logs")
```