GLM-5.1 FP8 on Modal
Code Walkthrough
Annotated deploy.py: lifecycle, monitoring, and graceful shutdown.
Annotated excerpts from deploy.py: volumes, launch command, warmup, and the Modal Server lifecycle. Pair with Configuration & Flags for the full matrix and Tune & Operate for operational context.
Three lifecycle phases
Cheap CPU work, one-time GPU compilation, and long-running GPU serving; cold starts avoid redundant network and JIT costs.
| Phase | Compute | Volumes / port | Purpose |
|---|---|---|---|
| download_model() | CPU debian-slim | modal.Volume /model-cache | Idempotent HF snapshot_download (~700 GB) |
| compile_deepgemm() | 8×B200 + SGLang image | Both volumes mounted | sglang.compile_deep_gemm JIT → cache in /dg-cache |
| Server | 8×B200 + SGLang image | Web server port 8000 | Subprocess launch, health, warmup, crash monitor, graceful shutdown |
Volumes & mounts
Fixed mount points so MODEL_PATH and DeepGEMM cache align with SGLang. Always reload() both volumes in GPU entrypoints.
```python
model_volume = modal.Volume.from_name("glm51-model-weights", create_if_missing=True)
dg_volume = modal.Volume.from_name("glm51-deepgemm-cache", create_if_missing=True)

VOLUME_MOUNTS = {
    "/model-cache": model_volume,
    "/dg-cache": dg_volume,
}
```
Building the SGLang command
Centralizes bug mitigations (#21291 TRT-LLM backends, #22359 BF16 KV by omission) and performance defaults.
```python
def _build_sglang_cmd(api_key: str = "") -> list[str]:
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--served-model-name", MODEL_NAME,
        "--tp", str(GPU_COUNT),
        "--host", "0.0.0.0",
        "--port", str(PORT),
        "--trust-remote-code",
        "--ep", "1",
        "--attention-backend", "nsa",
        "--nsa-decode-backend", "trtllm",
        "--nsa-prefill-backend", "trtllm",
        "--moe-runner-backend", "flashinfer_trtllm",
        "--enable-flashinfer-allreduce-fusion",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
        "--reasoning-parser", "glm45",
        "--tool-call-parser", "glm47",
        "--mem-fraction-static", "0.88",
        "--max-running-requests", "48",
        "--max-prefill-tokens", "32768",
        "--max-total-tokens", "65536",
        "--watchdog-timeout", "1200",
        "--enable-metrics",
    ]
    if api_key:
        cmd.extend(["--api-key", api_key])
    return cmd
```
| Argument cluster | Purpose |
|---|---|
| --model-path /model-cache/GLM-5.1-FP8 | Volume-mounted weights |
| --served-model-name glm-5.1 | OpenAI model field |
| --tp 8 | Tensor parallel = GPU count |
| --trust-remote-code | Load custom model code from repo |
| --attention-backend nsa | Blackwell NSA path |
| --nsa-decode-backend / --nsa-prefill-backend trtllm | Stable decode on B200 (#21291) |
| --moe-runner-backend flashinfer_trtllm | MoE routing on TRT-LLM path |
| --enable-flashinfer-allreduce-fusion | Fuse TP all-reduce with the residual add + RMSNorm step |
| --speculative-algorithm EAGLE (+ steps/topk/draft tokens) | Speculative decoding |
| --reasoning-parser glm45 / --tool-call-parser glm47 | GLM chat templates |
| No --kv-cache-dtype fp8 | Default BF16 KV - mitigates #22359, #17526 |
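The mitigations in the table can be guarded with a small check on the argument list before launch. The following is a minimal sketch and is not taken from deploy.py: `flag_value` and `check_sglang_cmd` are hypothetical helpers that only inspect a list of strings shaped like the one `_build_sglang_cmd` returns.

```python
from __future__ import annotations


def flag_value(cmd: list[str], flag: str) -> str | None:
    """Return the token following `flag`, or None if the flag is absent or last."""
    if flag in cmd:
        idx = cmd.index(flag)
        if idx + 1 < len(cmd):
            return cmd[idx + 1]
    return None


def check_sglang_cmd(cmd: list[str], gpu_count: int) -> list[str]:
    """Collect human-readable problems; an empty list means every check passed."""
    problems = []
    # Tensor parallelism should match the number of GPUs in the container.
    if flag_value(cmd, "--tp") != str(gpu_count):
        problems.append("--tp must equal the GPU count")
    # The walkthrough relies on the default BF16 KV cache by omitting this flag.
    if "--kv-cache-dtype" in cmd:
        problems.append("--kv-cache-dtype is set; expected it to be omitted")
    # Both NSA backends are pinned to trtllm in the walkthrough.
    for flag in ("--nsa-decode-backend", "--nsa-prefill-backend"):
        if flag_value(cmd, flag) != "trtllm":
            problems.append(f"{flag} should be trtllm")
    return problems
```

Calling `check_sglang_cmd(_build_sglang_cmd(), GPU_COUNT)` at import time and raising on a non-empty result would be one way to catch an accidental flag change before a GPU container is started.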
Warmup & CUDA graphs
Warmup exercises the chat, no-thinking, long-context, and tool-calling paths so that CUDA graph capture latency is paid during startup rather than by the first customer request.
```python
def _warmup(port: int, api_key: str = ""):
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"http://127.0.0.1:{port}/v1/chat/completions"

    payloads = [
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32},
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Explain quicksort in one sentence."}],
         "max_tokens": 64, "chat_template_kwargs": {"enable_thinking": False}},
        # ... long-context and tool-calling payloads ...
    ]
    for i, payload in enumerate(payloads):
        r = requests.post(url, headers=headers, json=payload, timeout=300)
        # validates 200 + non-empty assistant message / tool_calls / reasoning_content
```
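The validation step is only described in a comment in the excerpt. A minimal sketch of such a check, assuming the standard OpenAI-style response shape, might look like the following; `warmup_response_ok` is a hypothetical name, not a function from deploy.py.

```python
def warmup_response_ok(status_code: int, body: dict) -> bool:
    """Accept a warmup reply only if it is a 200 with some assistant output."""
    if status_code != 200:
        return False
    choices = body.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    # Any one of these fields being non-empty counts as a real answer:
    # plain text, a tool call request, or reasoning output.
    return any(message.get(key) for key in ("content", "tool_calls", "reasoning_content"))
```

Inside the warmup loop this would be called as `warmup_response_ok(r.status_code, r.json())`.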
download_model()
CPU-only Modal function keeps $/hr tiny while pulling ~700 GB. Sentinel guard makes the job safe to re-run.
```python
@app.function(
    image=_download_image,
    volumes={"/model-cache": model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=7200,
    cpu=4,
    memory=32768,
)
def download_model():
    sentinel = os.path.join(MODEL_PATH, WEIGHT_SENTINEL)
    if os.path.exists(sentinel):
        print("[download] Already cached. Skipping.")
        return
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_REPO, local_dir=MODEL_PATH, max_workers=8)
    model_volume.commit()
```
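The sentinel guard is a general idempotency pattern and can be tried out without Modal or Hugging Face. The sketch below is an illustration under one assumption: the sentinel is a file that the download itself produces. `ensure_downloaded` and the `fetch` callback are hypothetical names.

```python
import os
from typing import Callable


def ensure_downloaded(target_dir: str, sentinel_name: str,
                      fetch: Callable[[str], None]) -> bool:
    """Run `fetch(target_dir)` unless the sentinel file already exists.

    Returns True if a download was performed, False if it was skipped.
    """
    sentinel = os.path.join(target_dir, sentinel_name)
    if os.path.exists(sentinel):
        return False  # already cached, nothing to do
    os.makedirs(target_dir, exist_ok=True)
    fetch(target_dir)  # assumed to write the sentinel as part of its output
    return True
```

Picking a file that is written late in the download as the sentinel reduces the chance that an interrupted run is mistaken for a complete one.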
compile_deepgemm()
Runs python3 -m sglang.compile_deep_gemm with TP identical to serving. Streams stdout line-by-line for Modal logs.
```python
process = subprocess.Popen(
    ["python3", "-m", "sglang.compile_deep_gemm",
     "--model", MODEL_PATH, "--tp", str(GPU_COUNT)],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    text=True, bufsize=1,
)
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line)
```
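The excerpt stops at the read loop. A self-contained sketch of the same streaming pattern, extended to wait for the child and return its exit code, is shown here; how deploy.py actually handles the exit code is not shown in the excerpt, so `run_and_stream` is a hypothetical helper.

```python
import subprocess
import sys


def run_and_stream(cmd: list) -> int:
    """Run `cmd`, echo its combined stdout/stderr line by line, return the exit code."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        bufsize=1,                 # line-buffered in text mode
    )
    assert process.stdout is not None
    for line in process.stdout:
        sys.stdout.write(line)
        sys.stdout.flush()
    return process.wait()
```

A caller could raise on a non-zero return value so that a failed compilation is not followed by a cache commit.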
Server class: setup, serve, shutdown
@modal.web_server binds traffic to port 8000. Log stream, health wait, warmup, then crash monitor only after startup succeeds.
```python
@app.cls(
    gpu=f"{GPU_TYPE}:{GPU_COUNT}",
    volumes=VOLUME_MOUNTS,
    secrets=[modal.Secret.from_name("glm51-api-key")],
    timeout=86400,
    min_containers=0,
    max_containers=3,
    scaledown_window=900,
)
@modal.concurrent(max_inputs=20)
class Server:
    process: subprocess.Popen | None = None
    _startup_complete: bool = False

    @modal.enter()
    def setup(self):
        model_volume.reload()
        dg_volume.reload()
        # sentinel + optional DeepGEMM marker checks ...

    @modal.web_server(port=PORT, startup_timeout=900)
    def serve(self):
        api_key = os.environ.get("API_KEY", "")
        self.process = subprocess.Popen(
            _build_sglang_cmd(api_key),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
            env={**os.environ, "HF_HUB_OFFLINE": "1"},
        )
        threading.Thread(target=_stream_subprocess_logs, args=(self.process.stdout,), daemon=True).start()
        if not _wait_for_health(PORT, timeout=900):
            raise TimeoutError("SGLang startup timeout")
        _warmup(PORT, api_key)
        self._startup_complete = True
        threading.Thread(target=self._monitor_process, daemon=True).start()

    @modal.exit()
    def shutdown(self):
        if self.process and self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=30)
            except subprocess.TimeoutExpired:
                self.process.kill()
```
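The body of `_wait_for_health` is not part of the excerpt. One plausible shape, using only the standard library and assuming the `/health` route returns 200 once the server is ready, is sketched below; `wait_for_health` and its `interval` parameter are assumptions for illustration, not the deploy.py implementation.

```python
import time
import urllib.request


def wait_for_health(port: int, timeout: float, interval: float = 2.0) -> bool:
    """Poll http://127.0.0.1:<port>/health until it returns 200 or `timeout` elapses."""
    url = f"http://127.0.0.1:{port}/health"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and urllib's URLError/HTTPError,
            # all of which mean "not ready yet" during startup.
            pass
        time.sleep(interval)
    return False
```

A production version would probably also stop early if the SGLang subprocess has already exited, rather than waiting out the full timeout.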
Crash monitor
If SGLang dies after startup, os._exit(1) forces Modal to recycle the container instead of serving 502s.
```python
def _monitor_process(self):
    while True:
        time.sleep(10)
        if not self._startup_complete:
            continue
        if self.process is not None and self.process.poll() is not None:
            print(f"[monitor] FATAL: SGLang died (code {self.process.returncode})")
            time.sleep(1)
            os._exit(1)  # force Modal to recycle the container
```
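Because `os._exit(1)` terminates the interpreter, the monitor is awkward to exercise in isolation. A testable variant of the same idea, assuming the exit action is passed in as a callback, could look like this; `watch_process` and `on_exit` are hypothetical names.

```python
import subprocess
import sys
import time
from typing import Callable


def watch_process(process: subprocess.Popen,
                  on_exit: Callable[[int], None],
                  poll_interval: float = 10.0) -> None:
    """Block until `process` exits, then hand its return code to `on_exit`."""
    while process.poll() is None:
        time.sleep(poll_interval)
    on_exit(process.returncode)


if __name__ == "__main__":
    # Demonstration with a short-lived child instead of an inference server.
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(2)"])
    watch_process(child, lambda code: print(f"child exited with {code}"), poll_interval=0.1)
```

In the serving container, the callback would be the place to log the failure and call `os._exit(1)`.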
Local entrypoint
modal run deploy.py without a target chains download → compile → verify, which is handy when the laptop should not pull hundreds of GB locally.
```python
@app.local_entrypoint()
def main():
    download_model.remote()
    compile_deepgemm.remote()
    ok = verify_setup.remote()
```
API Usage: client examples for every feature the endpoint exposes.
API usage examples
OpenAI SDK-compatible. These examples work with any OpenAI client library pointed at your Modal deployment URL.
Client setup
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-production-key",
    base_url="https://<your-app>.modal.run/v1",
)
```
Basic completion
```python
# Basic chat completion
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quicksort."},
    ],
)
print(response.choices[0].message.content)
```
Streaming
Tokens arrive as they're generated. Lower perceived latency for interactive UIs.
```python
# Streaming response - tokens arrive as generated
stream = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Thinking mode (chain-of-thought)
GLM-5.1 has thinking enabled by default. The chain-of-thought reasoning appears in a separate reasoning_content field — never mixed into content.
```python
# Thinking mode (ON by default)
# Chain-of-thought appears in reasoning_content, never mixed into content
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Solve x² + 5x + 6 = 0"}],
)
print("Thinking:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

# Thinking OFF - direct answer, saves tokens
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
Streaming with reasoning
Reasoning streams first, then the final content. Useful for showing "thinking" indicators in UI.
```python
# Streaming with reasoning - reasoning streams first, then content
for chunk in client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
    stream=True,
):
    delta = chunk.choices[0].delta
    # Reasoning streams first (gray), then content
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        print(f"\033[90m{delta.reasoning_content}\033[0m", end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```
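When the UI needs the full reasoning and answer as separate strings rather than printed output, the deltas can be accumulated. The sketch below assumes chunks shaped like the ones in the streaming example. `collect_stream` is a hypothetical helper, and the demo uses stand-in objects rather than a live stream.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning_text, answer_text)."""
    reasoning, content = [], []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        reasoning_piece = getattr(delta, "reasoning_content", None)
        if reasoning_piece:
            reasoning.append(reasoning_piece)
        content_piece = getattr(delta, "content", None)
        if content_piece:
            content.append(content_piece)
    return "".join(reasoning), "".join(content)


def _fake_chunk(**delta_fields):
    """Build a stand-in for a streamed chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta_fields))])


demo = [
    _fake_chunk(reasoning_content="Assume it is rational. ", content=None),
    _fake_chunk(reasoning_content="Contradiction.", content=None),
    _fake_chunk(content="√2 is irrational."),
    SimpleNamespace(choices=[]),
]
reasoning, answer = collect_stream(demo)
```

With a real client, the iterable returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `demo`.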
Tool / function calling
Two-step flow: model requests a tool call, you execute it, then send the result back for the final response.
```python
# Tool/function calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# Step 1: Model decides to call a tool
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Tool: {tool_call.function.name}({tool_call.function.arguments})")

# Step 2: Execute the tool and send result back
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    response.choices[0].message,
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": '{"temperature": 22, "condition": "sunny"}',
    },
]
final = client.chat.completions.create(model="glm-5.1", messages=messages, tools=tools)
print(final.choices[0].message.content)
```
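The example hard-codes the tool result. The "execute it" step could be sketched as a small registry that maps tool names to local functions and always returns a JSON string suitable for the `tool` message. `TOOL_REGISTRY`, `run_tool`, and the stub `get_weather` body are hypothetical and stand in for real application code.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stub tool; a real implementation would query a weather service."""
    return {"city": city, "unit": unit, "temperature": 22, "condition": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool(name: str, arguments_json: str) -> str:
    """Execute a registered tool and return its result (or an error) as JSON text."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json or "{}")
        return json.dumps(handler(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the function signature.
        return json.dumps({"error": str(exc)})
```

In the two-step flow, the `content` of the `tool` message would then be `run_tool(tool_call.function.name, tool_call.function.arguments)`. Returning errors as JSON instead of raising lets the model see the failure and respond to it.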
JSON mode
Force structured JSON output for downstream parsing.
```python
import json

# JSON mode - structured output
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{
        "role": "user",
        "content": "List 3 planets as JSON with name and diameter_km."
    }],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))
```
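A reply can still fail to parse, for instance if generation stops at a token limit partway through an object, so downstream code may want a defensive wrapper around `json.loads`. This is a minimal sketch; `parse_json_reply` is a hypothetical helper.

```python
import json


def parse_json_reply(content):
    """Return the decoded object, or None when the reply is empty or not valid JSON."""
    if not content:
        return None
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None
```

A `None` result is a signal to retry or to raise `max_tokens`, depending on the application.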
cURL example
```bash
# cURL example
curl https://<your-app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
Available endpoints
The deployment exposes these routes on the same host. All are OpenAI-compatible where applicable.
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions (streaming + non-streaming) |
| /v1/completions | POST | Legacy text completions |
| /v1/models | GET | List available models |
| /health | GET | Health check (200 when ready) |
| /metrics | GET | Prometheus metrics |
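The `/metrics` route returns the Prometheus text exposition format, which is simple enough to inspect without a Prometheus server. The sketch below sums all samples of one metric across label sets. `sum_metric` is a hypothetical helper, and the metric names in the sample are made up for illustration rather than taken from SGLang.

```python
def sum_metric(text: str, metric: str) -> float:
    """Sum the values of every sample line whose metric name equals `metric`."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        if "{" in line:
            # name{label="..."} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            # name value [timestamp]
            name, *fields = line.split()
        if name == metric and fields:
            total += float(fields[0])
    return total


# Hypothetical sample payload; real metric names will differ.
sample = """# HELP example_requests_total Hypothetical counter.
# TYPE example_requests_total counter
example_requests_total{model="glm-5.1",code="200"} 7
example_requests_total{model="glm-5.1",code="500"} 1
example_queue_depth 3
"""
```

The text itself could be fetched with any HTTP client from `https://<your-app>.modal.run/metrics`. Whether that route requires the API key is not stated in this section.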