
GLM-5.1 FP8 on Modal

Code Walkthrough

Annotated deploy.py: lifecycle, monitoring, and graceful shutdown.

Annotated excerpts from deploy.py: volumes, launch command, warmup, and the Modal Server lifecycle. Pair with Configuration & Flags for the full matrix and Tune & Operate for operational context.

Three lifecycle phases

The deployment splits into cheap CPU work, one-time GPU compilation, and long-running GPU serving, so cold starts do not repeat the weight download or the DeepGEMM JIT compile.

Phase	Unit	Environment	Purpose
download_model()	CPU debian-slim	modal.Volume /model-cache	Idempotent HF snapshot_download (~700 GB)
compile_deepgemm()	8×B200 + SGLang image	Both volumes mounted	sglang.compile_deep_gemm JIT → cache in /dg-cache
Server	8×B200 + SGLang image	Web server port 8000	Subprocess launch, health, warmup, crash monitor, graceful shutdown

Volumes & mounts

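Mount points are fixed so that MODEL_PATH and the DeepGEMM cache directory resolve to the same paths in every function. Call reload() on both volumes at the start of each GPU entrypoint so the container sees the latest committed data.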

deploy.py

model_volume = modal.Volume.from_name("glm51-model-weights", create_if_missing=True)
dg_volume = modal.Volume.from_name("glm51-deepgemm-cache", create_if_missing=True)

VOLUME_MOUNTS = {
    "/model-cache": model_volume,
    "/dg-cache": dg_volume,
}

Building the SGLang command

Centralizes bug mitigations (#21291 TRT-LLM backends, #22359 BF16 KV by omission) and performance defaults.

deploy.py

def _build_sglang_cmd(api_key: str = "") -> list[str]:
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--served-model-name", MODEL_NAME,
        "--tp", str(GPU_COUNT),
        "--host", "0.0.0.0",
        "--port", str(PORT),
        "--trust-remote-code",
        "--ep", "1",
        "--attention-backend", "nsa",
        "--nsa-decode-backend", "trtllm",
        "--nsa-prefill-backend", "trtllm",
        "--moe-runner-backend", "flashinfer_trtllm",
        "--enable-flashinfer-allreduce-fusion",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
        "--reasoning-parser", "glm45",
        "--tool-call-parser", "glm47",
        "--mem-fraction-static", "0.88",
        "--max-running-requests", "48",
        "--max-prefill-tokens", "32768",
        "--max-total-tokens", "65536",
        "--watchdog-timeout", "1200",
        "--enable-metrics",
    ]
    if api_key:
        cmd.extend(["--api-key", api_key])
    return cmd

Argument cluster	Purpose
--model-path /model-cache/GLM-5.1-FP8	Volume-mounted weights
--served-model-name glm-5.1	OpenAI model field
--tp 8	Tensor parallel = GPU count
--trust-remote-code	Load custom model code from repo
--attention-backend nsa	Blackwell NSA path
--nsa-decode-backend / --nsa-prefill-backend trtllm	Stable decode on B200 (#21291)
--moe-runner-backend flashinfer_trtllm	MoE routing on TRT-LLM path
--enable-flashinfer-allreduce-fusion	Fuse TP all-reduce with the residual add + RMSNorm step
--speculative-algorithm EAGLE (+ steps/topk/draft tokens)	Speculative decoding
--reasoning-parser glm45 / --tool-call-parser glm47	Parse GLM reasoning and tool-call output
No --kv-cache-dtype fp8	Default BF16 KV - mitigates #22359, #17526
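
The constraints in the table above can be checked mechanically before launch. The following is a minimal sketch, assuming the command is the flat argv list returned by _build_sglang_cmd; check_sglang_cmd and its messages are hypothetical helpers for illustration and are not part of deploy.py.

```python
from typing import Optional


def check_sglang_cmd(cmd: list[str], gpu_count: int) -> list[str]:
    """Return a list of problems found in an SGLang argv list (empty if none)."""
    problems = []

    def value_of(flag: str) -> Optional[str]:
        # A flag that takes a value is followed directly by that value.
        if flag in cmd:
            i = cmd.index(flag)
            if i + 1 < len(cmd):
                return cmd[i + 1]
        return None

    if value_of("--tp") != str(gpu_count):
        problems.append(f"--tp must equal GPU count ({gpu_count})")
    if "--kv-cache-dtype" in cmd:
        # The guide relies on omitting this flag to keep the BF16 KV default.
        problems.append("--kv-cache-dtype is set; omit it to keep BF16 KV")
    for flag in ("--nsa-decode-backend", "--nsa-prefill-backend"):
        if value_of(flag) != "trtllm":
            problems.append(f"{flag} should be trtllm")
    return problems
```

Calling check_sglang_cmd(_build_sglang_cmd(), GPU_COUNT) at import time and raising on a non-empty result would catch an accidental flag change before a GPU container is started.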

Warmup & CUDA graphs

Exercises the chat, no-thinking, long-context, and tool-calling paths so that the first customer request does not pay graph-capture and other first-run latency by itself.

deploy.py

def _warmup(port: int, api_key: str = ""):
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"http://127.0.0.1:{port}/v1/chat/completions"

    payloads = [
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32},
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Explain quicksort in one sentence."}],
         "max_tokens": 64, "chat_template_kwargs": {"enable_thinking": False}},
        # ... long-context and tool-calling payloads ...
    ]
    for i, payload in enumerate(payloads):
        r = requests.post(url, headers=headers, json=payload, timeout=300)
        # validates 200 + non-empty assistant message / tool_calls / reasoning_content
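
The validation step is only summarized in the excerpt's final comment. One way it could look is sketched below, assuming the response body follows the OpenAI chat completion shape; warmup_response_ok is a hypothetical name, not a function from deploy.py.

```python
def warmup_response_ok(status_code: int, body: dict) -> bool:
    """True if a chat completion response carries any usable assistant output."""
    if status_code != 200:
        return False
    choices = body.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    # Any one of these counts: plain text, a tool call, or reasoning output.
    return bool(
        message.get("content")
        or message.get("tool_calls")
        or message.get("reasoning_content")
    )
```

Inside the warmup loop this would be called as warmup_response_ok(r.status_code, r.json()), with a failed check logged or raised depending on how strict startup should be.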

download_model()

CPU-only Modal function keeps $/hr tiny while pulling ~700 GB. Sentinel guard makes the job safe to re-run.

deploy.py

@app.function(
    image=_download_image,
    volumes={"/model-cache": model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=7200,
    cpu=4,
    memory=32768,
)
def download_model():
    sentinel = os.path.join(MODEL_PATH, WEIGHT_SENTINEL)
    if os.path.exists(sentinel):
        print("[download] Already cached. Skipping.")
        return
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_REPO, local_dir=MODEL_PATH, max_workers=8)
    model_volume.commit()
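
The sentinel guard is a general idempotency pattern and can be tried without Modal or Hugging Face. The sketch below is a stand-alone illustration: ensure_downloaded and fake_fetch are hypothetical, and the sentinel file name config.json is an assumption, since the excerpt does not show the value of WEIGHT_SENTINEL.

```python
import os
import tempfile


def ensure_downloaded(target_dir: str, sentinel_name: str, fetch) -> bool:
    """Run fetch(target_dir) unless the sentinel file already exists.

    Returns True if fetch ran, False if the cached copy was reused.
    """
    sentinel = os.path.join(target_dir, sentinel_name)
    if os.path.exists(sentinel):
        return False
    os.makedirs(target_dir, exist_ok=True)
    fetch(target_dir)
    return True


def fake_fetch(path: str) -> None:
    # Stand-in for snapshot_download: writes the file used as the sentinel.
    with open(os.path.join(path, "config.json"), "w") as f:
        f.write("{}")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_downloaded(tmp, "config.json", fake_fetch)
    second = ensure_downloaded(tmp, "config.json", fake_fetch)
    print(first, second)  # True False
```

The guard is only as reliable as the sentinel: a file that appears before the download finishes would make an interrupted run look complete on the next attempt.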

compile_deepgemm()

Runs python3 -m sglang.compile_deep_gemm with TP identical to serving. Streams stdout line-by-line for Modal logs.

deploy.py

process = subprocess.Popen(
    ["python3", "-m", "sglang.compile_deep_gemm",
     "--model", MODEL_PATH, "--tp", str(GPU_COUNT)],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    text=True, bufsize=1,
)
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line)
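
The excerpt stops at the read loop. A self-contained sketch of the same streaming pattern, extended with an exit-code check, might look like the following; run_streaming is a hypothetical helper, and the child command is a trivial Python one-liner standing in for the compile job.

```python
import subprocess
import sys


def run_streaming(argv: list[str]) -> tuple[int, list[str]]:
    """Run argv, echo merged stdout/stderr line by line, return (exit code, lines)."""
    lines = []
    process = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, bufsize=1,  # line-buffered text mode
    )
    for line in iter(process.stdout.readline, ""):
        sys.stdout.write(line)
        sys.stdout.flush()
        lines.append(line.rstrip("\n"))
    # readline returned "" (EOF); wait() reaps the child and sets returncode.
    return process.wait(), lines


code, out = run_streaming([sys.executable, "-c", "print('step 1'); print('step 2')"])
```

Checking the returned code matters for a compile step: without it, a failed run would be indistinguishable from a successful one in the calling function.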

Server class: setup, serve, shutdown

@modal.web_server routes traffic to port 8000. serve() starts the log stream, waits for the health check, runs warmup, and starts the crash monitor only after startup succeeds.

deploy.py

@app.cls(
    gpu=f"{GPU_TYPE}:{GPU_COUNT}",
    volumes=VOLUME_MOUNTS,
    secrets=[modal.Secret.from_name("glm51-api-key")],
    timeout=86400,
    min_containers=0,
    max_containers=3,
    scaledown_window=900,
)
@modal.concurrent(max_inputs=20)
class Server:
    process: subprocess.Popen | None = None
    _startup_complete: bool = False

    @modal.enter()
    def setup(self):
        model_volume.reload()
        dg_volume.reload()
        # sentinel + optional DeepGEMM marker checks ...

    @modal.web_server(port=PORT, startup_timeout=900)
    def serve(self):
        api_key = os.environ.get("API_KEY", "")
        self.process = subprocess.Popen(
            _build_sglang_cmd(api_key),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
            env={**os.environ, "HF_HUB_OFFLINE": "1"},
        )
        threading.Thread(target=_stream_subprocess_logs, args=(self.process.stdout,), daemon=True).start()
        if not _wait_for_health(PORT, timeout=900):
            raise TimeoutError("SGLang startup timeout")
        _warmup(PORT, api_key)
        self._startup_complete = True
        threading.Thread(target=self._monitor_process, daemon=True).start()

    @modal.exit()
    def shutdown(self):
        if self.process and self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=30)
            except subprocess.TimeoutExpired:
                self.process.kill()
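
The excerpt calls _wait_for_health without showing its body. A plausible shape is a polling loop with a deadline, sketched below with the standard library only; http_probe and wait_until_healthy are hypothetical names, and the 5-second request timeout and 2-second interval are assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """True if GET url returns HTTP 200; connection errors count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_healthy(probe: Callable[[], bool], timeout: float,
                       interval: float = 2.0) -> bool:
    """Poll probe() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

For the server in this guide the call would be wait_until_healthy(lambda: http_probe("http://127.0.0.1:8000/health"), timeout=900). Passing the probe as a callable keeps the loop testable without a running server.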

Crash monitor

If SGLang dies after startup, os._exit(1) forces Modal to recycle the container instead of serving 502s.

deploy.py

def _monitor_process(self):
    while True:
        time.sleep(10)
        if not self._startup_complete:
            continue
        if self.process is not None and self.process.poll() is not None:
            print(f"[monitor] FATAL: SGLang died (code {self.process.returncode})")
            time.sleep(1)
            os._exit(1)  # force Modal to recycle the container
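
The monitor's logic can be exercised locally by replacing os._exit with a callback. The sketch below is a simplified stand-in, not the method from deploy.py: monitor_process and the on_death callback are hypothetical, and the child is a Python one-liner that exits with code 3.

```python
import subprocess
import sys
import threading
import time
from typing import Callable


def monitor_process(process: subprocess.Popen, on_death: Callable[[int], None],
                    poll_interval: float = 10.0) -> None:
    """Block until process exits, then call on_death(returncode)."""
    while process.poll() is None:
        time.sleep(poll_interval)
    on_death(process.returncode)


# Demo: a child that exits with code 3 almost immediately.
child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
seen = []
watcher = threading.Thread(
    target=monitor_process, args=(child, seen.append, 0.05), daemon=True
)
watcher.start()
watcher.join(timeout=10)
print(seen)  # [3]
```

In the container the callback would be something like lambda code: os._exit(1), which ends the whole process from the daemon thread without waiting for other threads to finish.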

Local entrypoint

modal run deploy.py without a target chains download → compile → verify, which is handy when the laptop should not pull hundreds of GB locally.

deploy.py

@app.local_entrypoint()
def main():
    download_model.remote()
    compile_deepgemm.remote()
    ok = verify_setup.remote()

API Usage: client examples for every feature the endpoint exposes.

API usage examples

OpenAI SDK-compatible. These examples work with any OpenAI client library pointed at your Modal deployment URL.

Client setup

client.py

from openai import OpenAI

client = OpenAI(
    api_key="your-production-key",
    base_url="https://<your-app>.modal.run/v1",
)

Basic completion

basic.py

# Basic chat completion
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quicksort."},
    ],
)
print(response.choices[0].message.content)

Streaming

Tokens arrive as they're generated. Lower perceived latency for interactive UIs.

streaming.py

# Streaming response - tokens arrive as generated
stream = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks without choices before indexing into them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Thinking mode (chain-of-thought)

GLM-5.1 has thinking enabled by default. The chain-of-thought reasoning appears in a separate reasoning_content field — never mixed into content.

thinking.py

# Thinking mode (ON by default)
# Chain-of-thought appears in reasoning_content, never mixed into content
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Solve x² + 5x + 6 = 0"}],
)
print("Thinking:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

# Thinking OFF - direct answer, saves tokens
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

Streaming with reasoning

Reasoning streams first, then the final content. Useful for showing "thinking" indicators in UI.

stream_reasoning.py

# Streaming with reasoning - reasoning streams first, then content
for chunk in client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
    stream=True,
):
    if not chunk.choices:
        continue  # skip chunks without choices
    delta = chunk.choices[0].delta
    # Reasoning streams first (gray), then content
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"\033[90m{reasoning}\033[0m", end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
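
When the UI needs the two streams separately rather than printed inline, the deltas can be accumulated into two strings. The sketch below runs against fake chunk objects so it needs no server; collect_stream and fake_chunk are hypothetical helpers that model only the attributes read in the loop above.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> tuple[str, str]:
    """Return (reasoning, content) concatenated from streamed chat chunks."""
    reasoning, content = [], []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        piece = getattr(delta, "reasoning_content", None)
        if piece:
            reasoning.append(piece)
        if getattr(delta, "content", None):
            content.append(delta.content)
    return "".join(reasoning), "".join(content)


def fake_chunk(**delta):
    # Stand-in for a streamed chunk; only the attributes read above are modeled.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta))])


demo = [
    fake_chunk(reasoning_content="Assume sqrt(2) = p/q. "),
    fake_chunk(reasoning_content="Then p^2 = 2q^2.", content=None),
    SimpleNamespace(choices=[]),
    fake_chunk(content="sqrt(2) is "),
    fake_chunk(content="irrational."),
]
reasoning, answer = collect_stream(demo)
```

With a real client, the same function accepts the iterator returned by client.chat.completions.create(..., stream=True).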

Tool / function calling

Two-step flow: model requests a tool call, you execute it, then send the result back for the final response.

tools.py

# Tool/function calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# Step 1: Model decides to call a tool
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Tool: {tool_call.function.name}({tool_call.function.arguments})")

# Step 2: Execute the tool and send result back
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    response.choices[0].message,
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": '{"temperature": 22, "condition": "sunny"}',
    },
]
final = client.chat.completions.create(model="glm-5.1", messages=messages, tools=tools)
print(final.choices[0].message.content)
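
The example above hard-codes the tool result. In practice, step 2 usually dispatches on the tool name and parses the JSON arguments. A minimal sketch follows, assuming a local registry of Python callables; TOOL_REGISTRY, run_tool_call, and the get_weather implementation are hypothetical, and the tool call is faked so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str, unit: str = "celsius") -> dict:
    # Hypothetical local implementation; a real one would call a weather API.
    return {"city": city, "temperature": 22, "unit": unit, "condition": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and return the role="tool" message for it."""
    name = tool_call.function.name
    try:
        args = json.loads(tool_call.function.arguments or "{}")
        result = TOOL_REGISTRY[name](**args)
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report failures to the model instead of raising, so it can recover.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}


# Stand-in for response.choices[0].message.tool_calls[0]
call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
message = run_tool_call(call)
```

The returned dict slots into the messages list in place of the hand-written tool message, one entry per element of tool_calls.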

JSON mode

Force structured JSON output for downstream parsing.

json_mode.py

# JSON mode - structured output
import json

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{
        "role": "user",
        "content": "List 3 planets as JSON with name and diameter_km."
    }],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))
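
JSON mode constrains syntax, not shape, so downstream code still benefits from a defensive parse. The sketch below assumes the model may wrap the list in a single-key object; parse_planets and the sample string are hypothetical and only illustrate the check.

```python
import json


def parse_planets(raw: str) -> list[dict]:
    """Parse model output and return planet records with the expected keys."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    # Assumption: a list may arrive wrapped in an object under some key.
    if isinstance(data, dict):
        lists = [v for v in data.values() if isinstance(v, list)]
        if len(lists) != 1:
            raise ValueError("expected exactly one list in the JSON object")
        data = lists[0]
    if not isinstance(data, list):
        raise ValueError("expected a list of planets")
    for item in data:
        if not isinstance(item, dict) or not {"name", "diameter_km"} <= item.keys():
            raise ValueError(f"unexpected planet record: {item!r}")
    return data


sample = '{"planets": [{"name": "Mars", "diameter_km": 6779}]}'
planets = parse_planets(sample)
```

Naming the expected keys in the prompt, as the example above does, keeps this kind of check short.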

cURL example

curl.sh

# cURL example
curl https://<your-app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Available endpoints

The deployment exposes these routes on the same host. All are OpenAI-compatible where applicable.

Endpoint	Method	Description
/v1/chat/completions	POST	Chat completions (streaming + non-streaming)
/v1/completions	POST	Legacy text completions
/v1/models	GET	List available models
/health	GET	Health check (200 when ready)
/metrics	GET	Prometheus metrics
