> **Note:** The canonical experience is the interactive HTML tab: [Code Walkthrough](https://www.quantml.org/guides/glm-5-1-fp8/code). This file is a text mirror for search engines and AI tools.

# GLM-5.1 FP8 - Code Walkthrough

Annotated excerpts from `deploy.py`: volumes, launch command, warmup, and the Modal `Server` lifecycle. Pair with [Configuration & Flags](https://www.quantml.org/guides/glm-5-1-fp8/configuration) for the full matrix and [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations) for operational context.

---

## 1. Three lifecycle phases {#lifecycle}

The deployment separates cheap CPU work, one-time GPU compilation, and long-running GPU serving, so cold starts do not repeat the network download or the JIT compilation.

| Phase | Compute / image | Volumes / port | Purpose |
|---|---|---|---|
| `download_model()` | CPU debian-slim | modal.Volume /model-cache | Idempotent HF snapshot_download (~700 GB) |
| `compile_deepgemm()` | 8×B200 + SGLang image | Both volumes mounted | sglang.compile_deep_gemm JIT → cache in /dg-cache |
| `Server` | 8×B200 + SGLang image | Web server port 8000 | Subprocess launch, health, warmup, crash monitor, graceful shutdown |

---

## 2. Volumes & mounts {#volumes}

Mount points are fixed so that `MODEL_PATH` and the DeepGEMM cache resolve to the same paths at compile time and at serve time. Always `reload()` both volumes in GPU entrypoints.

```python
model_volume = modal.Volume.from_name("glm51-model-weights", create_if_missing=True)
dg_volume = modal.Volume.from_name("glm51-deepgemm-cache", create_if_missing=True)

VOLUME_MOUNTS = {
    "/model-cache": model_volume,
    "/dg-cache": dg_volume,
}
```

- `Volume.from_name` — Create-or-bind durable NFS-like storage shared by all containers.
- `VOLUME_MOUNTS` — Injected into @app.function / @app.cls, with the same paths at compile and serve time.

---

## 3. Building the SGLang command {#build-cmd}

Centralizes bug mitigations (#21291 TRT-LLM backends, #22359 BF16 KV by omission) and performance defaults.

```python
def _build_sglang_cmd(api_key: str = "") -> list[str]:
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--served-model-name", MODEL_NAME,
        "--tp", str(GPU_COUNT),
        "--host", "0.0.0.0",
        "--port", str(PORT),
        "--trust-remote-code",
        "--ep", "1",
        "--attention-backend", "nsa",
        "--nsa-decode-backend", "trtllm",
        "--nsa-prefill-backend", "trtllm",
        "--moe-runner-backend", "flashinfer_trtllm",
        "--enable-flashinfer-allreduce-fusion",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
        "--reasoning-parser", "glm45",
        "--tool-call-parser", "glm47",
        "--mem-fraction-static", "0.88",
        "--max-running-requests", "48",
        "--max-prefill-tokens", "32768",
        "--max-total-tokens", "65536",
        "--watchdog-timeout", "1200",
        "--enable-metrics",
    ]
    if api_key:
        cmd.extend(["--api-key", api_key])
    return cmd
```

- `flashinfer_trtllm` — MoE routing path aligned with TRT-LLM + FlashInfer fusion.
- `--watchdog-timeout` — 20 minutes so weight staging cannot trip the internal watchdog.

### Argument cluster reference

| Argument cluster | Purpose |
|---|---|
| `--model-path /model-cache/GLM-5.1-FP8` | Volume-mounted weights |
| `--served-model-name glm-5.1` | OpenAI model field |
| `--tp 8` | Tensor parallel = GPU count |
| `--trust-remote-code` | Load custom model code from repo |
| `--attention-backend nsa` | Blackwell NSA path |
| `--nsa-decode-backend / --nsa-prefill-backend trtllm` | Stable decode on B200 (#21291) |
| `--moe-runner-backend flashinfer_trtllm` | MoE routing on TRT-LLM path |
| `--enable-flashinfer-allreduce-fusion` | Fuse TP all-reduce with the residual add + RMSNorm that follows it |
| `--speculative-algorithm EAGLE (+ steps/topk/draft tokens)` | Speculative decoding |
| `--reasoning-parser glm45 / --tool-call-parser glm47` | Parse GLM output into `reasoning_content` and `tool_calls` |
| No --kv-cache-dtype fp8 | Default BF16 KV — mitigates #22359, #17526 |
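
Because the BF16 KV mitigation works by leaving a flag out, it is easy to undo by accident. The sketch below is one possible guard, not part of `deploy.py`: `flags_to_dict` and `check_cmd` are hypothetical helper names, and `sample_cmd` is an abbreviated stand-in for the list returned by `_build_sglang_cmd()`.

```python
def flags_to_dict(cmd: list[str]) -> dict[str, object]:
    """Map each --flag in an argv list to its value (True for bare switches)."""
    flags: dict[str, object] = {}
    i = 0
    while i < len(cmd):
        token = cmd[i]
        if token.startswith("--"):
            has_value = i + 1 < len(cmd) and not cmd[i + 1].startswith("--")
            flags[token] = cmd[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # skip the interpreter and module tokens
    return flags


def check_cmd(cmd: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the guard passed."""
    flags = flags_to_dict(cmd)
    problems = []
    if "--kv-cache-dtype" in flags:
        problems.append("--kv-cache-dtype is set; this guide relies on the BF16 default")
    for name in ("--nsa-decode-backend", "--nsa-prefill-backend"):
        if flags.get(name) != "trtllm":
            problems.append(f"{name} should be trtllm")
    return problems


# Abbreviated stand-in for the list returned by _build_sglang_cmd().
sample_cmd = [
    "python3", "-m", "sglang.launch_server",
    "--tp", "8",
    "--trust-remote-code",
    "--nsa-decode-backend", "trtllm",
    "--nsa-prefill-backend", "trtllm",
]
print(check_cmd(sample_cmd))
```

Running a check like this in a unit test or at the top of `serve()` would catch a reintroduced `--kv-cache-dtype` before any GPU time is spent.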

---

## 4. Warmup & CUDA graphs {#warmup}

Hits the chat, no-thinking, long-context, and tool-calling paths so the first customer request does not pay the graph-capture latency itself.

```python
def _warmup(port: int, api_key: str = ""):
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"http://127.0.0.1:{port}/v1/chat/completions"

    payloads = [
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32},
        {"model": MODEL_NAME, "messages": [{"role": "user", "content": "Explain quicksort in one sentence."}],
         "max_tokens": 64, "chat_template_kwargs": {"enable_thinking": False}},
        # ... long-context and tool-calling payloads ...
    ]
    for i, payload in enumerate(payloads):
        r = requests.post(url, headers=headers, json=payload, timeout=300)
        # validates 200 + non-empty assistant message / tool_calls / reasoning_content
```

- `chat_template_kwargs` — Disables thinking template branch (separate CUDA graph).
- `payloads =` — Four requests mirror production modalities: standard chat, thinking disabled, long context, and tool calling.
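
The excerpt only describes its validation step in a comment. A minimal sketch of that check might look like the following, assuming the response follows the OpenAI chat-completions shape (`choices[0].message`). The function name `is_valid_completion` is hypothetical and does not come from `deploy.py`.

```python
def is_valid_completion(status_code: int, body: dict) -> bool:
    """True when the response is a 200 carrying a usable assistant message."""
    if status_code != 200:
        return False
    choices = body.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    # Any one of these fields counts as evidence that the path executed.
    return any(message.get(key) for key in ("content", "tool_calls", "reasoning_content"))


# Example with a hand-written response body (not real server output).
example_body = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(is_valid_completion(200, example_body))
```

Inside the warmup loop this would be called as `is_valid_completion(r.status_code, r.json())`, raising or logging when it returns `False`.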

---

## 5. download_model() {#download}

A CPU-only Modal function keeps the hourly cost low while pulling ~700 GB. The sentinel guard makes the job safe to re-run.

```python
@app.function(
    image=_download_image,
    volumes={"/model-cache": model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=7200,
    cpu=4,
    memory=32768,
)
def download_model():
    sentinel = os.path.join(MODEL_PATH, WEIGHT_SENTINEL)
    if os.path.exists(sentinel):
        print("[download] Already cached. Skipping.")
        return
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_REPO, local_dir=MODEL_PATH, max_workers=8)
    model_volume.commit()
```

- `_download_image` — Debian slim + huggingface-hub[hf_xet]; no CUDA pull for I/O.
- `WEIGHT_SENTINEL` — model.safetensors.index.json proves HF snapshot layout landed.
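
The sentinel confirms that the index file exists, not that every shard it names is on disk. If a stricter check is wanted, one option is to read the `weight_map` in `model.safetensors.index.json` and look for each shard file. This is a sketch under that assumption, and `missing_shards` is a hypothetical helper that `deploy.py` does not necessarily contain.

```python
import json
import os


def missing_shards(model_dir: str, index_name: str = "model.safetensors.index.json") -> list[str]:
    """Return shard filenames listed in the index that are absent from model_dir."""
    index_path = os.path.join(model_dir, index_name)
    with open(index_path, encoding="utf-8") as fh:
        weight_map = json.load(fh)["weight_map"]  # tensor name -> shard filename
    shard_names = sorted(set(weight_map.values()))
    return [name for name in shard_names if not os.path.exists(os.path.join(model_dir, name))]
```

A caller could run `missing_shards(MODEL_PATH)` after the sentinel check and trigger a fresh download when the returned list is non-empty.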

---

## 6. compile_deepgemm() {#compile}

Runs `python3 -m sglang.compile_deep_gemm` with TP identical to serving. Streams stdout line-by-line for Modal logs.

```python
process = subprocess.Popen(
    ["python3", "-m", "sglang.compile_deep_gemm",
     "--model", MODEL_PATH, "--tp", str(GPU_COUNT)],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    text=True, bufsize=1,
)
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line)
```

- `text=True` — Line-buffered text mode, required for bufsize=1.
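
The excerpt stops at the streaming loop. A caller usually also needs the exit status to decide whether the cache is usable. The sketch below wraps the same `Popen` pattern in a hypothetical `run_streaming` helper that returns the exit code. It assumes nothing about how `deploy.py` itself handles the result.

```python
import subprocess
import sys


def run_streaming(cmd: list[str]) -> int:
    """Run cmd, echo its combined stdout/stderr line by line, return the exit code."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, bufsize=1,  # bufsize=1 means line buffering, which needs text mode
    )
    for line in iter(process.stdout.readline, ""):
        sys.stdout.write(line)
        sys.stdout.flush()
    # readline returned "" so the pipe is closed; wait() collects the exit status.
    return process.wait()
```

For the compile step this could be used as `code = run_streaming(["python3", "-m", "sglang.compile_deep_gemm", "--model", MODEL_PATH, "--tp", str(GPU_COUNT)])`, followed by raising an error when `code` is non-zero.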

---

## 7. Server class: setup, serve, shutdown {#server}

`@modal.web_server` routes traffic to port 8000. `serve()` starts the log stream, waits for health, runs warmup, and starts the crash monitor only after startup succeeds.

```python
@app.cls(
    gpu=f"{GPU_TYPE}:{GPU_COUNT}",
    volumes=VOLUME_MOUNTS,
    secrets=[modal.Secret.from_name("glm51-api-key")],
    timeout=86400,
    min_containers=0,
    max_containers=3,
    scaledown_window=900,
)
@modal.concurrent(max_inputs=20)
class Server:
    process: subprocess.Popen | None = None
    _startup_complete: bool = False

    @modal.enter()
    def setup(self):
        model_volume.reload()
        dg_volume.reload()
        # sentinel + optional DeepGEMM marker checks ...

    @modal.web_server(port=PORT, startup_timeout=900)
    def serve(self):
        api_key = os.environ.get("API_KEY", "")
        self.process = subprocess.Popen(
            _build_sglang_cmd(api_key),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
            env={**os.environ, "HF_HUB_OFFLINE": "1"},
        )
        threading.Thread(target=_stream_subprocess_logs, args=(self.process.stdout,), daemon=True).start()
        if not _wait_for_health(PORT, timeout=900):
            raise TimeoutError("SGLang startup timeout")
        _warmup(PORT, api_key)
        self._startup_complete = True
        threading.Thread(target=self._monitor_process, daemon=True).start()

    @modal.exit()
    def shutdown(self):
        if self.process and self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=30)
            except subprocess.TimeoutExpired:
                self.process.kill()
```

- `HF_HUB_OFFLINE` — Serving must not silently pull weights from Hugging Face; the volume is source of truth.
- `_wait_for_health` — 900s budget for multi-minute load before failing fast with logs.
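
`_wait_for_health` is called above but not shown. One plausible shape is a polling loop against the `/health` route with a monotonic deadline. The version below is an illustrative sketch using only the standard library. The name `wait_for_health`, the `interval` parameter, and the 5-second per-request timeout are assumptions, not the author's implementation.

```python
import http.client
import time


def wait_for_health(port: int, timeout: float = 900.0, interval: float = 5.0) -> bool:
    """Poll GET /health on localhost until it returns 200 or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        try:
            conn.request("GET", "/health")
            if conn.getresponse().status == 200:
                return True
        except (OSError, http.client.HTTPException):
            pass  # server is not accepting connections yet, or replied badly
        finally:
            conn.close()
        time.sleep(interval)
    return False
```

Using `time.monotonic()` keeps the budget immune to wall-clock adjustments during a multi-minute model load.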

---

## 8. Crash monitor {#monitor}

If SGLang dies after startup, `os._exit(1)` forces Modal to recycle the container instead of serving 502s.

```python
def _monitor_process(self):
    while True:
        time.sleep(10)
        if not self._startup_complete:
            continue
        if self.process is not None and self.process.poll() is not None:
            print(f"[monitor] FATAL: SGLang died (code {self.process.returncode})")
            time.sleep(1)
            os._exit(1)  # force Modal to recycle the container
```

- `os._exit(1)` — sys.exit in a daemon thread would not tear down the whole worker; os._exit does.
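
The difference is easy to confirm in isolation. In this small demonstration (not part of `deploy.py`), `sys.exit(1)` inside a thread raises `SystemExit` in that thread only, so the main thread keeps running:

```python
import sys
import threading


def worker():
    sys.exit(1)  # raises SystemExit in this thread only


t = threading.Thread(target=worker, daemon=True)
t.start()
t.join()

# Reached because the interpreter was not torn down by the thread's sys.exit.
still_running = True
print("main thread still running:", still_running)
```

`os._exit(1)` ends the whole process immediately and skips cleanup handlers, which is why the monitor prints its message and sleeps briefly before calling it.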

---

## 9. Local entrypoint {#entrypoint}

`modal run deploy.py` without a target chains download → compile → verify, which is handy when the laptop should not pull hundreds of GB locally.

```python
@app.local_entrypoint()
def main():
    download_model.remote()
    compile_deepgemm.remote()
    ok = verify_setup.remote()
```

---

## 10. API usage examples {#api-usage}

OpenAI SDK-compatible. These examples work with any OpenAI client library pointed at your Modal deployment URL.

### Client setup

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-production-key",
    base_url="https://<your-app>.modal.run/v1",
)
```

### Basic completion

```python
# Basic chat completion
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quicksort."},
    ],
)
print(response.choices[0].message.content)
```

### Streaming

Tokens arrive as they're generated. Lower perceived latency for interactive UIs.

```python
# Streaming response - tokens arrive as generated
stream = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (for example a trailing usage chunk) carry no choices.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Thinking mode (chain-of-thought)

GLM-5.1 has thinking enabled by default. The chain-of-thought reasoning appears in a separate `reasoning_content` field — never mixed into `content`.

```python
# Thinking mode (ON by default)
# Chain-of-thought appears in reasoning_content, never mixed into content
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Solve x² + 5x + 6 = 0"}],
)
print("Thinking:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

# Thinking OFF - direct answer, saves tokens
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
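
When many call sites need to toggle thinking, the request arguments can be built in one place. The helper below is a hypothetical convenience wrapper, not part of the guide's code. It only assembles the keyword arguments shown above.

```python
def chat_kwargs(prompt: str, thinking: bool = True, model: str = "glm-5.1") -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if not thinking:
        # Thinking is on by default, so only the OFF case needs extra_body.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs


print(chat_kwargs("What is 2+2?", thinking=False))
```

A call then reads `client.chat.completions.create(**chat_kwargs("What is 2+2?", thinking=False))`.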

### Streaming with reasoning

Reasoning streams first, then the final content. Useful for showing "thinking" indicators in UI.

```python
# Streaming with reasoning - reasoning streams first, then content
for chunk in client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
    stream=True,
):
    if not chunk.choices:
        continue  # e.g. a usage-only chunk with no delta
    delta = chunk.choices[0].delta
    # Reasoning streams first (gray), then content
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"\033[90m{reasoning}\033[0m", end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```

### Tool / function calling

Two-step flow: model requests a tool call, you execute it, then send the result back for the final response.

```python
# Tool/function calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# Step 1: Model decides to call a tool
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Tool: {tool_call.function.name}({tool_call.function.arguments})")

# Step 2: Execute the tool and send result back
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    response.choices[0].message,
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": '{"temperature": 22, "condition": "sunny"}',
    },
]
final = client.chat.completions.create(model="glm-5.1", messages=messages, tools=tools)
print(final.choices[0].message.content)
```
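
Step 2 above hard-codes the tool result. In practice the arguments arrive as a JSON string and have to be parsed and routed to a real function. The following is a minimal dispatch sketch, assuming tool-call objects with `id`, `function.name`, and `function.arguments` as in the example. `TOOL_REGISTRY`, `run_tool_calls`, and the stub `get_weather` are hypothetical.

```python
import json
from types import SimpleNamespace


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical stub; a real implementation would call a weather service."""
    return {"city": city, "temperature": 22, "unit": unit, "condition": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each requested tool and return 'tool' role messages for step 2."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when no tool was requested
        try:
            func = TOOL_REGISTRY.get(call.function.name)
            if func is None:
                raise ValueError(f"unknown tool: {call.function.name}")
            args = json.loads(call.function.arguments or "{}")
            result = func(**args)
        except (ValueError, TypeError) as exc:
            # Bad JSON, unknown tool, or wrong arguments: report back to the model.
            result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
print(run_tool_calls([fake_call]))
```

Returning errors as tool content, instead of raising, lets the model see the failure and either retry with corrected arguments or explain the problem to the user.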

### JSON mode

Force structured JSON output for downstream parsing.

```python
# JSON mode - structured output
import json

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{
        "role": "user",
        "content": "List 3 planets as JSON with name and diameter_km."
    }],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))
```
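
Even in JSON mode a reply can be empty or cut off by a token limit, in which case `json.loads` raises. A defensive parse, sketched here with a hypothetical helper name `parse_json_reply`, lets the caller retry or fall back instead of crashing:

```python
import json


def parse_json_reply(text):
    """Return the parsed JSON value, or None when the reply is empty or malformed."""
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(parse_json_reply('{"planets": [{"name": "Mars", "diameter_km": 6779}]}'))
print(parse_json_reply('{"planets": [{"name": "Ma'))  # truncated reply
```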

### cURL example

```bash
# cURL example
curl https://<your-app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

---

## 11. Available endpoints {#api-endpoints}

The deployment exposes these routes on the same host. All are OpenAI-compatible where applicable.

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions (streaming + non-streaming) |
| `/v1/completions` | POST | Legacy text completions |
| `/v1/models` | GET | List available models |
| `/health` | GET | Health check (200 when ready) |
| `/metrics` | GET | Prometheus metrics |

---

## Related sections

- [Overview & Architecture](https://www.quantml.org/guides/glm-5-1-fp8)
- [Deployment Pipeline](https://www.quantml.org/guides/glm-5-1-fp8/deployment)
- [Configuration & Flags](https://www.quantml.org/guides/glm-5-1-fp8/configuration)
- [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations)
