> **Note:** The canonical experience is the interactive HTML tab: [Configuration & Flags](https://www.quantml.org/guides/glm-5-1-fp8/configuration). This file is a text mirror for search engines and AI tools.

# GLM-5.1 FP8 - Configuration & Flags

Every flag below is wired through `_build_sglang_cmd()` and Modal decorators in `deploy.py`. Values mirror the production v4 script: Blackwell backends, EAGLE speculative decoding, BF16 KV cache (no FP8 KV dtype), and extended watchdog for multi-minute weight loads.

Rationale for tuning (EAGLE trade-offs, BF16 KV, cold start) lives in [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations).

---

## 1. Top-level Python constants {#python-constants}

Tune these before deploy, especially keep-alive cron flags if you use the optional /health ping.

| Constant | Value | Note |
|---|---|---|
| `MODEL_REPO` | `zai-org/GLM-5.1-FP8` | Hugging Face model id |
| `MODEL_NAME` | `glm-5.1` | Served model id in API |
| `GPU_TYPE / GPU_COUNT` | `B200 / 8` | Must match DeepGEMM compile hardware |
| `ENABLE_KEEPALIVE_CRON` | `False` | Set True + DEPLOYED_URL for business-hours ping |
| `DEPLOYED_URL` | `""` | Base URL for keep_warm cron /health |

---

## 2. SGLang serving flags {#sglang-flags}

Passed to `python3 -m sglang.launch_server`, grouped the same way they read in the launch command.

### Model loading & parallelism

| Flag | Value | Rationale |
|---|---|---|
| `--model-path` | `/model-cache/GLM-5.1-FP8` | Modal volume mount path where snapshot_download wrote weights. |
| `--served-model-name` | `glm-5.1` | OpenAI-compatible model id for /v1/models and chat requests. |
| `--tp` | `8` | Tensor parallel degree — shards the MoE across all B200s; EP stays 1. |
| `--trust-remote-code` | enabled | Required for newer GLM architecture code in the HF repo. |
| `--ep` | `1` | Expert parallelism disabled at EP>1 because TP fits full model in VRAM. |
| `--watchdog-timeout` | `1200` | 20 minutes — default watchdog can kill during 5–7 min weight load from volume. |

### Inference backends (Blackwell)

| Flag | Value | Rationale |
|---|---|---|
| `--attention-backend` | `nsa` | NVIDIA sparse attention path tuned for long context on Blackwell. |
| `--nsa-decode-backend / --nsa-prefill-backend` | `trtllm` | Consistent TRT-LLM kernels across prefill/decode — avoids flashmla accuracy issues on B200. |
| `--moe-runner-backend` | `flashinfer_trtllm` | MoE expert routing through FlashInfer + TRT-LLM integration. |
| `--enable-flashinfer-allreduce-fusion` | `True` | Fuses tensor-parallel all-reduce with attention to cut comm overhead. |

### Speculative decoding (EAGLE v2)

| Flag | Value | Rationale |
|---|---|---|
| `--speculative-algorithm` | `EAGLE` | Uses built-in MTP head for draft tokens. |
| `--speculative-num-steps` | `3` | Draft depth — balances acceptance vs verify cost. |
| `--speculative-eagle-topk` | `1` | Top-1 draft per step for high acceptance rate. |
| `--speculative-num-draft-tokens` | `4` | Draft window size k — main model verifies in one forward where possible. |

### Reasoning & tool parsers

| Flag | Value | Rationale |
|---|---|---|
| `--reasoning-parser` | `glm45` | Splits chain-of-thought into reasoning_content in OpenAI-compatible payloads. |
| `--tool-call-parser` | `glm47` | Deterministic tool-call formatting for GLM-5.1 templates. |

### KV cache & throughput

| Flag | Value | Rationale |
|---|---|---|
| `--mem-fraction-static` | `0.88` | Aggressive KV reservation after weights — safe on 192 GB HBM3e per GPU. |
| `--max-running-requests` | `48` | Beyond this, EAGLE verifier queues grow and TTFT spikes under load. |
| `--max-prefill-tokens` | `32768` | Caps a single prefill batch to reduce prefill storms / OOM risk. |
| `--max-total-tokens` | `65536` | Per-request token ceiling so one client cannot monopolize KV. |
| `--kv-cache-dtype fp8` | **(omitted)** | Intentionally not set — default BF16 KV mitigates #22359, #17526, #21291 class issues. |

### Stability & observability

| Flag | Value | Rationale |
|---|---|---|
| `--enable-metrics` | `True` | Surfaces Prometheus metrics for latency, KV usage, and engine health. |
| `--api-key` | from API_KEY secret (optional in cmd) | When set, SGLang requires Authorization on inference routes. |

---

## 3. Modal infrastructure {#modal-infra}

Compute, timeouts, secrets, and image environment variables.

### Compute & scaling (@app.cls)

| Parameter | Value | Rationale |
|---|---|---|
| `gpu` | `B200:8` | Must match DeepGEMM compile job — SM100-specific binaries. |
| `min_containers` | `0` | $0 at idle; first request pays cold start. |
| `max_containers` | `3` | Hard cap on spend parallelism; 3×48 max running requests. |
| `scaledown_window` | `900` | 15 min keep-warm after last request for bursty traffic. |

### Timeouts

| Scope | Value | Rationale |
|---|---|---|
| Server @app.cls timeout | `86400` | 24h max container life — limits fragmentation / leaks. |
| @modal.web_server startup_timeout | `900` | 15 min for health to pass after process start. |
| download_model timeout | `7200` | 2h ceiling for 700 GB pull. |
| compile_deepgemm timeout | `3600` | 1h ceiling for compile + streaming logs. |

### Secrets

| Secret name | Keys | Used by |
|---|---|---|
| `huggingface-secret` | `HF_TOKEN` | download_model (CPU function) |
| `glm51-api-key` | `API_KEY` | Server class — passed into SGLang --api-key when set |

### Image environment variables

| Variable | Value | Where | Why |
|---|---|---|---|
| `HF_XET_HIGH_PERFORMANCE` | `1` | Download + SGLang images | Faster chunked transfers for huge HF artifacts. |
| `SGLANG_ENABLE_SPEC_V2` | `1` | SGLang image | Enables latest speculative decoding pipeline (EAGLE v2). |
| `SGLANG_DG_CACHE_DIR` | `/dg-cache` | SGLang image | DeepGEMM writes compiled kernels into the mounted volume. |
| `HF_HUB_OFFLINE` | `1` | Subprocess env in serve() | Serving should read weights only from volume — no runtime hub fetch. |

---

## 4. Known issues & mitigations {#known-issues}

Cross-check with upstream issues before changing KV dtype or NSA backends.

| Issue | Symptom | Our fix |
|---|---|---|
| [#22359](https://github.com/sgl-project/sglang/issues/22359) | EAGLE + FP8 KV Unsupported h_q / crash on Blackwell | Do not pass fp8 KV dtype — keep BF16 KV default. |
| [#21291](https://github.com/sgl-project/sglang/issues/21291) | flashmla_kv decode accuracy regression on B200 | Force TRT-LLM backends for NSA decode and prefill. |
| [#17526](https://github.com/sgl-project/sglang/issues/17526) | FP8 KV slower than BF16 due to quant overhead | BF16 KV default. |
| DeepGEMM JIT | 10–15 min blocked compile on cold start if not pre-cached | Run compile_deepgemm once on B200; persist to glm51-deepgemm-cache. |
| Watchdog default | Process killed mid weight load | --watchdog-timeout 1200 in launch command. |

---

## Next steps

- [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations) covers EAGLE trade-offs, BF16 KV rationale, warmup & keep-alive, and diagnostics.
- [Code Walkthrough](https://www.quantml.org/guides/glm-5-1-fp8/code) shows how flags are assembled and how subprocess lifecycle is managed.

---

## 5. References {#refs}

- [SGLang server arguments](https://docs.sglang.ai/backend/server_arguments.html)
- [SGLang GLM-5.1 cookbook](https://cookbook.sglang.io/autoregressive/GLM/GLM-5.1)
- [Modal - serve & scale](https://modal.com/docs/guide/serve?utm_source=quantml.org)
- [Modal - volumes](https://modal.com/docs/guide/volumes?utm_source=quantml.org)
- [Hugging Face - hf-xet](https://huggingface.co/blog/hf-xet)

---

## Related sections

- [Overview & Architecture](https://www.quantml.org/guides/glm-5-1-fp8)
- [Deployment Pipeline](https://www.quantml.org/guides/glm-5-1-fp8/deployment)
- [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations)
- [Code Walkthrough](https://www.quantml.org/guides/glm-5-1-fp8/code)
