
GLM-5.1 FP8 on Modal

Configuration & Flags

SGLang flags, API secrets, and Modal decorators.

Every flag below is wired through _build_sglang_cmd() and Modal decorators in deploy.py. Values mirror the production v4 script: Blackwell backends, EAGLE speculative decoding, BF16 KV cache (no FP8 KV dtype), and extended watchdog for multi-minute weight loads.

Rationale for tuning (EAGLE trade-offs, BF16 KV, cold start) lives in Tune & Operate.

01 · Top-level Python constants

Tune these before deploying, especially the keep-alive cron flags if you use the optional `/health` ping.

| Constant | Value | Note |
| --- | --- | --- |
| `MODEL_REPO` | `zai-org/GLM-5.1-FP8` | Hugging Face model id |
| `MODEL_NAME` | `glm-5.1` | Served model id in the API |
| `GPU_TYPE` / `GPU_COUNT` | `B200` / `8` | Must match the DeepGEMM compile hardware |
| `ENABLE_KEEPALIVE_CRON` | `False` | Set to `True` (and fill in `DEPLOYED_URL`) for the business-hours ping |
| `DEPLOYED_URL` | `""` | Base URL the `keep_warm` cron pings at `/health` |
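In `deploy.py` these constants could sit at the top of the module, roughly like the sketch below. The names and values mirror the table; the comments and example URL shape are illustrative, not taken from the actual script.

```python
# Top-level constants (sketch; values mirror the table above).
MODEL_REPO = "zai-org/GLM-5.1-FP8"   # Hugging Face model id
MODEL_NAME = "glm-5.1"               # served model id in the API
GPU_TYPE = "B200"
GPU_COUNT = 8                        # must match DeepGEMM compile hardware

# Optional business-hours keep-alive ping against /health.
ENABLE_KEEPALIVE_CRON = False
DEPLOYED_URL = ""                    # base URL of the deployed app when enabled
```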

02 · SGLang serving flags

Passed to `python3 -m sglang.launch_server`, grouped the same way they read in the launch command.

Model loading & parallelism

| Flag | Value | Rationale |
| --- | --- | --- |
| `--model-path` | `/model-cache/GLM-5.1-FP8` | Modal volume mount path where `snapshot_download` wrote the weights. |
| `--served-model-name` | `glm-5.1` | OpenAI-compatible model id for `/v1/models` and chat requests. |
| `--tp` | `8` | Tensor-parallel degree; shards the MoE across all eight B200s while EP stays 1. |
| `--trust-remote-code` | enabled | Required for the newer GLM architecture code in the HF repo. |
| `--ep` | `1` | Expert parallelism stays disabled because TP alone fits the full model in VRAM. |
| `--watchdog-timeout` | `1200` | 20 minutes; the default watchdog can kill the server during the 5–7 min weight load from the volume. |

Inference backends (Blackwell)

| Flag | Value | Rationale |
| --- | --- | --- |
| `--attention-backend` | `nsa` | NVIDIA sparse attention path tuned for long context on Blackwell. |
| `--nsa-decode-backend` / `--nsa-prefill-backend` | `trtllm` | Consistent TRT-LLM kernels across prefill and decode; avoids flashmla accuracy issues on B200. |
| `--moe-runner-backend` | `flashinfer_trtllm` | MoE expert routing through the FlashInfer + TRT-LLM integration. |
| `--enable-flashinfer-allreduce-fusion` | `True` | Fuses the tensor-parallel all-reduce with attention to cut communication overhead. |

Speculative decoding (EAGLE v2)

| Flag | Value | Rationale |
| --- | --- | --- |
| `--speculative-algorithm` | `EAGLE` | Uses the model's built-in MTP head for draft tokens. |
| `--speculative-num-steps` | `3` | Draft depth; balances acceptance rate against verification cost. |
| `--speculative-eagle-topk` | `1` | Top-1 draft per step for a high acceptance rate. |
| `--speculative-num-draft-tokens` | `4` | Draft window size k; the main model verifies it in one forward pass where possible. |
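The acceptance-vs-verify trade-off can be sanity-checked with the standard back-of-envelope model for greedy speculative decoding: with per-token acceptance rate α and a draft window of k tokens, each verification pass emits on average (1 − α^(k+1)) / (1 − α) tokens. The α value below is purely illustrative, not a measured number for this deployment.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass: the length-k draft
    chain is accepted prefix-wise with per-token acceptance rate alpha,
    plus one bonus token from the verifier itself."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With --speculative-num-draft-tokens 4 and a hypothetical 80% acceptance
# rate, each full-model forward yields roughly 3.36 tokens instead of 1.
speedup_estimate = expected_tokens_per_verify(0.8, 4)
```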

Reasoning & tool parsers

| Flag | Value | Rationale |
| --- | --- | --- |
| `--reasoning-parser` | `glm45` | Splits chain-of-thought into `reasoning_content` in OpenAI-compatible payloads. |
| `--tool-call-parser` | `glm47` | Deterministic tool-call formatting for GLM-5.1 templates. |

KV cache & throughput

| Flag | Value | Rationale |
| --- | --- | --- |
| `--mem-fraction-static` | `0.88` | Aggressive KV reservation after weights; safe with 192 GB HBM3e per GPU. |
| `--max-running-requests` | `48` | Beyond this, EAGLE verifier queues grow and TTFT spikes under load. |
| `--max-prefill-tokens` | `32768` | Caps a single prefill batch to reduce prefill storms and OOM risk. |
| `--max-total-tokens` | `65536` | Per-request token ceiling so one client cannot monopolize the KV cache. |
| `--kv-cache-dtype fp8` | (omitted) | Intentionally not set; the default BF16 KV cache avoids the #22359, #17526, and #21291 class of issues. |
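`--mem-fraction-static` tells SGLang what fraction of each GPU's memory to claim for static allocations (the weight shard plus the KV pool), leaving the remainder as headroom for activations and CUDA graphs. A quick arithmetic check of what 0.88 reserves on this hardware:

```python
# Budget check for --mem-fraction-static 0.88 on 8x B200 (192 GB HBM3e each).
HBM_PER_GPU_GB = 192
MEM_FRACTION_STATIC = 0.88
GPU_COUNT = 8

reserved_per_gpu_gb = MEM_FRACTION_STATIC * HBM_PER_GPU_GB  # ~168.96 GB/GPU
reserved_total_gb = reserved_per_gpu_gb * GPU_COUNT         # ~1351.68 GB total
headroom_per_gpu_gb = HBM_PER_GPU_GB - reserved_per_gpu_gb  # ~23.04 GB/GPU
```

Whatever the FP8 weight shard does not consume out of the reserved slice becomes KV cache; the ~23 GB per GPU left outside it absorbs activation spikes.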

Stability & observability

| Flag | Value | Rationale |
| --- | --- | --- |
| `--enable-metrics` | `True` | Surfaces Prometheus metrics for latency, KV usage, and engine health. |
| `--api-key` | from the `API_KEY` secret (optional in the command) | When set, SGLang requires an `Authorization` header on inference routes. |
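When `--api-key` is set, clients must present the key as a bearer token. A minimal stdlib-only sketch of a compliant request; the base URL and key value are placeholders, not the real deployment's:

```python
import json
import urllib.request

API_KEY = "sk-example"                         # placeholder for the API_KEY secret value
BASE_URL = "https://example--glm51.modal.run"  # placeholder deployment URL

payload = {
    "model": "glm-5.1",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",  # required when --api-key is set
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```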

04 · Known issues & mitigations

Cross-check with upstream issues before changing KV dtype or NSA backends.

| Issue | Symptom | Our fix |
| --- | --- | --- |
| #22359 | EAGLE + FP8 KV: `Unsupported h_q` crash on Blackwell | Do not pass an FP8 KV dtype; keep the BF16 KV default. |
| #21291 | `flashmla_kv` decode accuracy regression on B200 | Force TRT-LLM backends for NSA decode and prefill. |
| #17526 | FP8 KV slower than BF16 due to quantization overhead | Keep the BF16 KV default. |
| DeepGEMM JIT | 10–15 min blocked compile on cold start if not pre-cached | Run `compile_deepgemm` once on B200; persist to the `glm51-deepgemm-cache` volume. |
| Watchdog default | Process killed mid weight load | `--watchdog-timeout 1200` in the launch command. |

Next steps

  • Tune & Operate covers EAGLE trade-offs, BF16 KV rationale, warmup & keep-alive, and diagnostics.
  • Code Walkthrough shows how flags are assembled and how subprocess lifecycle is managed.

05 · References