GLM-5.1 FP8 on Modal
Tune & Operate
Performance tuning, cold starts, diagnostics, and warning triage.
Tuning goal: maximize throughput on 8×B200 while keeping TTFT under ~300 ms for interactive workloads. This page also triages compile noise, runtime diagnostics, and cold-start remediation: the operational complement to the deployment pipeline.
Modal & SGLang parameters
Replica counts and @modal.concurrent interact with SGLang batching; adjust them together, not in isolation.
@app.cls scaling
- min_containers (default: 0): scale to zero when idle
- max_containers (default: 3): max concurrent replicas (3 × 48 = 144 in-flight decode slots)
- scaledown_window (default: 900): seconds idle before Modal scales down
@modal.concurrent
- max_inputs (default: 20): Modal-level queue per container before spilling to a new replica
_build_sglang_cmd()
- --mem-fraction-static (default: 0.88): share of post-weight VRAM for the KV cache
- --max-running-requests (default: 48): cap on concurrent decodes; beyond this the EAGLE verifier queue grows
- --max-prefill-tokens (default: 32768): prefill batch safety cap
- --max-total-tokens (default: 65536): per-request token budget (input + output)
- --watchdog-timeout (default: 1200): 20 minutes; must exceed the multi-minute weight load
- --enable-metrics (default: True): expose /metrics (Prometheus)
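As a sanity check on how these knobs compose, the fleet-level capacity implied by the defaults above works out to:

```python
# Capacity implied by the defaults above (pure arithmetic).
MAX_CONTAINERS = 3          # max_containers
MAX_RUNNING_REQUESTS = 48   # --max-running-requests per replica
CONCURRENT_MAX_INPUTS = 20  # @modal.concurrent(max_inputs=...)

# SGLang-side ceiling: in-flight decode slots across the fleet.
total_decode_slots = MAX_CONTAINERS * MAX_RUNNING_REQUESTS  # 144

# Modal-side ceiling: inputs admitted before spilling to a new replica.
# Note it is lower than the SGLang cap, so Modal spills first.
modal_admission = MAX_CONTAINERS * CONCURRENT_MAX_INPUTS    # 60

print(total_decode_slots, modal_admission)
```

Because the Modal admission ceiling (60) sits below the SGLang decode cap (144), raising max_inputs without touching --max-running-requests shifts queueing from Modal into SGLang; adjust them together.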
EAGLE v2 speculative decoding
GLM-5.1 ships a multi-token prediction head that EAGLE v2 uses as its draft model, so effective decode throughput rises sharply without running a second draft-model process.
| Metric | Autoregressive | EAGLE v2 | Delta |
|---|---|---|---|
| TPOT | ~20 ms | ~7.7 ms | ~2.6× faster |
| Aggregate decode tok/s | ~1,750 | ~4,600+ | ~2.6× |
Trade-off: under very high concurrency, EAGLE verification can lengthen TTFT. The --max-running-requests 48 cap is the guardrail.
```bash
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```
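The ~2.6× figure falls out of a simple model: each EAGLE verify step costs roughly one autoregressive step but emits one token plus however many draft tokens were accepted. A sketch, where the 2.6 accepted tokens per verify step is an assumption chosen to match the table above:

```python
# Rough EAGLE speedup model (illustrative, not a benchmark).
base_tpot_ms = 20.0      # autoregressive TPOT from the table above
tokens_per_step = 2.6    # assumed mean tokens emitted per verify step

# Each verify step costs ~one base step but yields tokens_per_step tokens.
effective_tpot_ms = base_tpot_ms / tokens_per_step
print(round(effective_tpot_ms, 1))  # 7.7 (ms), matching the table
```

The same model explains the TTFT trade-off: verify steps add work per forward pass, so under heavy concurrency the prefill of new requests waits longer behind them.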
Why BF16 KV cache with FP8 weights
Weights stay FP8 for capacity and math throughput, but the KV cache remains BF16 (deliberate, not an oversight).
- Stability: EAGLE + FP8 KV has a known crash path on Blackwell (SGLang #22359).
- Accuracy: Some flash MLA KV paths regressed quality on B200 (#21291); TRT-LLM NSA backends pair cleanly with BF16 KV.
- Speed: FP8 KV can be slower than BF16 once quant/dequant overhead is included (#17526).
Practically: omit --kv-cache-dtype fp8_e4m3 so SGLang keeps BF16 KV defaults. Full flag matrix: Configuration & Flags.
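Putting the flags on this page together, a minimal sketch of _build_sglang_cmd; the --model-path value and --tp 8 are assumptions (8×B200 implies TP=8), and the deliberate absence of --kv-cache-dtype is the point:

```python
def build_sglang_cmd(model_path: str) -> list[str]:
    """Sketch of _build_sglang_cmd using the flags from this page.

    There is intentionally no --kv-cache-dtype flag, so SGLang keeps
    its BF16 KV cache default alongside the FP8 weights.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", "8",                        # assumption: one rank per B200
        "--mem-fraction-static", "0.88",
        "--max-running-requests", "48",
        "--max-prefill-tokens", "32768",
        "--max-total-tokens", "65536",
        "--watchdog-timeout", "1200",
        "--enable-metrics",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
    ]

cmd = build_sglang_cmd("/models/GLM-5.1-FP8")
assert "--kv-cache-dtype" not in cmd  # deliberate: BF16 KV
```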
Concurrency & memory
If you see OOM under load, drop mem-fraction toward 0.85 before raising max-running-requests.
--mem-fraction-static 0.88 aggressively reserves KV memory after weights load. For batch-heavy workloads you can experiment above 48 running requests, but watch TTFT percentiles when you do.
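A back-of-envelope KV budget, using this page's reading of --mem-fraction-static as a share of post-weight VRAM. All figures are illustrative (192 GB per B200, ~700 GB of FP8 weights sharded over TP=8); SGLang's actual memory accounting differs in detail:

```python
# Back-of-envelope KV pool size per GPU (illustrative assumptions).
GPU_VRAM_GB = 192              # B200
WEIGHTS_PER_GPU_GB = 700 / 8   # ~700 GB checkpoint sharded over TP=8
MEM_FRACTION_STATIC = 0.88

kv_pool_gb = (GPU_VRAM_GB - WEIGHTS_PER_GPU_GB) * MEM_FRACTION_STATIC
print(round(kv_pool_gb, 1))  # 92.0 (GB) per GPU at 0.88

# Dropping toward 0.85 trades KV capacity for activation headroom:
kv_pool_at_085 = (GPU_VRAM_GB - WEIGHTS_PER_GPU_GB) * 0.85
print(round(kv_pool_gb - kv_pool_at_085, 1))  # ~3 GB freed per GPU
```

This is why the OOM remedy above is "lower mem-fraction first": each 0.01 step frees about a gigabyte per GPU for transient allocations without touching batch shape.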
Stability: watchdog & crash monitor
Weight staging from a 700 GB volume can exceed default SGLang watchdog windows.
Production deploy.py sets --watchdog-timeout 1200. After startup, a background thread calls os._exit(1) if the SGLang child exits so Modal replaces the container instead of serving endless 502s.
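The crash monitor can be sketched as follows, assuming the SGLang child is held as a subprocess.Popen handle (a minimal version of what the reference deploy.py does):

```python
import os
import subprocess
import threading

def start_crash_monitor(proc: subprocess.Popen) -> threading.Thread:
    """If the SGLang child exits for any reason, hard-exit the container
    so Modal replaces it instead of leaving a stale replica serving 502s."""
    def _watch() -> None:
        code = proc.wait()  # blocks until the child process exits
        print(f"[monitor] FATAL: SGLang died (code {code})")
        os._exit(1)  # skip Python cleanup on purpose; Modal recycles us
    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return thread
```

os._exit (not sys.exit) matters here: it terminates immediately from a background thread without unwinding the web server, which is exactly the failure signal Modal needs.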
Cold start mitigation
Warmup hits diverse chat paths; optional Modal cron can ping /health during business hours.
After /health succeeds, _warmup() issues four diverse chat completions to capture CUDA graphs. Optionally enable ENABLE_KEEPALIVE_CRON and set DEPLOYED_URL.
```python
# deploy.py (excerpt) - optional business-hours keep-alive
if ENABLE_KEEPALIVE_CRON:
    @app.function(schedule=modal.Cron("*/14 6-22 * * MON-FRI"))
    def keep_warm():
        requests.get(f"{DEPLOYED_URL}/health", timeout=120)
```
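The _warmup() step can be sketched as follows. The four payload shapes mirror the warmup list on this page (chat, no-thinking, long context, tools), but the port, model name, and fields like chat_template_kwargs are assumptions, and stdlib urllib stands in for requests to keep the sketch dependency-free:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: SGLang's local port

def warmup_payloads() -> list[dict]:
    """Four diverse warmup requests: plain chat, no-thinking,
    long context, and tool calling. Field values are illustrative."""
    long_doc = "lorem ipsum " * 2000  # exercise the long-context path
    return [
        {"messages": [{"role": "user", "content": "Hello"}]},
        {"messages": [{"role": "user", "content": "2+2?"}],
         "chat_template_kwargs": {"enable_thinking": False}},
        {"messages": [{"role": "user", "content": "Summarize: " + long_doc}]},
        {"messages": [{"role": "user", "content": "Weather in Oslo?"}],
         "tools": [{"type": "function", "function": {
             "name": "get_weather",
             "parameters": {"type": "object", "properties": {}}}}]},
    ]

def warmup() -> None:
    for payload in warmup_payloads():
        payload.update(model="glm-5.1", max_tokens=32)
        req = urllib.request.Request(
            BASE_URL + "/v1/chat/completions",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=300).read()
```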
Cold start phases (reminder)
- Allocation + container spin-up
- Weight load: ~700 GB at volume throughput
- Kernel cache load: fast when the pre-compiled cache is present
- Warmup: requests after /health OK (chat, no-thinking, long context, tools)
Cost strategies
| Strategy | Monthly (illustrative) | Cold profile | Best for |
|---|---|---|---|
| Scale-to-zero (~8h/day traffic) | ~$12,000 | 6–10 min after ~15 min idle | Dev, internal tools, predictable daytime usage |
| Business-hours keep-alive cron | ~$18,000 | None during hours; overnight/weekend cold possible | Production APIs with daytime SLA |
| Always-on (min_containers=1) | ~$36,000 | None | 24/7 low-latency SaaS |
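One set of assumptions that reproduces the table's round numbers; the ~$50/hr rate for 8×B200 is purely illustrative, and "business hours" is modeled as a 12 GPU-hour/day equivalent:

```python
# Illustrative arithmetic behind the cost table (assumed rate).
HOURLY_8XB200 = 50.0  # assumption: ~$50/hr for an 8×B200 replica

always_on = HOURLY_8XB200 * 24 * 30       # min_containers=1, 24/7
business_hours = HOURLY_8XB200 * 12 * 30  # keep-alive cron equivalent
scale_to_zero = HOURLY_8XB200 * 8 * 30    # ~8h/day of actual traffic

print(always_on, business_hours, scale_to_zero)
```

The spread matters more than the absolute numbers: always-on costs roughly 3× scale-to-zero, so the cron strategy buys daytime SLA at about half the always-on premium.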
Tuning triangle
| Dimension | Our setting | Effect |
|---|---|---|
| Latency | EAGLE + max-running-requests 48 | Sub-300 ms TTFT target under moderate load; ~8 ms TPOT class |
| Throughput | mem-fraction 0.88 + BF16 KV | High aggregate tok/s across slots |
| Stability | watchdog 1200 + crash monitor + volume reload | Clean recycle on failure; consistent volume view |
Upstream mitigations (summary)
| Bug | Impact | Mitigation |
|---|---|---|
| SGLang #22359 | EAGLE + FP8 KV crash | BF16 KV cache (omit FP8 KV dtype) |
| SGLang #21291 | flashmla_kv decode accuracy on B200 | TRT-LLM NSA backends (decode + prefill) |
| SGLang #17526 | FP8 KV slower than BF16 | BF16 KV cache |
| SGLang #19796 | EAGLE NaN on radix (sm120) | B200 is sm100 - not affected |
Operations: compile triage, checklists, diagnostics, and upgrades.
Compilation & runtime warnings
Most warnings during modal run deploy.py::compile_deepgemm are benign. Use Find on this page to match log lines.
| # | Severity | Message | Action |
|---|---|---|---|
| 1 | informational | FastAPI ORJSONResponse deprecation (SGLang internal) | Ignore - upstream; no functional impact |
| 2 | informational | Generation flags like top_p reported invalid during compile warmup | Ignore - compile-mode artifact |
| 3 | informational | Unexpected error during package walk: cutlass.cute.experimental | Ignore - FlashInfer autotuner noise; autotuning still completes |
| 4 | informational | torch.Tensor return type deprecated (flashinfer.jit) | Monitor - works today; upgrade when SGLang bumps FlashInfer |
| 5 | informational | Leaked semaphore / shared_memory on multi-process shutdown | Ignore - normal Python cleanup noise after TP workers exit |
| 6 | informational | Gloo Rank 0 connected to 0 peer ranks | Ignore - NCCL used for GPU comm; Gloo for local control groups |
| 7 | monitor | DeepGEMM enabled but scale_fmt of checkpoint is not ue8m0 | Watch output quality; typical for E4M3 FP8 on Blackwell |
| 8 | informational | KV cache dtype set to fp8_e4m3 during compile_deepgemm on sm100 | OK for compile step - serving uses BF16 KV per deploy config |
| 9 | informational | FP8 KV cache with no scaling factors - defaulting to 1.0 | OK during compile; serving avoids FP8 KV |
| 10 | informational | Force NSA prefill to sparse MLA (MHA_ONE_SHOT disabled) on Blackwell | Expected - TRT-LLM sparse MLA path for GLM-5.1 on B200 |
Deployment review checklist
Independent audit items; several are already addressed in the reference deploy.py.
| Severity | Issue | Impact | Mitigation / fix |
|---|---|---|---|
| Critical | serve() vs startup() ordering with @modal.web_server | Race: traffic routed before port listens | Align with Modal large-model pattern (subprocess in serve or experimental http_server) |
| Critical | No subprocess stdout/stderr capture on crash | Silent failures in Modal logs | Pipe stdout/stderr and stream in a background thread (deploy.py adds log streaming) |
| Significant | region= on @app.cls may be invalid | Deployment region not guaranteed | Use supported regional APIs per Modal docs |
| Significant | Default watchdog too short for 700 GB load | Intermittent startup kills mid-load | Set --watchdog-timeout 1200 in _build_sglang_cmd (present in reference deploy.py) |
| Significant | No crash detection while serving | Stale container after SGLang exit | Crash monitor thread → os._exit(1) (reference deploy.py) |
| Significant | Fragile download idempotency (directory file count) | Partial downloads mistaken as complete | Check sentinel / weight shards explicitly |
| Significant | Missing volume.reload() on server startup | Stale volume view across containers | Reload model + DeepGEMM volumes in startup path |
| Optimization | compile_deepgemm uses 8×B200 | Higher $/hr during one-time compile | Accept for SM-specific kernels; or explore supported cheaper GPU if shapes allow |
| Optimization | Consider modal.experimental.http_server | Latency / lifecycle handling for huge models | Evaluate vs @modal.web_server for your Modal SDK version |
| Optimization | Radix cache overhead for single-shot API traffic | KV memory headroom | Consider --disable-radix-cache if workload is mostly single-turn |
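The download-idempotency fix from the checklist can be sketched as an explicit check; the sentinel filename and shard glob are assumptions:

```python
from pathlib import Path

def download_complete(model_dir: Path, expected_shards: int) -> bool:
    """Explicit idempotency check instead of a bare directory file count:
    require a sentinel written only after a verified download, plus the
    exact number of safetensors shards."""
    sentinel = model_dir / ".download-complete"  # assumed sentinel name
    shards = list(model_dir.glob("*.safetensors"))
    return sentinel.exists() and len(shards) == expected_shards
```

The downloader touches the sentinel only as its final step, so a container killed mid-download can never look complete on the next cold start.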
Upstream bug mitigations
Config levers tie back to the flag tables on Configuration & Flags.
| Issue | Symptom | Mitigation | Config lever |
|---|---|---|---|
| SGLang #22359 | EAGLE + FP8 KV crash | BF16 KV cache (omit FP8 KV dtype) | Omit --kv-cache-dtype fp8 |
| SGLang #21291 | flashmla_kv decode accuracy on B200 | TRT-LLM NSA backends (decode + prefill) | --nsa-*-backend trtllm |
| SGLang #17526 | FP8 KV slower than BF16 | BF16 KV cache | Omit --kv-cache-dtype fp8 |
| SGLang #19796 | EAGLE NaN on radix (sm120) | B200 is sm100 - not affected | N/A (sm100) |
Runtime diagnostics
After deploy, these endpoints should respond from the same host that serves chat completions.
```bash
# Readiness (200 only after weights + warmup path completes)
curl -f https://<your-app>.modal.run/health

# Prometheus text exposition
curl https://<your-app>.modal.run/metrics

# Modal platform logs
modal app logs glm-5.1-production
```
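For quick checks without a Prometheus server, the /metrics text can be parsed directly. A tiny sketch; the sglang:* series names are examples of what SGLang exposes and should be verified against your build:

```python
def parse_prom_text(text: str) -> dict[str, float]:
    """Minimal Prometheus text-format parser: skips comments, keeps the
    full series name (labels included) as the key."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # ignore lines with non-numeric trailing fields
    return out

sample = """# HELP sglang:num_running_reqs running requests
sglang:num_running_reqs 12
sglang:token_usage 0.41
"""
metrics = parse_prom_text(sample)
print(metrics["sglang:num_running_reqs"])  # 12.0
```

Watching sglang:num_running_reqs against the --max-running-requests 48 cap is the fastest way to see whether the EAGLE TTFT guardrail is being hit.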
Log triage
| Pattern | Meaning | Action |
|---|---|---|
| [health] passed on attempt 1 | Fast path - hot volumes / graphs | None |
| [health] passed on attempt 15+ | Slow weight load or graph capture | If near 900s timeout, check volume I/O; consider longer startup_timeout |
| [monitor] FATAL: SGLang died (code -9) | OOM killer | Lower --mem-fraction-static (e.g. 0.88 → 0.85) |
| [monitor] FATAL: SGLang died (code 137) | SIGKILL / platform kill | Review Modal timeout, manual stops, scaledown |
| DeepGEMM / compile success | Kernels cached for future cold starts | None - verify marker file exists |
Cold start remediation
- Business-hours keep-alive: set ENABLE_KEEPALIVE_CRON = True, fill DEPLOYED_URL, redeploy.
- Longer idle hold: raise scaledown_window beyond 900 s if bursts are wider than 15 minutes.
- Always warm: set min_containers=1 to eliminate cold starts at the cost of baseline GPU spend (see cost strategies above).
Upgrading SGLang or switching GPUs
- Bump SGLANG_IMAGE or GPU_TYPE in deploy.py. Align the image tag with the SGLang release you validated.
- Invalidate the DeepGEMM cache on any GPU architecture change. Remove .compiled-GLM-5.1-FP8 from glm51-deepgemm-cache, then re-run compile_deepgemm on the new SKU.
- Tighten --mem-fraction-static if VRAM per GPU drops. Example: B200 192 GB → H200 141 GB; try 0.88 → 0.83 after testing KV headroom.
```bash
# Example: drop DeepGEMM marker to force recompile on new SM
modal volume rm glm51-deepgemm-cache .compiled-GLM-5.1-FP8
modal run deploy.py::compile_deepgemm
```