GLM-5.1 FP8 on Modal
Tune & Operate
Performance tuning, cold starts, diagnostics, and warning triage.
Tuning goal: maximize throughput on 8×B200 while keeping TTFT under ~300 ms for interactive workloads. This page also covers compile-warning triage, runtime diagnostics, and cold-start remediation: the operational complement to the deployment pipeline.
Modal & SGLang parameters
Replica counts and @modal.concurrent interact with SGLang batching; adjust them together, not in isolation.
@app.cls scaling
- min_containers (default: 0): Scale to zero when idle
- max_containers (default: 3): Max concurrent replicas (3 × 48 ≈ 144 in-flight decode slots)
- scaledown_window (default: 900): Seconds idle before Modal scales down
@modal.concurrent
- max_inputs (default: 20): Modal-level queue per container before spill to new replica
_build_sglang_cmd()
- --mem-fraction-static (default: 0.88): Fraction of each GPU's memory reserved for static allocation (weights plus the KV cache pool)
- --max-running-requests (default: 48): Cap on concurrent decodes; the EAGLE verifier queue grows past this
- --max-prefill-tokens (default: 32768): Prefill batch safety cap
- --max-total-tokens (default: 65536): Token capacity of the KV cache pool (input + output tokens across running requests); per-request length is bounded separately by the context length
- --watchdog-timeout (default: 1200): 20 min; must exceed the multi-minute weight load
- --enable-metrics (default: True): Expose /metrics (Prometheus)
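Because the Modal and SGLang limits stack, it can help to work out which one binds first. The sketch below is a back-of-the-envelope model, not part of deploy.py: the `ScalingConfig` and `effective_in_flight` names are hypothetical, and it assumes each Modal input corresponds to exactly one in-flight SGLang request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    max_containers: int = 3         # @app.cls max_containers
    max_inputs: int = 20            # @modal.concurrent max_inputs
    max_running_requests: int = 48  # SGLang --max-running-requests


def effective_in_flight(cfg: ScalingConfig) -> dict:
    """Estimate how many requests can be in flight, per replica and fleet-wide."""
    # Assumption: one Modal input == one HTTP request == one SGLang request.
    per_replica = min(cfg.max_inputs, cfg.max_running_requests)
    limited_by = "modal" if cfg.max_inputs < cfg.max_running_requests else "sglang"
    return {
        "per_replica": per_replica,
        "fleet": per_replica * cfg.max_containers,
        "sglang_ceiling": cfg.max_running_requests * cfg.max_containers,
        "limited_by": limited_by,
    }


report = effective_in_flight(ScalingConfig())
print(report)
```

Under these assumptions the defaults give 20 requests per replica and 60 fleet-wide against a 144-slot SGLang ceiling, so the Modal limit would bind first. Treat that as a hypothesis to check against /metrics rather than a measured result.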
EAGLE v2 speculative decoding
GLM-5.1 ships a multi-token prediction head. Draft quality is high enough that effective decode throughput rises sharply without a second model process.
| Metric | Autoregressive | EAGLE v2 | Delta |
|---|---|---|---|
| TPOT | ~20 ms | ~7.7 ms | ~2.6× faster |
| Aggregate decode tok/s | ~1,750 | ~4,600+ | ~2.6× |
Trade-off: under very high concurrency, EAGLE verification can lengthen TTFT. The --max-running-requests 48 cap is the guardrail.
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
Why BF16 KV cache with FP8 weights
Weights stay FP8 for capacity and math throughput, but the KV cache deliberately remains BF16, for three reasons:
- Stability: EAGLE + FP8 KV has a known crash path on Blackwell (SGLang #22359).
- Accuracy: Some flash MLA KV paths regressed quality on B200 (#21291); TRT-LLM NSA backends pair cleanly with BF16 KV.
- Speed: FP8 KV can be slower than BF16 once quant/dequant overhead is included (#17526).
Practically: omit --kv-cache-dtype fp8_e4m3 so SGLang keeps BF16 KV defaults. Full flag matrix: Configuration & Flags.
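A minimal sketch of how the flags discussed so far might be assembled into a launch command. This is not the reference `_build_sglang_cmd`: the function name `build_sglang_cmd`, the `--tp 8` setting, and the default model path are assumptions made for illustration. The point it shows is the absence of `--kv-cache-dtype`.

```python
from typing import List


def build_sglang_cmd(model_path: str = "/model-cache/GLM-5.1-FP8",
                     tp_size: int = 8) -> List[str]:
    """Assemble an SGLang launch command from the flags listed on this page."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp_size),  # assumption: one tensor-parallel rank per GPU
        # Memory and concurrency
        "--mem-fraction-static", "0.88",
        "--max-running-requests", "48",
        "--max-prefill-tokens", "32768",
        "--max-total-tokens", "65536",
        # Stability and observability
        "--watchdog-timeout", "1200",
        "--enable-metrics",
        # EAGLE speculative decoding
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
    ]
    # Deliberately no --kv-cache-dtype: leaving it unset keeps the BF16 KV default.
    return cmd


cmd = build_sglang_cmd()
print(" ".join(cmd))
```

Building the command as a list (rather than one shell string) makes it easy to assert in a test that no FP8 KV flag slipped in.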
Concurrency & memory
If you see OOM under load, lower --mem-fraction-static toward 0.85 before raising --max-running-requests.
--mem-fraction-static 0.88 reserves KV cache space aggressively once weights are loaded. For batch-heavy workloads you can experiment above 48 running requests, but watch TTFT percentiles when you do.
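To see what moving from 0.88 to 0.85 buys, here is a rough per-GPU budget. It is a sketch under stated assumptions, not a measurement: it assumes the fraction applies to total per-GPU memory and covers weights plus the KV pool (as the SGLang documentation describes it), that the ~700 GB of weights shard evenly across 8 GPUs, and that each B200 exposes the full 192 GB. The function names are hypothetical.

```python
def kv_budget_gb(mem_fraction: float, vram_gb: float = 192.0,
                 weights_total_gb: float = 700.0, num_gpus: int = 8) -> float:
    """Approximate KV cache pool per GPU: static pool minus this GPU's weight shard."""
    static_pool = vram_gb * mem_fraction
    weights_per_gpu = weights_total_gb / num_gpus
    return static_pool - weights_per_gpu


def slack_gb(mem_fraction: float, vram_gb: float = 192.0) -> float:
    """Memory left outside the static pool (activations, CUDA graphs, workspace)."""
    return vram_gb * (1.0 - mem_fraction)


for frac in (0.88, 0.85):
    print(f"mem-fraction {frac:.2f}: "
          f"~{kv_budget_gb(frac):.1f} GB KV pool, ~{slack_gb(frac):.1f} GB slack per GPU")
```

In this model, dropping to 0.85 gives up roughly 6 GB of KV pool per GPU in exchange for roughly 6 GB more slack, which is why it is the first lever to try when OOM appears under load.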
Stability: watchdog & crash monitor
Weight staging from a 700 GB volume can exceed default SGLang watchdog windows.
Production deploy.py sets --watchdog-timeout 1200. After startup, a background thread calls os._exit(1) if the SGLang child exits so Modal replaces the container instead of serving endless 502s.
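The crash monitor described above can be sketched as a daemon thread that blocks on the child process and exits the container when it dies. This is an illustration, not the reference `_monitor_process`: the `start_crash_monitor` name and the injectable `on_death` callback (added so the logic can be exercised without killing the interpreter) are assumptions.

```python
import os
import subprocess
import sys
import threading
from typing import Callable


def start_crash_monitor(proc: subprocess.Popen,
                        on_death: Callable[[int], None] = os._exit) -> threading.Thread:
    """Watch `proc` in the background; call on_death(1) as soon as it exits."""

    def _watch() -> None:
        code = proc.wait()  # blocks until the child terminates
        print(f"[monitor] FATAL: SGLang died (code {code})", file=sys.stderr, flush=True)
        # os._exit skips cleanup handlers, so the container dies immediately
        # and the platform can replace it instead of serving 502s.
        on_death(1)

    thread = threading.Thread(target=_watch, name="sglang-crash-monitor", daemon=True)
    thread.start()
    return thread
```

In a deployment the monitor would be started only after startup succeeds, so that a slow weight load is handled by the watchdog and startup timeout rather than by this thread.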
Cold start mitigation
Warmup hits diverse chat paths; optional Modal cron can ping /health during business hours.
After /health succeeds, _warmup() issues four diverse chat completions to capture CUDA graphs. Optionally enable ENABLE_KEEPALIVE_CRON and set DEPLOYED_URL.
    # deploy.py (excerpt) - optional business-hours keep-alive
    if ENABLE_KEEPALIVE_CRON:
        @app.function(schedule=modal.Cron("*/14 6-22 * * MON-FRI"))
        def keep_warm():
            requests.get(f"{DEPLOYED_URL}/health", timeout=120)
Cold start phases (reminder)
- Allocation + container spin-up
- Weight load: ~700 GB at volume throughput
- DeepGEMM kernel load: when pre-compiled cache is present
- Warmup requests after /health OK: chat, no-thinking, long context, tools
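The warmup phase (four request types once /health passes) might look like the sketch below. It is not the reference `_warmup()`: the payload contents, the `get_time` tool, and passing `enable_thinking` through `chat_template_kwargs` are assumptions, and the HTTP call is injected as a `post` callable so the sketch stays independent of any particular client library.

```python
from typing import Callable, List

MODEL = "glm-5.1"  # model name used in the curl example on this page


def build_warmup_payloads() -> List[dict]:
    """One payload per warmup path: chat, no-thinking, long context, tools."""
    chat = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hi"}],
        "max_tokens": 16,
    }
    # Assumption: thinking is toggled through chat_template_kwargs.
    no_thinking = {**chat, "chat_template_kwargs": {"enable_thinking": False}}
    long_context = {
        **chat,
        "messages": [{"role": "user", "content": "Summarize: " + "lorem ipsum " * 2000}],
    }
    tools = {
        **chat,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_time",  # hypothetical tool, for graph capture only
                "description": "Return the current time.",
                "parameters": {"type": "object", "properties": {}},
            },
        }],
    }
    return [chat, no_thinking, long_context, tools]


def run_warmup(post: Callable[[dict], object]) -> int:
    """Send every warmup payload through `post`; return how many succeeded."""
    succeeded = 0
    for payload in build_warmup_payloads():
        try:
            post(payload)
            succeeded += 1
        except Exception as exc:  # warmup failures are logged, not fatal
            print(f"[warmup] request failed: {exc}")
    return succeeded
```

A caller would pass something like a function that POSTs the payload to /v1/chat/completions with a generous timeout, since the first request on each path may trigger graph capture.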
Cost strategies
| Strategy | Monthly (illustrative) | Cold profile | Best for |
|---|---|---|---|
| Scale-to-zero (~8h/day traffic) | ~$12,000 | 6–10 min after ~15 min idle | Dev, internal tools, predictable daytime usage |
| Business-hours keep-alive cron | ~$18,000 | None during hours; overnight/weekend cold possible | Production APIs with daytime SLA |
| Always-on (min_containers=1) | ~$36,000 | None | 24/7 low-latency SaaS |
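The three figures are consistent with a single hourly rate, which makes it easy to re-run the estimate for your own traffic shape. The sketch below derives that rate from the always-on row (an inference from the table, not a quoted Modal price) and assumes a 30-day month; the keep-alive hours are read off the `6-22 * * MON-FRI` cron window, which spans 17 clock hours on 5 days a week.

```python
ALWAYS_ON_MONTHLY_USD = 36_000   # from the table (illustrative)
HOURS_PER_MONTH = 24 * 30        # assumption: 30-day month

# Implied cost of one 8xB200 replica per hour.
implied_hourly_usd = ALWAYS_ON_MONTHLY_USD / HOURS_PER_MONTH


def monthly_cost(active_hours: float, hourly_usd: float = implied_hourly_usd) -> float:
    """Monthly spend for a given number of warm replica-hours."""
    return active_hours * hourly_usd


# Scale-to-zero with ~8 h/day of traffic.
scale_to_zero_usd = monthly_cost(8 * 30)

# Keep-alive cron: hours 6 through 22 inclusive (17 h), Monday to Friday,
# averaged over 52 weeks / 12 months. Ignores the trailing scaledown window.
keepalive_hours = 17 * 5 * 52 / 12
keepalive_usd = monthly_cost(keepalive_hours)

print(f"implied rate: ${implied_hourly_usd:.2f}/h")
print(f"scale-to-zero: ${scale_to_zero_usd:,.0f}  keep-alive: ${keepalive_usd:,.0f}")
```

The scale-to-zero estimate excludes time spent in cold starts and the idle scaledown window, both of which are billed, so real spend will sit somewhat above the active-traffic figure.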
Tuning triangle
| Dimension | Our setting | Effect |
|---|---|---|
| Latency | EAGLE + max-running-requests 48 | Sub-300 ms TTFT target under moderate load; ~8 ms TPOT class |
| Throughput | mem-fraction 0.88 + BF16 KV | High aggregate tok/s across slots |
| Stability | watchdog 1200 + crash monitor + volume reload | Clean recycle on failure; consistent volume view |
Performance baselines
Expected performance on 8×B200 with EAGLE enabled. Use these as reference points when benchmarking your deployment.
| Metric | 8×B200 Value | Condition | Notes |
|---|---|---|---|
| Time to First Token (TTFT) | ~246 ms | Warm, low concurrency, pre-captured graphs | Inflates under high concurrency due to EAGLE verify queue |
| Time Per Output Token (TPOT) | ~7.7 ms | EAGLE enabled, low concurrency | ~20 ms without EAGLE (2.6× slower) |
| Decode throughput (per user) | 30-75 tok/s | Varies by concurrency and output length | Higher at low concurrency, lower when batched |
| Aggregate throughput | ~4,600+ tok/s | All slots active across 8 GPUs | Combined across --max-running-requests slots |
| EAGLE accept length | ~3.5 tokens | Typical drafting acceptance | Consistent across hardware types |
| Max concurrent requests | 48 per replica | --max-running-requests 48 | Cap prevents TTFT inflation at high load |
| Cold start total | 6-10 min | Pre-compiled DeepGEMM, from scale-to-zero | 15+ min without pre-compiled kernels |
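When comparing your deployment against these baselines, TTFT and TPOT can be computed from token arrival times in a streaming response. The helper below is a minimal sketch with a hypothetical name; it assumes you record one timestamp per generated token, which matters with speculative decoding because a single streamed chunk may carry several tokens.

```python
from typing import List, Tuple


def ttft_and_tpot_ms(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in milliseconds from timestamps given in seconds.

    TTFT is the delay until the first token; TPOT is the mean gap between
    subsequent tokens, so the first token is excluded from it.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    ttft = (token_times[0] - request_sent) * 1000.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1) * 1000.0
    return ttft, tpot


# Synthetic trace shaped like the baseline row: first token at 246 ms,
# then one token every 7.7 ms.
trace = [0.246 + 0.0077 * i for i in range(101)]
ttft_ms, tpot_ms = ttft_and_tpot_ms(0.0, trace)
print(f"TTFT {ttft_ms:.0f} ms, TPOT {tpot_ms:.1f} ms")
```

Collect these per request and report percentiles rather than means, since the TTFT inflation noted in the table shows up in the tail first.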
Upstream mitigations (summary)
| Bug | Impact | Mitigation |
|---|---|---|
| SGLang #22359 | EAGLE + FP8 KV crash | BF16 KV cache (omit FP8 KV dtype) |
| SGLang #21291 | flashmla_kv decode accuracy on B200 | TRT-LLM NSA backends (decode + prefill) |
| SGLang #17526 | FP8 KV slower than BF16 | BF16 KV cache |
| SGLang #19796 | EAGLE NaN on radix (sm120) | B200 is sm100 - not affected |
Operations: compile triage, checklists, diagnostics, and upgrades.
Compilation & runtime warnings
Most warnings during modal run deploy.py::compile_deepgemm are benign. Use Find on this page to match log lines.
| # | Severity | Message | Action |
|---|---|---|---|
| 1 | informational | FastAPI ORJSONResponse deprecation (SGLang internal) | Ignore - upstream; no functional impact |
| 2 | informational | Generation flags like top_p reported invalid during compile warmup | Ignore - compile-mode artifact |
| 3 | informational | Unexpected error during package walk: cutlass.cute.experimental | Ignore - FlashInfer autotuner noise; autotuning still completes |
| 4 | informational | torch.Tensor return type deprecated (flashinfer.jit) | Monitor - works today; upgrade when SGLang bumps FlashInfer |
| 5 | informational | Leaked semaphore / shared_memory on multi-process shutdown | Ignore - normal Python cleanup noise after TP workers exit |
| 6 | informational | Gloo Rank 0 connected to 0 peer ranks | Ignore - NCCL used for GPU comm; Gloo for local control groups |
| 7 | monitor | DeepGEMM enabled but scale_fmt of checkpoint is not ue8m0 | Watch output quality; typical for E4M3 FP8 on Blackwell |
| 8 | informational | KV cache dtype set to fp8_e4m3 during compile_deep_gemm on SM10 | OK for compile step - serving uses BF16 KV per deploy config |
| 9 | informational | FP8 KV cache with no scaling factors - defaulting to 1.0 | OK during compile; serving avoids FP8 KV |
| 10 | informational | Force NSA prefill to sparse MLA (MHA_ONE_SHOT disabled) on Blackwell | Expected - TRT-LLM sparse MLA path for GLM-5.1 on B200 |
Deployment review checklist
Independent audit items; several are already addressed in the reference deploy.py.
| Severity | Issue | Impact | Mitigation / fix |
|---|---|---|---|
| Critical | serve() vs startup() ordering with @modal.web_server | Race: traffic routed before port listens | Align with Modal large-model pattern (subprocess in serve or experimental http_server) |
| Critical | No subprocess stdout/stderr capture on crash | Silent failures in Modal logs | Pipe stdout/stderr and stream in a background thread (deploy.py adds log streaming) |
| Significant | region= on @app.cls may be invalid | Deployment region not guaranteed | Use supported regional APIs per Modal docs |
| Significant | Default watchdog too short for 700GB load | Intermittent startup kills mid load | Set --watchdog-timeout 1200 in _build_sglang_cmd (present in reference deploy.py) |
| Significant | No crash detection while serving | Stale container after SGLang exit | Crash monitor thread → os._exit(1) (reference deploy.py) |
| Significant | Fragile download idempotency (directory file count) | Partial downloads mistaken as complete | Check sentinel / weight shards explicitly |
| Significant | Missing volume.reload() on server startup | Stale volume view across containers | Reload model + DeepGEMM volumes in startup path |
| Optimization | compile_deepgemm uses 8×B200 | Higher $/hr during one-time compile | Accept for SM-specific kernels; or explore supported cheaper GPU if shapes allow |
| Optimization | Consider modal.experimental.http_server | Latency / lifecycle handling for huge models | Evaluate vs @modal.web_server for your Modal SDK version |
| Optimization | Radix cache overhead for single-shot API traffic | KV memory headroom | Consider --disable-radix-cache if workload is mostly single-turn |
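For the download idempotency item, an explicit completeness check might look like the sketch below. It assumes the checkpoint ships a Hugging Face style `model.safetensors.index.json` whose `weight_map` names every shard; the `.download-complete` sentinel name and the `download_is_complete` function are hypothetical and not taken from deploy.py.

```python
import json
from pathlib import Path

SENTINEL = ".download-complete"         # hypothetical marker written after the last shard
INDEX = "model.safetensors.index.json"  # Hugging Face sharded-checkpoint index


def download_is_complete(model_dir: Path) -> bool:
    """True only if the sentinel exists and every indexed shard is present and non-empty."""
    if not (model_dir / SENTINEL).is_file():
        return False
    index_path = model_dir / INDEX
    if not index_path.is_file():
        return False
    weight_map = json.loads(index_path.read_text())["weight_map"]
    shards = set(weight_map.values())  # many tensors map to the same shard file
    return all(
        (model_dir / shard).is_file() and (model_dir / shard).stat().st_size > 0
        for shard in shards
    )
```

Unlike a directory file count, this fails closed when a shard is missing or truncated to zero bytes. Comparing sizes or checksums against upstream metadata would be stricter still.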
Upstream bug mitigations
Each config lever below maps back to the flag tables in Configuration & Flags.
| Issue | Symptom | Mitigation | Config lever |
|---|---|---|---|
| SGLang #22359 | EAGLE + FP8 KV crash | BF16 KV cache (omit FP8 KV dtype) | Omit --kv-cache-dtype fp8 |
| SGLang #21291 | flashmla_kv decode accuracy on B200 | TRT-LLM NSA backends (decode + prefill) | --nsa-*-backend trtllm |
| SGLang #17526 | FP8 KV slower than BF16 | BF16 KV cache | Omit --kv-cache-dtype fp8 |
| SGLang #19796 | EAGLE NaN on radix (sm120) | B200 is sm100 - not affected | N/A (sm100) |
Runtime diagnostics
After deploy, these endpoints should respond from the same host that serves chat completions.
    # Readiness (200 only after weights + warmup path completes)
    curl -f https://<your-app>.modal.run/health

    # Prometheus text exposition
    curl https://<your-app>.modal.run/metrics

    # Modal platform logs
    modal app logs glm-5.1-production
Health check sequence
Step-by-step verification that your deployment is working correctly.
Check Modal app status
- Command: modal app list | grep glm-5.1-production
- Expected: Shows running app with endpoint URL
- If it fails: Run modal deploy deploy.py to create the app
Verify health endpoint
- Command: curl -f https://<your-app>.modal.run/health
- Expected: HTTP 200 (may take 6-10 min on cold start)
- If it fails: Check modal app logs for startup errors
Test basic inference
- Command: curl -X POST https://<your-app>.modal.run/v1/chat/completions -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" -d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'
- Expected: JSON response with assistant message
- If it fails: Check API_KEY secret is set correctly
Verify Prometheus metrics
- Command: curl https://<your-app>.modal.run/metrics | head -20
- Expected: Prometheus text format with sglang_* metrics
- If it fails: Ensure --enable-metrics is in launch command
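For the last step, the /metrics body can be filtered programmatically rather than by eye. The parser below is a simplified sketch of the Prometheus text exposition format (it skips comment lines and ignores optional timestamps). The metric names in `SAMPLE` are made up for illustration and are not actual SGLang metric names, and the `sglang` prefix follows the pattern quoted above, so adjust it to whatever your server emits.

```python
from typing import Dict

SAMPLE = """\
# HELP sglang_example_running Made-up gauge for illustration
# TYPE sglang_example_running gauge
sglang_example_running{model="glm-5.1"} 12
sglang_example_queue 3
process_cpu_seconds_total 1.5
"""


def parse_metrics(text: str, prefix: str = "sglang") -> Dict[str, float]:
    """Map each series (name plus labels) starting with `prefix` to its value."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        if "}" in line:
            cut = line.rindex("}") + 1  # label values may contain spaces
            series, rest = line[:cut], line[cut:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields or not series.startswith(prefix):
            continue
        samples[series] = float(fields[0])  # fields[1], if present, is a timestamp
    return samples


parsed = parse_metrics(SAMPLE)
print(parsed)
```

An empty result from a live endpoint is the signal to re-check that --enable-metrics made it into the launch command.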
Log triage
| Pattern | Meaning | Action |
|---|---|---|
| [health] passed on attempt 1 | Fast path - hot volumes / graphs | None |
| [health] passed on attempt 15+ | Slow weight load or graph capture | If near 900s timeout, check volume I/O; consider longer startup_timeout |
| [monitor] FATAL: SGLang died (code -9) | OOM killer | Lower --mem-fraction-static (e.g. 0.88 → 0.85) |
| [monitor] FATAL: SGLang died (code 137) | SIGKILL / platform kill | Review Modal timeout, manual stops, scaledown |
| DeepGEMM / compile success | Kernels cached for future cold starts | None - verify marker file exists |
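The table above can be turned into a small matcher for scanning `modal app logs` output. This is a sketch with a hypothetical `triage` function; the patterns mirror the log lines in the table, and treating health attempts 2 through 14 as healthy is an assumption, since the table only defines attempt 1 and 15+.

```python
import re
from typing import Optional

HEALTH = re.compile(r"\[health\] passed on attempt (\d+)")
DIED = re.compile(r"\[monitor\] FATAL: SGLang died \(code (-?\d+)\)")


def triage(line: str) -> Optional[str]:
    """Return a suggested action for a known log pattern, or None if unrecognized."""
    match = HEALTH.search(line)
    if match:
        if int(match.group(1)) >= 15:
            return "slow start: check volume I/O; consider a longer startup_timeout"
        return "no action"
    match = DIED.search(line)
    if match:
        code = int(match.group(1))
        if code == -9:
            return "OOM killer: lower --mem-fraction-static (e.g. 0.88 -> 0.85)"
        if code == 137:
            return "SIGKILL / platform kill: review Modal timeout, manual stops, scaledown"
        return f"unclassified exit code {code}: inspect surrounding logs"
    return None
```

Piping logs through this line by line and printing only non-None, non-"no action" results gives a quick first pass before reading the full output.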
Troubleshooting scenarios
Common problems and their solutions. Use Ctrl/Cmd+F to search for your specific symptom.
Container returns 502 Bad Gateway after running fine for hours
- Cause: SGLang subprocess crashed (OOM, CUDA error, NCCL timeout) but Modal container stayed alive
- Diagnose: Check modal app logs for [monitor] FATAL: SGLang died messages. Look for exit code -9 (OOM) or 137 (SIGKILL).
- Fix: The crash monitor should trigger os._exit(1) to replace the container. If not present, add the _monitor_process thread to deploy.py.
- Prevent: Lower --mem-fraction-static from 0.88 to 0.85 if OOM is recurring. Check for memory leaks in long sessions.
Cold start takes 15+ minutes instead of expected 6-10 minutes
- Cause: DeepGEMM kernels are being JIT-compiled at startup instead of loading from cache
- Diagnose: Run modal run deploy.py::verify_setup and check if .compiled-GLM-5.1-FP8 marker exists in the DeepGEMM volume.
- Fix: Run modal run deploy.py::compile_deepgemm once on B200 to pre-compile kernels. Ensure dg_volume.commit() is called after compilation.
- Prevent: Never change GPU_TYPE without recompiling DeepGEMM. The marker file includes the GPU type for verification.
FileNotFoundError: Model weights missing at /model-cache/GLM-5.1-FP8
- Cause: Volume eventual consistency; the server container started before the download commit propagated
- Diagnose: Check if download_model completed successfully and called model_volume.commit().
- Fix: Add model_volume.reload() at the start of the Server.setup() method. This forces a metadata refresh from the central store.
- Prevent: Always reload both volumes at startup. The reference deploy.py does this in @modal.enter().
TTFT spikes to 2-3 seconds under load (was ~250ms at low concurrency)
- Cause: EAGLE speculative decoding verification queue is backing up
- Diagnose: Check if concurrent requests exceed --max-running-requests (default 48). Look at /metrics for queue depth.
- Fix: Either scale max_containers from 3 to 4+, or reduce max-running-requests to 32-40 for lower TTFT variance.
- Note: EAGLE trades some TTFT for massive TPOT improvement. For latency-critical apps, cap concurrency lower.
Generation quality seems worse than expected / garbled output
- Cause: Using FP8 KV cache (crashes/accuracy issues) or wrong decode backend
- Diagnose: Check if --kv-cache-dtype fp8_e4m3 was accidentally set. Verify --nsa-decode-backend trtllm is present.
- Fix: Remove any FP8 KV cache flags. Ensure TRT-LLM backends are used for NSA decode and prefill.
- Note: The reference _build_sglang_cmd intentionally omits KV cache dtype to use BF16 default.
Warmup requests fail with 500 errors or timeouts
- Cause: CUDA graph capture taking longer than warmup timeout, or SGLang crashed during first inference
- Diagnose: Check logs for CUDA graph capture messages and timing. Verify subprocess is still running.
- Fix: Increase warmup timeout from 300s to 600s for first-time graph capture. Ensure warmup happens after /health passes.
- Note: The 4 diverse warmup requests trigger different CUDA graphs. First cold start is always slowest.
Tool calls return malformed JSON or wrong function names
- Cause: Missing or wrong tool parser flag for GLM-5.1
- Diagnose: Verify --tool-call-parser glm47 is in the launch command. Check the request includes properly formatted tools array.
- Fix: Add --tool-call-parser glm47 to _build_sglang_cmd. Ensure tools conform to OpenAI function calling schema.
reasoning_content is empty even though thinking should be enabled
- Cause: Missing reasoning parser or request explicitly disabled thinking
- Diagnose: Check for --reasoning-parser glm45 in launch command. Check if request had enable_thinking: false.
- Fix: Add --reasoning-parser glm45 to _build_sglang_cmd. Thinking is enabled by default if parser is present.
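Several of the scenarios above come down to a missing or wrong launch flag, which can be caught before deploy. The audit below is a sketch with hypothetical names; it checks only flags quoted verbatim on this page and assumes the command is a list with each flag and its value as separate items (it does not handle the `--flag=value` form).

```python
from typing import List, Optional

# Flag -> expected value, as quoted in the troubleshooting entries above.
REQUIRED = {
    "--nsa-decode-backend": "trtllm",
    "--tool-call-parser": "glm47",
    "--reasoning-parser": "glm45",
}


def flag_value(cmd: List[str], flag: str) -> Optional[str]:
    """Return the item following `flag`, or None if the flag is absent or last."""
    if flag in cmd:
        i = cmd.index(flag)
        if i + 1 < len(cmd):
            return cmd[i + 1]
    return None


def audit_launch_cmd(cmd: List[str]) -> List[str]:
    """List human-readable problems found in an SGLang launch command."""
    problems = []
    kv_dtype = flag_value(cmd, "--kv-cache-dtype")
    if kv_dtype is not None and kv_dtype.startswith("fp8"):
        problems.append(f"--kv-cache-dtype {kv_dtype} set; remove it to keep BF16 KV")
    for flag, expected in REQUIRED.items():
        actual = flag_value(cmd, flag)
        if actual != expected:
            problems.append(f"{flag} should be {expected} (found {actual})")
    return problems
```

Running this against the output of the command builder in CI turns the garbled-output, tool-call, and empty-reasoning scenarios into a failed check instead of a production incident.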
Cold start remediation
- Business-hours keep-alive: set ENABLE_KEEPALIVE_CRON = True, fill in DEPLOYED_URL, and redeploy.
- Longer idle hold: raise scaledown_window beyond 900s if gaps between traffic bursts exceed 15 minutes.
- Always warm: set min_containers=1 to eliminate cold starts at the cost of baseline GPU spend (see cost strategies above).
Upgrading SGLang or switching GPUs
- Bump SGLANG_IMAGE or GPU_TYPE in deploy.py: Align image tag with the SGLang release you validated.
- Invalidate DeepGEMM cache on GPU architecture change: Remove .compiled-GLM-5.1-FP8 from glm51-deepgemm-cache, then re-run compile_deepgemm on the new SKU.
- Tighten --mem-fraction-static if VRAM per GPU drops: for example, B200 192 GB → H200 141 GB; try 0.88 → 0.83 after testing KV headroom.
    # Example: drop DeepGEMM marker to force recompile on new SM
    modal volume rm glm51-deepgemm-cache .compiled-GLM-5.1-FP8
    modal run deploy.py::compile_deepgemm