GLM-5.1 FP8 on Modal

Tune & Operate

Performance tuning, cold starts, diagnostics, and warning triage.

Tuning goal: maximize throughput on 8×B200 while keeping TTFT under ~300 ms for interactive workloads. This page also covers compile-warning triage, runtime diagnostics, and cold-start remediation: the operational complement to the deployment pipeline.

Modal & SGLang parameters

Replica counts and @modal.concurrent interact with SGLang batching; adjust them together, not in isolation.

@app.cls scaling

Parameter	Default	Effect
min_containers	0	Scale to zero when idle
max_containers	3	Maximum concurrent replicas (3 × 48 = 144 in-flight decode slots)
scaledown_window	900	Seconds idle before Modal scales a container down

@modal.concurrent

Parameter	Default	Effect
max_inputs	20	Modal-level queue per container before spilling to a new replica
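
Because the Modal and SGLang limits stack, a quick capacity check helps when adjusting them together. The sketch below assumes each in-flight HTTP request occupies one Modal input slot, which this page does not confirm; `effective_slots` is a hypothetical helper, not part of deploy.py.

```python
def effective_slots(max_containers: int, max_inputs: int, max_running_requests: int) -> dict:
    """Estimate in-flight capacity when Modal and SGLang limits stack."""
    # A replica cannot decode more requests than Modal admits to it,
    # nor more than SGLang will schedule at once.
    per_replica = min(max_inputs, max_running_requests)
    return {
        "per_replica": per_replica,
        "fleet": per_replica * max_containers,
        # Upper bound if Modal admitted enough requests to fill every SGLang slot.
        "sglang_ceiling": max_running_requests * max_containers,
    }


caps = effective_slots(max_containers=3, max_inputs=20, max_running_requests=48)
print(caps)
```

Under that assumption the listed defaults admit 20 requests per replica, so the 48-slot SGLang cap would only be reached if max_inputs were raised to match. Treat this as a prompt to check your own settings rather than a statement about the reference deployment.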

_build_sglang_cmd()

Flag	Default	Effect
--mem-fraction-static	0.88	Fraction of GPU memory reserved for static allocation (model weights plus the KV cache pool)
--max-running-requests	48	Cap on concurrent decodes; the EAGLE verifier queue grows past this
--max-prefill-tokens	32768	Prefill batch safety cap
--max-total-tokens	65536	Token capacity of the KV cache pool, shared by all running requests (input + output)
--watchdog-timeout	1200	20 min; must exceed the multi-minute weight load
--enable-metrics	True	Expose /metrics (Prometheus)
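
To show how these values could be assembled into a launch command, here is a minimal sketch. It is not the reference `_build_sglang_cmd()`: the function name `build_sglang_cmd`, its keyword arguments, and the `MODEL_PATH` mount point are assumptions for illustration, and the parallelism, EAGLE, and backend flags are deliberately left out.

```python
import sys

# Assumed mount point for the weights volume; adjust to your deployment.
MODEL_PATH = "/model-cache/GLM-5.1-FP8"


def build_sglang_cmd(
    mem_fraction_static: float = 0.88,
    max_running_requests: int = 48,
    max_prefill_tokens: int = 32768,
    max_total_tokens: int = 65536,
    watchdog_timeout: int = 1200,
    enable_metrics: bool = True,
) -> list[str]:
    """Return an argv list for the SGLang server using the tuning flags above."""
    cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--mem-fraction-static", str(mem_fraction_static),
        "--max-running-requests", str(max_running_requests),
        "--max-prefill-tokens", str(max_prefill_tokens),
        "--max-total-tokens", str(max_total_tokens),
        "--watchdog-timeout", str(watchdog_timeout),
    ]
    # Boolean flags are passed without a value.
    if enable_metrics:
        cmd.append("--enable-metrics")
    return cmd


print(" ".join(build_sglang_cmd()))
```

Keeping every tunable as a keyword argument makes it easy to try, for example, `build_sglang_cmd(mem_fraction_static=0.85)` without editing the command string by hand.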

EAGLE v2 speculative decoding

GLM-5.1 ships a multi-token prediction head. Draft quality is high enough that effective decode throughput rises sharply without a second model process.

Metric	Autoregressive	EAGLE v2	Delta
TPOT	~20 ms	~7.7 ms	~2.6× faster
Aggregate decode tok/s	~1,750	~4,600+	~2.6×

Trade-off: under very high concurrency, EAGLE verification can lengthen TTFT. The --max-running-requests 48 cap is the guardrail.

sglang flags

--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Why BF16 KV cache with FP8 weights

Weights stay FP8 for capacity and math throughput, but the KV cache remains BF16 (deliberate, not an oversight).

Stability: EAGLE + FP8 KV has a known crash path on Blackwell (SGLang #22359).
Accuracy: Some flash MLA KV paths regressed quality on B200 (#21291); TRT-LLM NSA backends pair cleanly with BF16 KV.
Speed: FP8 KV can be slower than BF16 once quant/dequant overhead is included (#17526).

Practically: omit --kv-cache-dtype fp8_e4m3 so SGLang keeps BF16 KV defaults. Full flag matrix: Configuration & Flags.
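
One way to keep an FP8 KV setting from creeping back in is a small pre-launch check on the argument list. This is a sketch only; `kv_cache_dtype_problems` is a hypothetical helper, not something in deploy.py.

```python
def kv_cache_dtype_problems(cmd: list[str]) -> list[str]:
    """Return a message for every FP8 --kv-cache-dtype setting found in cmd."""
    problems = []
    for i, arg in enumerate(cmd):
        # Handle both "--flag value" and "--flag=value" spellings.
        if arg == "--kv-cache-dtype":
            value = cmd[i + 1] if i + 1 < len(cmd) else ""
        elif arg.startswith("--kv-cache-dtype="):
            value = arg.split("=", 1)[1]
        else:
            continue
        if value.lower().startswith("fp8"):
            problems.append(
                f"--kv-cache-dtype {value}: FP8 KV is avoided here; omit the flag to keep BF16"
            )
    return problems


print(kv_cache_dtype_problems(["--speculative-algorithm", "EAGLE"]))
print(kv_cache_dtype_problems(["--kv-cache-dtype", "fp8_e4m3"]))
```

Calling it just before the subprocess is spawned, and refusing to start if the list is non-empty, turns the convention into an enforced rule.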

Concurrency & memory

If you see OOM under load, lower --mem-fraction-static toward 0.85 before raising --max-running-requests.

--mem-fraction-static 0.88 hands an aggressive share of each GPU to weights plus KV cache, leaving limited headroom for activations and CUDA graphs. For batch-heavy workloads you can experiment above 48 running requests; watch TTFT percentiles when you do.
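
A rough worked example of what the 0.88 → 0.85 step trades away. The numbers rest on stated assumptions: --mem-fraction-static covers weights plus the KV pool (SGLang's documented meaning), the ~700 GB of weights split evenly across 8 GPUs, and 192 GB per B200. Real usage also includes buffers this sketch ignores.

```python
def kv_pool_gb(vram_gb: float, mem_fraction_static: float, weights_gb_per_gpu: float) -> float:
    """KV pool per GPU = static allocation minus the weights it must hold."""
    return vram_gb * mem_fraction_static - weights_gb_per_gpu


VRAM_GB = 192.0              # per-GPU figure used elsewhere on this page
WEIGHTS_PER_GPU = 700.0 / 8  # assumes an even split: 87.5 GB per GPU

for frac in (0.88, 0.85):
    pool = kv_pool_gb(VRAM_GB, frac, WEIGHTS_PER_GPU)
    headroom = VRAM_GB * (1 - frac)
    print(f"fraction={frac}: KV pool ~{pool:.1f} GB, dynamic headroom ~{headroom:.1f} GB")
```

Under these assumptions the step costs roughly 6 GB of KV pool per GPU and returns the same amount as headroom for activations, which is why it is the first lever to try on OOM.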

Stability: watchdog & crash monitor

Weight staging from a 700 GB volume can exceed default SGLang watchdog windows.

Production deploy.py sets --watchdog-timeout 1200. After startup, a background thread calls os._exit(1) if the SGLang child exits so Modal replaces the container instead of serving endless 502s.
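
The crash monitor can be sketched as a daemon thread that blocks on the child process and hard-exits the container when it returns. This is an illustration of the pattern, not the reference code: `start_crash_monitor` and its `on_exit` parameter are assumed names, and `on_exit` exists only so the demo can record the call instead of killing the interpreter.

```python
import os
import subprocess
import sys
import threading


def start_crash_monitor(proc: subprocess.Popen, on_exit=os._exit) -> threading.Thread:
    """Watch proc in the background; call on_exit(1) as soon as it terminates."""

    def _watch() -> None:
        code = proc.wait()  # blocks until the child exits
        print(f"[monitor] FATAL: SGLang died (code {code})", flush=True)
        # os._exit skips cleanup handlers, so the container dies immediately
        # and the platform can replace it.
        on_exit(1)

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return thread


# Demo with a short-lived child and a recording callback instead of os._exit.
exit_codes = []
child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])
start_crash_monitor(child, on_exit=exit_codes.append).join(timeout=30)
print(exit_codes)
```

In a real container you would pass the SGLang `Popen` handle and leave `on_exit` at its default.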

Cold start mitigation

Warmup hits diverse chat paths; optional Modal cron can ping /health during business hours.

After /health succeeds, _warmup() issues four diverse chat completions to capture CUDA graphs. Optionally enable ENABLE_KEEPALIVE_CRON and set DEPLOYED_URL.

deploy.py

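# deploy.py (excerpt) - optional business-hours keep-alive
if ENABLE_KEEPALIVE_CRON:
    # Fires at minutes 0/14/28/42/56 of hours 06-22, Mon-Fri: always inside the 900 s scaledown_window
    @app.function(schedule=modal.Cron("*/14 6-22 * * MON-FRI"))
    def keep_warm():
        import requests  # imported here so only the container image needs it

        requests.get(f"{DEPLOYED_URL}/health", timeout=120)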

Cold start phases (reminder)

Phase	Duration	Notes
Modal provisions 8×B200	~30 s	Allocation + container spin-up
Weight load from Volume	3–5 min	~700 GB at volume throughput
DeepGEMM kernel load	Instant	When pre-compiled cache is present
CUDA graph capture	2–3 min	Warmup requests after /health OK
Warmup (4 requests)	1–2 min	Chat, no-thinking, long context, tools
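
The four warmup shapes (chat, no-thinking, long context, tools) could be built along these lines. This is a sketch, not the reference `_warmup()`: the prompt text, the `get_weather` tool, the `max_tokens` value, and the use of `chat_template_kwargs` to disable thinking are all assumptions, so check them against your SGLang version.

```python
import json
import urllib.request


def build_warmup_payloads(model: str = "glm-5.1") -> list[dict]:
    """Return four chat-completion bodies covering the warmup paths."""
    base = {"model": model, "max_tokens": 32}
    long_prompt = "Summarize the following text. " + "lorem ipsum " * 2000
    weather_tool = {  # hypothetical tool, only there to exercise the tool-call path
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return [
        {**base, "messages": [{"role": "user", "content": "Hi"}]},
        {**base, "messages": [{"role": "user", "content": "Hi"}],
         "chat_template_kwargs": {"enable_thinking": False}},  # assumed request field
        {**base, "messages": [{"role": "user", "content": long_prompt}]},
        {**base, "messages": [{"role": "user", "content": "Weather in Paris?"}],
         "tools": [weather_tool]},
    ]


def send(base_url: str, payload: dict, api_key: str, timeout: float = 300.0) -> dict:
    """POST one payload to the OpenAI-compatible endpoint and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


payloads = build_warmup_payloads()
print([sorted(p.keys()) for p in payloads])
```

Sending each payload once with `send(...)` after /health returns 200 exercises the same request shapes that production traffic is expected to use.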

Cost strategies

Strategy	Monthly (illustrative)	Cold profile	Best for
Scale-to-zero (~8h/day traffic)	~$12,000	6–10 min after ~15 min idle	Dev, internal tools, predictable daytime usage
Business-hours keep-alive cron	~$18,000	None during hours; overnight/weekend cold possible	Production APIs with daytime SLA
Always-on (min_containers=1)	~$36,000	None	24/7 low-latency SaaS
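
The three illustrative figures are consistent with one implied hourly rate, which makes it easy to re-run the arithmetic for your own traffic pattern. The rate below is derived from the always-on row, not quoted from a price list, and the sketch ignores cold-start minutes and extra replicas.

```python
HOURS_PER_MONTH = 24 * 30          # 30-day month for round numbers
ALWAYS_ON_MONTHLY = 36_000         # illustrative figure from the table

# Implied cost of one 8×B200 replica per hour.
hourly_rate = ALWAYS_ON_MONTHLY / HOURS_PER_MONTH


def monthly_cost(hours_per_day: float, days_per_month: float) -> float:
    """Cost of keeping one replica up for the given schedule."""
    return hourly_rate * hours_per_day * days_per_month


scale_to_zero = monthly_cost(8, 30)            # ~8 h/day of traffic, every day
weekdays_per_month = 30 * 5 / 7                # about 21.4 weekdays
business_hours = monthly_cost(17, weekdays_per_month)  # cron window 06-22 spans 17 hours

print(f"rate ${hourly_rate:.0f}/h, scale-to-zero ${scale_to_zero:,.0f}, "
      f"business hours ${business_hours:,.0f}")
```

The business-hours result lands slightly above $18,000, which matches the table's rounded figure.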

Tuning triangle

Dimension	Our setting	Effect
Latency	EAGLE + max-running-requests 48	Sub-300 ms TTFT target under moderate load; ~8 ms TPOT class
Throughput	mem-fraction 0.88 + BF16 KV	High aggregate tok/s across slots
Stability	watchdog 1200 + crash monitor + volume reload	Clean recycle on failure; consistent volume view

Performance baselines

Expected performance on 8×B200 with EAGLE enabled. Use these as reference points when benchmarking your deployment.

Metric	8×B200 Value	Condition	Notes
Time to First Token (TTFT)	~246 ms	Warm, low concurrency, pre-captured graphs	Inflates under high concurrency due to EAGLE verify queue
Time Per Output Token (TPOT)	~7.7 ms	EAGLE enabled, low concurrency	~20 ms without EAGLE (2.6× slower)
Decode throughput (per user)	30-75 tok/s	Varies by concurrency and output length	Higher at low concurrency, lower when batched
Aggregate throughput	~4,600+ tok/s	All slots active across 8 GPUs	Combined across --max-running-requests slots
EAGLE accept length	~3.5 tokens	Typical drafting acceptance	Consistent across hardware types
Max concurrent requests	48 per replica	--max-running-requests 48	Cap prevents TTFT inflation at high load
Cold start total	6-10 min	Pre-compiled DeepGEMM, from scale-to-zero	15+ min without pre-compiled kernels

Upstream mitigations (summary)

Bug	Impact	Mitigation
SGLang #22359	EAGLE + FP8 KV crash	BF16 KV cache (omit FP8 KV dtype)
SGLang #21291	flashmla_kv decode accuracy on B200	TRT-LLM NSA backends (decode + prefill)
SGLang #17526	FP8 KV slower than BF16	BF16 KV cache
SGLang #19796	EAGLE NaN on radix (sm120)	B200 is sm100 - not affected

Operations: compile triage, checklists, diagnostics, and upgrades.

Compilation & runtime warnings

Most warnings during modal run deploy.py::compile_deepgemm are benign. Use Find on this page to match log lines.

#	Severity	Message	Action
1	informational	FastAPI ORJSONResponse deprecation (SGLang internal)	Ignore - upstream; no functional impact
2	informational	Generation flags like top_p reported invalid during compile warmup	Ignore - compile-mode artifact
3	informational	Unexpected error during package walk: cutlass.cute.experimental	Ignore - FlashInfer autotuner noise; autotuning still completes
4	informational	torch.Tensor return type deprecated (flashinfer.jit)	Monitor - works today; upgrade when SGLang bumps FlashInfer
5	informational	Leaked semaphore / shared_memory on multi-process shutdown	Ignore - normal Python cleanup noise after TP workers exit
6	informational	Gloo Rank 0 connected to 0 peer ranks	Ignore - NCCL used for GPU comm; Gloo for local control groups
7	monitor	DeepGEMM enabled but scale_fmt of checkpoint is not ue8m0	Watch output quality; typical for E4M3 FP8 on Blackwell
8	informational	KV cache dtype set to fp8_e4m3 during compile_deep_gemm on SM10	OK for compile step - serving uses BF16 KV per deploy config
9	informational	FP8 KV cache with no scaling factors - defaulting to 1.0	OK during compile; serving avoids FP8 KV
10	informational	Force NSA prefill to sparse MLA (MHA_ONE_SHOT disabled) on Blackwell	Expected - TRT-LLM sparse MLA path for GLM-5.1 on B200
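
If you would rather triage a saved log than search the page by hand, the table can be turned into a small substring matcher. This is a sketch: the needles are fragments taken from the Message column, the real log lines may be worded differently, and only four of the ten rows are shown.

```python
# (needle, severity, action) tuples; needles are fragments of the messages above.
KNOWN_WARNINGS = [
    ("ORJSONResponse", "informational", "Ignore: upstream; no functional impact"),
    ("cutlass.cute.experimental", "informational", "Ignore: FlashInfer autotuner noise"),
    ("connected to 0 peer ranks", "informational", "Ignore: Gloo is for local control groups"),
    ("scale_fmt of checkpoint is not ue8m0", "monitor", "Watch output quality"),
]


def triage(line: str) -> tuple[str, str]:
    """Return (severity, action) for a log line, or ('unknown', ...) if unmatched."""
    for needle, severity, action in KNOWN_WARNINGS:
        if needle in line:
            return severity, action
    return "unknown", "Review manually"


# Sample lines written for illustration, not copied from a real run.
samples = [
    "WARNING DeepGEMM enabled but scale_fmt of checkpoint is not ue8m0",
    "Gloo Rank 0 connected to 0 peer ranks",
    "RuntimeError: something new",
]
for line in samples:
    print(triage(line), "<-", line)
```

Anything that comes back as `unknown` is the part of the log worth reading closely.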

Deployment review checklist

Independent audit items; several are already addressed in the reference deploy.py.

Severity	Issue	Impact	Mitigation / fix
Critical	serve() vs startup() ordering with @modal.web_server	Race: traffic routed before port listens	Align with Modal large-model pattern (subprocess in serve or experimental http_server)
Critical	No subprocess stdout/stderr capture on crash	Silent failures in Modal logs	Pipe stdout/stderr and stream in a background thread (deploy.py adds log streaming)
Significant	region= on @app.cls may be invalid	Deployment region not guaranteed	Use supported regional APIs per Modal docs
Significant	Default watchdog too short for 700GB load	Intermittent startup kills mid load	Set --watchdog-timeout 1200 in _build_sglang_cmd (present in reference deploy.py)
Significant	No crash detection while serving	Stale container after SGLang exit	Crash monitor thread → os._exit(1) (reference deploy.py)
Significant	Fragile download idempotency (directory file count)	Partial downloads mistaken as complete	Check sentinel / weight shards explicitly
Significant	Missing volume.reload() on server startup	Stale volume view across containers	Reload model + DeepGEMM volumes in startup path
Optimization	compile_deepgemm uses 8×B200	Higher $/hr during one-time compile	Accept for SM-specific kernels; or explore supported cheaper GPU if shapes allow
Optimization	Consider modal.experimental.http_server	Latency / lifecycle handling for huge models	Evaluate vs @modal.web_server for your Modal SDK version
Optimization	Radix cache overhead for single-shot API traffic	KV memory headroom	Consider --disable-radix-cache if workload is mostly single-turn
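
For the download-idempotency item, an explicit check could look like the following sketch. It assumes the checkpoint ships a Hugging Face style `model.safetensors.index.json` and that the download step writes a marker file last; the `.download-complete` name and `download_is_complete` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path

SENTINEL = ".download-complete"  # hypothetical marker written after the last shard


def download_is_complete(model_dir: Path) -> bool:
    """True only if the sentinel exists and every shard named in the index is present."""
    if not (model_dir / SENTINEL).is_file():
        return False
    index_path = model_dir / "model.safetensors.index.json"
    if not index_path.is_file():
        return False
    weight_map = json.loads(index_path.read_text())["weight_map"]
    shards = set(weight_map.values())  # many tensors map to the same shard file
    return all(
        (model_dir / name).is_file() and (model_dir / name).stat().st_size > 0
        for name in shards
    )


# Demo: a partial download is rejected, a finished one is accepted.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    index = {"weight_map": {"a": "shard-1.safetensors", "b": "shard-2.safetensors"}}
    (d / "model.safetensors.index.json").write_text(json.dumps(index))
    (d / "shard-1.safetensors").write_bytes(b"x")
    partial = download_is_complete(d)      # shard 2 and sentinel still missing
    (d / "shard-2.safetensors").write_bytes(b"x")
    (d / SENTINEL).touch()
    complete = download_is_complete(d)

print(partial, complete)
```

Unlike a directory file count, this fails closed when a shard is missing or empty, so an interrupted download is retried rather than served.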

Upstream bug mitigations

Config levers tie back to the flag tables on Configuration.

Full flag context: Configuration & Flags.

Issue	Symptom	Mitigation	Config lever
SGLang #22359	EAGLE + FP8 KV crash	BF16 KV cache (omit FP8 KV dtype)	Omit --kv-cache-dtype fp8
SGLang #21291	flashmla_kv decode accuracy on B200	TRT-LLM NSA backends (decode + prefill)	--nsa-*-backend trtllm
SGLang #17526	FP8 KV slower than BF16	BF16 KV cache	Omit --kv-cache-dtype fp8
SGLang #19796	EAGLE NaN on radix (sm120)	B200 is sm100 - not affected	N/A (sm100)

Runtime diagnostics

After deploy, these endpoints should respond from the same host that serves chat completions.

diagnostics.sh

# Readiness (200 only after weights + warmup path completes)
curl -f https://<your-app>.modal.run/health

# Prometheus text exposition
curl https://<your-app>.modal.run/metrics

# Modal platform logs
modal app logs glm-5.1-production

Health check sequence

Step-by-step verification that your deployment is working correctly.

Check Modal app status

modal app list | grep glm-5.1-production

Expected: Shows running app with endpoint URL

If fails: Run modal deploy deploy.py to create the app

Verify health endpoint

curl -f https://<your-app>.modal.run/health

Expected: HTTP 200 (may take 6-10 min on cold start)

If fails: Check modal app logs for startup errors

Test basic inference

curl -X POST https://<your-app>.modal.run/v1/chat/completions -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" -d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'

Expected: JSON response with assistant message

If fails: Check API_KEY secret is set correctly

Verify Prometheus metrics

curl https://<your-app>.modal.run/metrics | head -20

Expected: Prometheus text format with sglang_* metrics

If fails: Ensure --enable-metrics is in launch command
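
For scripted checks, for example in CI after a deploy, the health step can be wrapped in a polling loop that tolerates a cold start. This is a sketch with assumed names (`http_status`, `wait_for_health`); the 900 s deadline and 10 s interval are placeholders to tune, and `fetch` and `sleep` are injectable so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except OSError:
        return 0         # connection refused, DNS failure, timeout, ...


def wait_for_health(url: str, deadline_s: float = 900.0, interval_s: float = 10.0,
                    fetch=http_status, sleep=time.sleep) -> int:
    """Poll url until it returns 200; return the attempt number that succeeded."""
    start = time.monotonic()
    attempt = 0
    while time.monotonic() - start < deadline_s:
        attempt += 1
        if fetch(url) == 200:
            print(f"[health] passed on attempt {attempt}")
            return attempt
        sleep(interval_s)
    raise TimeoutError(f"{url} not healthy after {deadline_s:.0f}s ({attempt} attempts)")


# Offline demo: two failures, then success, with sleeping disabled.
fake_responses = iter([503, 0, 200])
attempts = wait_for_health("https://example.invalid/health",
                           fetch=lambda _url: next(fake_responses),
                           sleep=lambda _s: None)
```

Against a real deployment, call `wait_for_health("https://<your-app>.modal.run/health")` and proceed to the inference test only once it returns.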

Log triage

Pattern	Meaning	Action
[health] passed on attempt 1	Fast path - hot volumes / graphs	None
[health] passed on attempt 15+	Slow weight load or graph capture	If near 900s timeout, check volume I/O; consider longer startup_timeout
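[monitor] FATAL: SGLang died (code -9)	SIGKILL as reported by Python's subprocess (negative signal number); commonly the OOM killer	Lower --mem-fraction-static (e.g. 0.88 → 0.85)
[monitor] FATAL: SGLang died (code 137)	SIGKILL in shell convention (128 + 9); often a platform kill	Review Modal timeout, manual stops, scaledown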
DeepGEMM / compile success	Kernels cached for future cold starts	None - verify marker file exists
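
Exit codes in the monitor lines follow two conventions: Python's subprocess reports a signal death as the negative signal number, while shells report 128 plus the signal number. A small decoder makes both readable; `describe_exit` is a hypothetical helper, and the signal names assume a POSIX host.

```python
import signal


def describe_exit(code: int) -> str:
    """Translate a process exit code into a short human-readable description."""
    if code == 0:
        return "clean exit"
    if code < 0:
        signum = -code          # subprocess.Popen.returncode convention
    elif code > 128:
        signum = code - 128     # shell convention
    else:
        return f"error exit ({code})"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    return f"killed by {name}"


for code in (-9, 137, 1, 0):
    print(code, "->", describe_exit(code))
```

Both -9 and 137 decode to SIGKILL, so the exit code alone does not say who sent the signal; pair it with memory metrics and Modal's platform logs before deciding between an OOM fix and a timeout fix.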

Troubleshooting: reasoning_content is empty even though thinking should be enabled

Likely cause: Missing reasoning parser, or the request explicitly disabled thinking.

Diagnosis: Check for --reasoning-parser glm45 in the launch command. Check whether the request set enable_thinking: false.

Resolution: Add --reasoning-parser glm45 to _build_sglang_cmd. Thinking is enabled by default when the parser is present.

Cold start remediation

Business-hours keep-alive: set ENABLE_KEEPALIVE_CRON = True, fill DEPLOYED_URL, redeploy.
Longer idle hold: raise scaledown_window beyond 900s if bursts are wider than 15 minutes.
Always warm: set min_containers=1 to eliminate cold starts at the cost of baseline GPU spend (see cost strategies above).

Upgrading SGLang or switching GPUs

Bump SGLANG_IMAGE or GPU_TYPE in deploy.py: Align the image tag with the SGLang release you validated.
Invalidate the DeepGEMM cache on GPU architecture change: Remove .compiled-GLM-5.1-FP8 from glm51-deepgemm-cache, then re-run compile_deepgemm on the new SKU.
Tighten --mem-fraction-static if VRAM per GPU drops: For example, B200 192 GB → H200 141 GB; try 0.88 → 0.83 after testing KV headroom.

upgrade.sh

# Example: drop DeepGEMM marker to force recompile on new SM
modal volume rm glm51-deepgemm-cache .compiled-GLM-5.1-FP8
modal run deploy.py::compile_deepgemm

Related sections

Overview & Architecture · Code walkthrough · Configuration · Deployment
