GLM-5.1 FP8 on Modal

Tune & Operate

Performance tuning, cold starts, diagnostics, and warning triage.

Tuning goal: maximize throughput on 8×B200 while keeping TTFT under ~300 ms for interactive workloads. This page also covers compile-warning triage, runtime diagnostics, and cold-start remediation; it is the operational complement to the deployment pipeline.

01

Modal & SGLang parameters

Replica counts and @modal.concurrent interact with SGLang batching; adjust them together, not in isolation.

@app.cls scaling

  • min_containers
    Default: 0

    Scale to zero when idle

  • max_containers
    Default: 3

    Max concurrent replicas (3 × 48 = 144 in-flight decode slots)

  • scaledown_window
    Default: 900

    Seconds idle before Modal scales down

@modal.concurrent

  • @modal.concurrent max_inputs
    Default: 20

    Modal-level queue per container before spill to new replica

_build_sglang_cmd()

  • --mem-fraction-static
    Default: 0.88

    Fraction of each GPU's memory reserved for static allocation (weights + KV cache pool)

  • --max-running-requests
    Default: 48

    Cap concurrent decodes - EAGLE verifier queue grows past this

  • --max-prefill-tokens
    Default: 32768

    Prefill batch safety cap

  • --max-total-tokens
    Default: 65536

    Total tokens in the KV memory pool, shared by all running requests (not a per-request budget)

  • --watchdog-timeout
    Default: 1200

    20 min - must exceed multi-minute weight load

  • --enable-metrics
    Default: True

    Expose /metrics (Prometheus)
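The defaults above could be assembled into a launch command roughly as follows. This is a minimal sketch, not the reference _build_sglang_cmd(): the function name build_sglang_cmd, the model_path argument, and the python -m sglang.launch_server entry point with --model-path are assumptions for illustration. Only the tuning flags and their default values come from the list above.

```python
from typing import List


def build_sglang_cmd(
    model_path: str,
    mem_fraction_static: float = 0.88,
    max_running_requests: int = 48,
    max_prefill_tokens: int = 32768,
    max_total_tokens: int = 65536,
    watchdog_timeout: int = 1200,
    enable_metrics: bool = True,
) -> List[str]:
    """Return an argv list carrying the tuning flags listed above."""
    cmd = [
        "python", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,              # assumed; not part of the list above
        "--mem-fraction-static", str(mem_fraction_static),
        "--max-running-requests", str(max_running_requests),
        "--max-prefill-tokens", str(max_prefill_tokens),
        "--max-total-tokens", str(max_total_tokens),
        "--watchdog-timeout", str(watchdog_timeout),
    ]
    if enable_metrics:
        cmd.append("--enable-metrics")  # boolean switch: present or absent
    return cmd


cmd = build_sglang_cmd("/model-cache/GLM-5.1-FP8")
print(" ".join(cmd))
```

Keeping every knob as a keyword argument makes it easy to change --mem-fraction-static and --max-running-requests together in one call rather than editing a long string.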

02

EAGLE v2 speculative decoding

GLM-5.1 ships a multi-token prediction head. Draft quality is high enough that effective decode throughput rises sharply without a second model process.

Metric | Autoregressive | EAGLE v2 | Delta
TPOT | ~20 ms | ~7.7 ms | ~2.6× faster
Aggregate decode tok/s | ~1,750 | ~4,600+ | ~2.6×

Trade-off: under very high concurrency, EAGLE verification can lengthen TTFT. The --max-running-requests 48 cap is the guardrail.
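The Delta figures follow directly from the two measured columns. The short check below only re-derives the ratios from the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the comparison above (approximate measurements).
tpot_autoregressive_ms = 20.0
tpot_eagle_ms = 7.7
aggregate_autoregressive_tok_s = 1750
aggregate_eagle_tok_s = 4600

# Lower TPOT is better, so the speedup is old / new.
tpot_speedup = tpot_autoregressive_ms / tpot_eagle_ms
# Higher throughput is better, so the gain is new / old.
throughput_gain = aggregate_eagle_tok_s / aggregate_autoregressive_tok_s

print(f"TPOT speedup: {tpot_speedup:.1f}x")        # TPOT speedup: 2.6x
print(f"Throughput gain: {throughput_gain:.1f}x")  # Throughput gain: 2.6x
```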

sglang flags
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

03

Why BF16 KV cache with FP8 weights

Weights stay FP8 for capacity and math throughput, but the KV cache deliberately remains BF16, for three reasons:

  • Stability: EAGLE + FP8 KV has a known crash path on Blackwell (SGLang #22359).
  • Accuracy: Some flash MLA KV paths regressed quality on B200 (#21291); TRT-LLM NSA backends pair cleanly with BF16 KV.
  • Speed: FP8 KV can be slower than BF16 once quant/dequant overhead is included (#17526).

Practically: omit --kv-cache-dtype fp8_e4m3 so SGLang keeps BF16 KV defaults. Full flag matrix: Configuration & Flags.

04

Concurrency & memory

If you see OOM under load, drop mem-fraction toward 0.85 before raising max-running-requests.

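--mem-fraction-static 0.88 is aggressive: it hands 88% of each GPU's memory to weights plus the KV pool, leaving about 12% for activations, CUDA graphs, and other runtime buffers. For batch-heavy workloads you can experiment above 48 running requests; watch TTFT percentiles when you do.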
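A rough budget shows what moving from 0.88 to 0.85 trades away. This is a back-of-the-envelope sketch under stated assumptions: --mem-fraction-static is taken as the fraction of each GPU's total memory given to weights plus the KV pool, the ~700 GB checkpoint is assumed to shard evenly across 8 GPUs at roughly its on-disk size, and 192 GB per B200 is the figure used in the upgrade notes on this page. The function name is hypothetical.

```python
def kv_pool_gb_per_gpu(
    vram_gb: float,
    mem_fraction_static: float,
    weights_total_gb: float,
    num_gpus: int,
) -> float:
    """Approximate KV pool per GPU: static budget minus that GPU's weight shard."""
    static_budget_gb = vram_gb * mem_fraction_static
    weights_per_gpu_gb = weights_total_gb / num_gpus  # assumes even sharding
    return static_budget_gb - weights_per_gpu_gb


for fraction in (0.88, 0.85):
    pool = kv_pool_gb_per_gpu(192, fraction, 700, 8)
    headroom = 192 * (1 - fraction)  # memory left outside the static budget
    print(
        f"mem-fraction-static {fraction}: "
        f"~{pool:.1f} GB KV pool, ~{headroom:.1f} GB headroom per GPU"
    )
```

Under these assumptions the KV pool shrinks from roughly 81 GB to roughly 76 GB per GPU, while non-static headroom grows from about 23 GB to about 29 GB. Measure on your own deployment before relying on either number.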

05

Stability: watchdog & crash monitor

Weight staging from a 700 GB volume can exceed default SGLang watchdog windows.

Production deploy.py sets --watchdog-timeout 1200. After startup, a background thread calls os._exit(1) if the SGLang child exits so Modal replaces the container instead of serving endless 502s.
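The crash monitor described above can be sketched as a daemon thread that blocks on the child process and hard-exits the container when it dies. This is an illustration, not the reference implementation: start_crash_monitor and the injectable on_exit callback are hypothetical names, and the callback exists only so the behaviour can be exercised without killing the interpreter.

```python
import os
import subprocess
import sys
import threading
from typing import Callable


def _hard_exit(code: int) -> None:
    """Terminate the whole container process immediately, skipping cleanup."""
    os._exit(1)


def start_crash_monitor(
    proc: subprocess.Popen,
    on_exit: Callable[[int], None] = _hard_exit,
) -> threading.Thread:
    """Watch `proc` in the background and call `on_exit` once it terminates."""

    def _watch() -> None:
        code = proc.wait()  # blocks until the SGLang child exits
        print(f"[monitor] FATAL: SGLang died (code {code})", flush=True)
        on_exit(code)

    thread = threading.Thread(target=_watch, name="sglang-crash-monitor", daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    # Demo with a short-lived child and a harmless callback instead of os._exit.
    seen = []
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    start_crash_monitor(child, on_exit=seen.append).join()
    print(seen)  # [3]
```

os._exit is used rather than sys.exit because sys.exit raised inside a non-main thread only ends that thread; the container would keep running.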

06

Cold start mitigation

Warmup hits diverse chat paths; optional Modal cron can ping /health during business hours.

After /health succeeds, _warmup() issues four diverse chat completions to capture CUDA graphs. Optionally enable ENABLE_KEEPALIVE_CRON and set DEPLOYED_URL.

deploy.py
# deploy.py (excerpt) - optional business-hours keep-alive
if ENABLE_KEEPALIVE_CRON:
    @app.function(schedule=modal.Cron("*/14 6-22 * * MON-FRI"))
    def keep_warm():
        requests.get(f"{DEPLOYED_URL}/health", timeout=120)

07

Cold start phases (reminder)

Modal provisions 8×B200
~30s

Allocation + container spin-up

Weight load from Volume
3–5 min

~700 GB at volume throughput

DeepGEMM kernel load
Instant

When pre-compiled cache is present

CUDA graph capture
2–3 min

Warmup requests after /health OK

Warmup (4 requests)
1–2 min

Chat, no-thinking, long context, tools
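The four warmup requests can be sketched as one payload per path named above. Everything here is illustrative rather than the reference _warmup(): the helper names, the prompt text, the get_time tool, and the use of chat_template_kwargs to disable thinking are assumptions, and the sketch sends no Authorization header on the assumption that warmup runs against the local server.

```python
import json
import urllib.request
from typing import Callable, Dict, List


def build_warmup_payloads(model: str = "glm-5.1") -> List[Dict]:
    """One request per path: chat, no-thinking, long context, tools."""
    base = {"model": model, "max_tokens": 16}
    tool = {
        "type": "function",
        "function": {
            "name": "get_time",  # hypothetical tool, for illustration only
            "description": "Return the current time.",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }
    return [
        {**base, "messages": [{"role": "user", "content": "Hi"}]},
        {
            **base,
            "messages": [{"role": "user", "content": "Reply with one word."}],
            "chat_template_kwargs": {"enable_thinking": False},  # assumed request field
        },
        {**base, "messages": [{"role": "user", "content": "Summarize: " + "lorem ipsum " * 2000}]},
        {**base, "messages": [{"role": "user", "content": "What time is it?"}], "tools": [tool]},
    ]


def post_json(url: str, payload: Dict, timeout: float = 300.0) -> int:
    """POST a JSON body and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_warmup(base_url: str, post: Callable[[str, Dict], int] = post_json) -> List[int]:
    """Send every warmup payload in order and collect the status codes."""
    url = f"{base_url}/v1/chat/completions"
    return [post(url, payload) for payload in build_warmup_payloads()]
```

Call run_warmup only after /health has returned 200; passing a stub as post makes the sequence testable without a live server.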

08

Cost strategies

Strategy | Monthly (illustrative) | Cold profile | Best for
Scale-to-zero (~8h/day traffic) | ~$12,000 | 6–10 min after ~15 min idle | Dev, internal tools, predictable daytime usage
Business-hours keep-alive cron | ~$18,000 | None during hours; overnight/weekend cold possible | Production APIs with daytime SLA
Always-on (min_containers=1) | ~$36,000 | None | 24/7 low-latency SaaS
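The illustrative monthly figures are simply GPU-hours times an hourly rate. The sketch below assumes a combined rate of about $50/hour for the 8×B200 replica, back-derived from the always-on figure rather than taken from a price list, and a 30-day month. The 17 active hours per weekday for the keep-alive cron is also an approximation of its 06:00 to 22:56 window.

```python
HOURLY_RATE_USD = 50.0  # assumption: back-derived from ~$36,000 / 720 h


def monthly_cost(
    hours_per_day: float,
    days_per_month: float,
    hourly_rate: float = HOURLY_RATE_USD,
) -> float:
    """Cost of keeping one replica up for the given hours on the given days."""
    return hours_per_day * days_per_month * hourly_rate


weekdays_per_month = 5 * 52 / 12  # about 21.7 weekdays in an average month

scale_to_zero = monthly_cost(8, 30)
business_hours = monthly_cost(17, weekdays_per_month)
always_on = monthly_cost(24, 30)

for label, cost in [
    ("scale-to-zero", scale_to_zero),
    ("business-hours keep-alive", business_hours),
    ("always-on", always_on),
]:
    print(f"{label}: ~${cost:,.0f}/month")
```

Swap in your actual rate and traffic hours; cold-start minutes billed during scale-up are not modelled here.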

09

Tuning triangle

Dimension | Our setting | Effect
Latency | EAGLE + max-running-requests 48 | Sub-300 ms TTFT target under moderate load; ~8 ms TPOT class
Throughput | mem-fraction 0.88 + BF16 KV | High aggregate tok/s across slots
Stability | watchdog 1200 + crash monitor + volume reload | Clean recycle on failure; consistent volume view

10

Performance baselines

Expected performance on 8×B200 with EAGLE enabled. Use these as reference points when benchmarking your deployment.

Metric | 8×B200 value | Condition | Notes
Time to First Token (TTFT) | ~246 ms | Warm, low concurrency, pre-captured graphs | Inflates under high concurrency due to EAGLE verify queue
Time Per Output Token (TPOT) | ~7.7 ms | EAGLE enabled, low concurrency | ~20 ms without EAGLE (2.6× slower)
Decode throughput (per user) | 30-75 tok/s | Varies by concurrency and output length | Higher at low concurrency, lower when batched
Aggregate throughput | ~4,600+ tok/s | All slots active across 8 GPUs | Combined across --max-running-requests slots
EAGLE accept length | ~3.5 tokens | Typical drafting acceptance | Consistent across hardware types
Max concurrent requests | 48 per replica | --max-running-requests 48 | Cap prevents TTFT inflation at high load
Cold start total | 6-10 min | Pre-compiled DeepGEMM, from scale-to-zero | 15+ min without pre-compiled kernels

11

Upstream mitigations (summary)

Bug | Impact | Mitigation
SGLang #22359 | EAGLE + FP8 KV crash | BF16 KV cache (omit FP8 KV dtype)
SGLang #21291 | flashmla_kv decode accuracy on B200 | TRT-LLM NSA backends (decode + prefill)
SGLang #17526 | FP8 KV slower than BF16 | BF16 KV cache
SGLang #19796 | EAGLE NaN on radix (sm120) | B200 is sm100 - not affected

Operations: compile triage, checklists, diagnostics, and upgrades.

12

Compilation & runtime warnings

Most warnings during modal run deploy.py::compile_deepgemm are benign. Search this page (Ctrl/Cmd+F) for the log line you are seeing.

#1 informational

FastAPI ORJSONResponse deprecation (SGLang internal)

Ignore - upstream; no functional impact

#2 informational

Generation flags like top_p reported invalid during compile warmup

Ignore - compile-mode artifact

#3 informational

Unexpected error during package walk: cutlass.cute.experimental

Ignore - FlashInfer autotuner noise; autotuning still completes

#4 informational

torch.Tensor return type deprecated (flashinfer.jit)

Monitor - works today; upgrade when SGLang bumps FlashInfer

#5 informational

Leaked semaphore / shared_memory on multi-process shutdown

Ignore - normal Python cleanup noise after TP workers exit

#6 informational

Gloo Rank 0 connected to 0 peer ranks

Ignore - NCCL used for GPU comm; Gloo for local control groups

#7 monitor

DeepGEMM enabled but scale_fmt of checkpoint is not ue8m0

Watch output quality; typical for E4M3 FP8 on Blackwell

#8 informational

KV cache dtype set to fp8_e4m3 during compile_deep_gemm on SM10

OK for compile step - serving uses BF16 KV per deploy config

#9 informational

FP8 KV cache with no scaling factors - defaulting to 1.0

OK during compile; serving avoids FP8 KV

#10 informational

Force NSA prefill to sparse MLA (MHA_ONE_SHOT disabled) on Blackwell

Expected - TRT-LLM sparse MLA path for GLM-5.1 on B200

13

Deployment review checklist

Independent audit items; several are already addressed in the reference deploy.py.

Critical

serve() vs startup() ordering with @modal.web_server

Race: traffic routed before port listens

Align with Modal large-model pattern (subprocess in serve or experimental http_server)

Critical

No subprocess stdout/stderr capture on crash

Silent failures in Modal logs

Pipe stdout/stderr and stream in a background thread (deploy.py adds log streaming)

Significant

region= on @app.cls may be invalid

Deployment region not guaranteed

Use supported regional APIs per Modal docs

Significant

Default watchdog too short for 700GB load

Intermittent startup kills mid load

Set --watchdog-timeout 1200 in _build_sglang_cmd (present in reference deploy.py)

Significant

No crash detection while serving

Stale container after SGLang exit

Crash monitor thread → os._exit(1) (reference deploy.py)

Significant

Fragile download idempotency (directory file count)

Partial downloads mistaken as complete

Check sentinel / weight shards explicitly

Significant

Missing volume.reload() on server startup

Stale volume view across containers

Reload model + DeepGEMM volumes in startup path

Optimization

compile_deepgemm uses 8×B200

Higher $/hr during one-time compile

Accept for SM-specific kernels; or explore supported cheaper GPU if shapes allow

Optimization

Consider modal.experimental.http_server

Latency / lifecycle handling for huge models

Evaluate vs @modal.web_server for your Modal SDK version

Optimization

Radix cache overhead for single-shot API traffic

KV memory headroom

Consider --disable-radix-cache if workload is mostly single-turn
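For the stdout/stderr capture item above, one common pattern is to merge both streams into a pipe and forward each line from a background thread so it reaches Modal's logs. The sketch below is a generic illustration with hypothetical names (stream_output, sink), not the log streaming code in the reference deploy.py.

```python
import subprocess
import sys
import threading
from typing import Callable, List, Tuple


def stream_output(
    cmd: List[str],
    sink: Callable[[str], None] = print,
) -> Tuple[subprocess.Popen, threading.Thread]:
    """Start `cmd` and forward its combined stdout/stderr to `sink`, line by line."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so crash tracebacks are not lost
        text=True,
        bufsize=1,                 # line-buffered in text mode
    )

    def _pump() -> None:
        assert proc.stdout is not None
        for line in proc.stdout:   # ends when the child closes its output
            sink(line.rstrip("\n"))

    thread = threading.Thread(target=_pump, name="sglang-log-pump", daemon=True)
    thread.start()
    return proc, thread


if __name__ == "__main__":
    child, pump = stream_output([sys.executable, "-c", "print('a'); print('b')"])
    child.wait()
    pump.join()
```

Draining the pipe continuously also prevents the child from blocking once the OS pipe buffer fills, which can otherwise look like a hang during weight load.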

14

Upstream bug mitigations

Each lever below ties back to the flag tables in Configuration & Flags.

SGLang #22359

EAGLE + FP8 KV crash

BF16 KV cache (omit FP8 KV dtype)

Lever: Omit --kv-cache-dtype fp8_e4m3

SGLang #21291

flashmla_kv decode accuracy on B200

TRT-LLM NSA backends (decode + prefill)

Lever: --nsa-*-backend trtllm

SGLang #17526

FP8 KV slower than BF16

BF16 KV cache

Lever: Omit --kv-cache-dtype fp8_e4m3

SGLang #19796

EAGLE NaN on radix (sm120)

B200 is sm100 - not affected

Lever: N/A (sm100)

15

Runtime diagnostics

After deploy, these endpoints should respond from the same host that serves chat completions.

diagnostics.sh
# Readiness (200 only after weights + warmup path completes)
curl -f https://<your-app>.modal.run/health

# Prometheus text exposition
curl https://<your-app>.modal.run/metrics

# Modal platform logs
modal app logs glm-5.1-production

16

Health check sequence

Step-by-step verification that your deployment is working correctly.

1

Check Modal app status

modal app list | grep glm-5.1-production

Expected: Shows running app with endpoint URL

If fails: Run modal deploy deploy.py to create the app

2

Verify health endpoint

curl -f https://<your-app>.modal.run/health

Expected: HTTP 200 (may take 6-10 min on cold start)

If fails: Check modal app logs for startup errors

3

Test basic inference

curl -X POST https://<your-app>.modal.run/v1/chat/completions -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" -d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'

Expected: JSON response with assistant message

If fails: Check API_KEY secret is set correctly

4

Verify Prometheus metrics

curl https://<your-app>.modal.run/metrics | head -20

Expected: Prometheus text format with sglang_* metrics

If fails: Ensure --enable-metrics is in launch command
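Step 4 can be automated by parsing the Prometheus text output and keeping only the SGLang series. The parser below is a minimal sketch: it assumes plain `name{labels} value` lines without trailing timestamps, and the metric name in the sample is hypothetical, chosen only to show the format. The real metric names exposed by your SGLang version may differ.

```python
from typing import Dict


def parse_metrics(text: str, prefix: str = "sglang") -> Dict[str, float]:
    """Return {series: value} for every sample whose metric name starts with `prefix`."""
    metrics: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")  # value is the last field
        name = series.split("{", 1)[0]            # drop the label set
        if not name.startswith(prefix):
            continue
        try:
            metrics[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


sample = """\
# HELP sglang_example_queue_depth Hypothetical metric, for illustration only
# TYPE sglang_example_queue_depth gauge
sglang_example_queue_depth{replica="0"} 3
process_cpu_seconds_total 12.5
"""

print(parse_metrics(sample))
```

An empty result from a live /metrics response suggests that --enable-metrics is missing or that the prefix does not match your SGLang version.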

17

Log triage

Pattern | Meaning | Action
[health] passed on attempt 1 | Fast path - hot volumes / graphs | None
[health] passed on attempt 15+ | Slow weight load or graph capture | If near 900s timeout, check volume I/O; consider longer startup_timeout
[monitor] FATAL: SGLang died (code -9) | Killed by SIGKILL (signal 9), typically the OOM killer | Lower --mem-fraction-static (e.g. 0.88 → 0.85)
[monitor] FATAL: SGLang died (code 137) | Same SIGKILL, reported shell-style as 128 + 9; typically a platform kill | Review Modal timeout, manual stops, scaledown
DeepGEMM / compile success | Kernels cached for future cold starts | None - verify marker file exists
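Exit codes -9 and 137 describe the same signal in two conventions: Python's subprocess reports a signal death as a negative return code, while shells report 128 plus the signal number. A small decoder, sketched here with a hypothetical describe_exit helper and assuming a POSIX host, makes monitor logs easier to read.

```python
import signal
from typing import Optional


def _signal_name(signum: int) -> Optional[str]:
    """Return the symbolic name for a signal number, or None if unknown."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None


def describe_exit(code: int) -> str:
    """Translate a child exit code into a human-readable description."""
    if code == 0:
        return "clean exit"
    if code < 0:
        # subprocess convention: -N means "terminated by signal N"
        name = _signal_name(-code) or f"signal {-code}"
        return f"killed by {name} (negative returncode from subprocess)"
    if code > 128:
        # shell convention: 128 + N; treated as a heuristic, not a guarantee
        name = _signal_name(code - 128)
        if name is not None:
            return f"killed by {name} (shell-style status 128 + signal)"
    return f"exited with status {code}"


for code in (-9, 137, 1):
    print(code, "->", describe_exit(code))
```

The signal alone does not say who sent it; correlate the timestamp with memory metrics and Modal events to tell an OOM kill from a platform stop.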

18

Troubleshooting scenarios

Common problems and their solutions. Use Ctrl/Cmd+F to search for your specific symptom.

Container returns 502 Bad Gateway after running fine for hours

Likely cause

SGLang subprocess crashed (OOM, CUDA error, NCCL timeout) but Modal container stayed alive

Diagnosis

Check modal app logs for [monitor] FATAL: SGLang died messages. Look for exit code -9 or 137: both mean the process was killed by SIGKILL, which is what the OOM killer and platform stops send.

Resolution

The crash monitor should trigger os._exit(1) to replace the container. If not present, add the _monitor_process thread to deploy.py.

Prevention tip

Lower --mem-fraction-static from 0.88 to 0.85 if OOM is recurring. Check for memory leaks in long sessions.

Cold start takes 15+ minutes instead of expected 6-10 minutes

Likely cause

DeepGEMM kernels are being JIT-compiled at startup instead of loading from cache

Diagnosis

Run modal run deploy.py::verify_setup and check if .compiled-GLM-5.1-FP8 marker exists in the DeepGEMM volume.

Resolution

Run modal run deploy.py::compile_deepgemm once on B200 to pre-compile kernels. Ensure dg_volume.commit() is called after compilation.

Prevention tip

Never change GPU_TYPE without recompiling DeepGEMM. The marker file includes the GPU type for verification.
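A startup guard can refuse to trust the cache when the marker is missing or was written for a different GPU. The sketch below assumes the marker is a small text file that contains the GPU type as a substring; the actual marker format is not shown on this page, so treat the content check as a placeholder. The function name is hypothetical.

```python
import tempfile
from pathlib import Path

MARKER_NAME = ".compiled-GLM-5.1-FP8"


def deepgemm_cache_ready(cache_dir: str, gpu_type: str) -> bool:
    """True if the compile marker exists and mentions the expected GPU type."""
    marker = Path(cache_dir) / MARKER_NAME
    if not marker.is_file():
        return False  # never compiled, or volume not reloaded yet
    # Assumed format: plain text that includes the GPU type somewhere.
    return gpu_type in marker.read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(deepgemm_cache_ready(tmp, "B200"))  # False: no marker yet
        (Path(tmp) / MARKER_NAME).write_text("gpu=B200\n", encoding="utf-8")
        print(deepgemm_cache_ready(tmp, "B200"))  # True
        print(deepgemm_cache_ready(tmp, "H200"))  # False: compiled for another GPU
```

Logging the result at startup makes a 15+ minute cold start immediately attributable to JIT compilation.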

FileNotFoundError: Model weights missing at /model-cache/GLM-5.1-FP8

Likely cause

Volume eventual consistency - server container started before download commit propagated

Diagnosis

Check if download_model completed successfully and called model_volume.commit().

Resolution

Add model_volume.reload() at the start of the Server.setup() method. This forces a metadata refresh from the central store.

Prevention tip

Always reload both volumes at startup. The reference deploy.py does this in @modal.enter().
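After the reload, a quick completeness check can fail fast with a useful message instead of a late FileNotFoundError. The sketch below assumes the checkpoint uses the Hugging Face sharded safetensors layout, where model.safetensors.index.json maps tensors to shard files. The .download-complete sentinel name is hypothetical; use whatever marker your download step writes.

```python
import json
from pathlib import Path
from typing import List

SENTINEL = ".download-complete"         # hypothetical marker written after commit
INDEX = "model.safetensors.index.json"  # Hugging Face shard index


def missing_weight_files(model_dir: str) -> List[str]:
    """List expected files that are absent; an empty list means the weights look complete."""
    root = Path(model_dir)
    missing: List[str] = []
    if not (root / SENTINEL).is_file():
        missing.append(SENTINEL)
    index_path = root / INDEX
    if not index_path.is_file():
        missing.append(INDEX)
        return missing  # cannot enumerate shards without the index
    weight_map = json.loads(index_path.read_text(encoding="utf-8"))["weight_map"]
    for shard in sorted(set(weight_map.values())):
        if not (root / shard).is_file():
            missing.append(shard)
    return missing
```

Run it against /model-cache/GLM-5.1-FP8 right after the volume reload and raise if the list is non-empty; this also replaces the fragile directory file count mentioned in the checklist.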

TTFT spikes to 2-3 seconds under load (was ~250ms at low concurrency)

Likely cause

EAGLE speculative decoding verification queue is backing up

Diagnosis

Check if concurrent requests exceed --max-running-requests (default 48). Look at /metrics for queue depth.

Resolution

Either scale max_containers from 3 to 4+, or reduce max-running-requests to 32-40 for lower TTFT variance.

Prevention tip

EAGLE trades some TTFT for massive TPOT improvement. For latency-critical apps, cap concurrency lower.
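The two resolutions move capacity in opposite directions, which simple arithmetic makes visible. The helpers below are illustrative. admitted_per_replica additionally assumes that each HTTP request counts as one Modal input, so that @modal.concurrent max_inputs bounds how many requests a replica can hold at once; verify that against your Modal setup before relying on it.

```python
def decode_slots(max_containers: int, max_running_requests: int) -> int:
    """Total SGLang decode slots across all replicas."""
    return max_containers * max_running_requests


def admitted_per_replica(max_inputs: int, max_running_requests: int) -> int:
    """Requests one replica can actually run: the tighter of the two limits (assumption)."""
    return min(max_inputs, max_running_requests)


options = {
    "current (3 x 48)": decode_slots(3, 48),
    "add a replica (4 x 48)": decode_slots(4, 48),
    "lower cap (3 x 40)": decode_slots(3, 40),
    "lower cap (3 x 32)": decode_slots(3, 32),
}
for label, slots in options.items():
    print(f"{label}: {slots} decode slots")

print("admitted per replica with max_inputs=20:", admitted_per_replica(20, 48))
```

Adding a replica raises total capacity from 144 to 192 slots, while lowering the cap to 32 cuts it to 96 in exchange for a shorter verify queue per replica. If the max_inputs assumption holds, the Modal limit of 20 is reached before the SGLang cap of 48, which is one reason to tune the two together.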

Generation quality seems worse than expected / garbled output

Likely cause

Using FP8 KV cache (crashes/accuracy issues) or wrong decode backend

Diagnosis

Check if --kv-cache-dtype fp8_e4m3 was accidentally set. Verify --nsa-decode-backend trtllm is present.

Resolution

Remove any FP8 KV cache flags. Ensure TRT-LLM backends are used for NSA decode and prefill.

Prevention tip

The reference _build_sglang_cmd intentionally omits KV cache dtype to use BF16 default.
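A small lint over the launch argv can catch both mistakes before deploy. This is a sketch with a hypothetical function name; it assumes flags are passed as separate `--flag value` tokens, and the --nsa-prefill-backend spelling is inferred from the --nsa-*-backend pattern used on this page.

```python
from typing import List, Optional


def lint_kv_and_backends(cmd: List[str]) -> List[str]:
    """Return a list of problems with KV cache dtype and NSA backend flags."""

    def value_of(flag: str) -> Optional[str]:
        # Only the "--flag value" form is handled, not "--flag=value".
        if flag not in cmd:
            return None
        index = cmd.index(flag)
        return cmd[index + 1] if index + 1 < len(cmd) else ""

    problems: List[str] = []
    kv_dtype = value_of("--kv-cache-dtype")
    if kv_dtype is not None and kv_dtype.startswith("fp8"):
        problems.append(f"--kv-cache-dtype {kv_dtype} set; omit it to keep BF16 KV")
    for flag in ("--nsa-decode-backend", "--nsa-prefill-backend"):
        if value_of(flag) != "trtllm":
            problems.append(f"{flag} trtllm is missing")
    return problems


good = ["--nsa-decode-backend", "trtllm", "--nsa-prefill-backend", "trtllm"]
bad = good + ["--kv-cache-dtype", "fp8_e4m3"]
print(lint_kv_and_backends(good))  # []
print(lint_kv_and_backends(bad))
```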

Warmup requests fail with 500 errors or timeouts

Likely cause

CUDA graph capture taking longer than warmup timeout, or SGLang crashed during first inference

Diagnosis

Check logs for CUDA graph capture messages and timing. Verify subprocess is still running.

Resolution

Increase warmup timeout from 300s to 600s for first-time graph capture. Ensure warmup happens after /health passes.

Prevention tip

The 4 diverse warmup requests trigger different CUDA graphs. First cold start is always slowest.

Tool calls return malformed JSON or wrong function names

Likely cause

Missing or wrong tool parser flag for GLM-5.1

Diagnosis

Verify --tool-call-parser glm47 is in the launch command. Check the request includes properly formatted tools array.

Resolution

Add --tool-call-parser glm47 to _build_sglang_cmd. Ensure tools conform to OpenAI function calling schema.

reasoning_content is empty even though thinking should be enabled

Likely cause

Missing reasoning parser or request explicitly disabled thinking

Diagnosis

Check for --reasoning-parser glm45 in launch command. Check if request had enable_thinking: false.

Resolution

Add --reasoning-parser glm45 to _build_sglang_cmd. Thinking is enabled by default if parser is present.

19

Cold start remediation

  • Business-hours keep-alive: set ENABLE_KEEPALIVE_CRON = True, fill DEPLOYED_URL, redeploy.
  • Longer idle hold: raise scaledown_window beyond 900s if bursts are wider than 15 minutes.
  • Always warm: set min_containers=1 to eliminate cold starts at the cost of baseline GPU spend (see cost strategies above).
20

Upgrading SGLang or switching GPUs

  1. Bump SGLANG_IMAGE or GPU_TYPE in deploy.py. Align the image tag with the SGLang release you validated.
  2. Invalidate the DeepGEMM cache on GPU architecture change. Remove .compiled-GLM-5.1-FP8 from glm51-deepgemm-cache, then re-run compile_deepgemm on the new SKU.
  3. Tighten --mem-fraction-static if VRAM per GPU drops. Example: B200 192 GB → H200 141 GB - try 0.88 → 0.83 after testing KV headroom.

upgrade.sh
# Example: drop DeepGEMM marker to force recompile on new SM
modal volume rm glm51-deepgemm-cache .compiled-GLM-5.1-FP8
modal run deploy.py::compile_deepgemm