
GLM-5.1 FP8 on Modal

Tune & Operate

Performance tuning, cold starts, diagnostics, and warning triage.


Tuning goal: maximize throughput on 8×B200 while keeping TTFT under ~300 ms for interactive workloads. This page also covers compile-warning triage, runtime diagnostics, and cold-start remediation, the operational complement to the deployment pipeline.

01

Modal & SGLang parameters

Replica counts and @modal.concurrent interact with SGLang batching; adjust them together, not in isolation.

@app.cls scaling

  • min_containers
    Default: 0

    Scale to zero when idle

  • max_containers
    Default: 3

    Max concurrent replicas (3 × 48 = 144 in-flight decode slots)

  • scaledown_window
    Default: 900

    Seconds idle before Modal scales down

@modal.concurrent

  • @modal.concurrent max_inputs
    Default: 20

    Modal-level queue per container before spill to new replica

_build_sglang_cmd()

  • --mem-fraction-static
    Default: 0.88

    Share of post-weight VRAM for KV cache

  • --max-running-requests
    Default: 48

    Cap concurrent decodes; beyond this, the EAGLE verifier queue grows

  • --max-prefill-tokens
    Default: 32768

    Prefill batch safety cap

  • --max-total-tokens
    Default: 65536

    Per-request token budget (input + output)

  • --watchdog-timeout
    Default: 1200

    20 min - must exceed multi-minute weight load

  • --enable-metrics
    Default: True

    Expose /metrics (Prometheus)

02

EAGLE v2 speculative decoding

GLM-5.1 ships a multi-token prediction head. Draft quality is high enough that effective decode throughput rises sharply without a second model process.

Metric                   Autoregressive   EAGLE v2   Delta
TPOT                     ~20 ms           ~7.7 ms    ~2.6× faster
Aggregate decode tok/s   ~1,750           ~4,600+    ~2.6×

Trade-off: under very high concurrency, EAGLE verification can lengthen TTFT. The --max-running-requests 48 cap is the guardrail.

sglang flags
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
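To see why a short draft can yield a ~2.6× TPOT gain, model the expected tokens emitted per verification step with the standard geometric-series result for speculative decoding. The 70% per-token acceptance rate below is an illustrative assumption, not a measured value:

```python
def expected_tokens_per_step(k: int, a: float) -> float:
    """Expected tokens emitted per verification step with a draft of k
    tokens and i.i.d. per-token acceptance probability a, counting the
    verifier's bonus token: (1 - a^(k+1)) / (1 - a)."""
    if a == 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)

# With 3 draft steps and an assumed ~70% acceptance rate, effective
# tokens per step is ~2.5 - in the ballpark of the observed ~2.6× gain.
```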
03

Why BF16 KV cache with FP8 weights

Weights stay FP8 for capacity and math throughput, but the KV cache remains BF16 (deliberate, not an oversight).

  • Stability: EAGLE + FP8 KV has a known crash path on Blackwell (SGLang #22359).
  • Accuracy: Some flash MLA KV paths regressed quality on B200 (#21291); TRT-LLM NSA backends pair cleanly with BF16 KV.
  • Speed: FP8 KV can be slower than BF16 once quant/dequant overhead is included (#17526).

Practically: omit --kv-cache-dtype fp8_e4m3 so SGLang keeps BF16 KV defaults. Full flag matrix: Configuration & Flags.

04

Concurrency & memory

If you see OOM under load, drop mem-fraction toward 0.85 before raising max-running-requests.

--mem-fraction-static 0.88 aggressively reserves VRAM for the KV cache after weights load. For batch-heavy workloads you can experiment above 48 running requests; watch TTFT percentiles when you do.
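A back-of-envelope way to reason about the mem-fraction vs. concurrency trade is to estimate how many KV tokens fit in the reserved budget. All architecture numbers in this sketch are placeholders, not GLM-5.1's actual config:

```python
def kv_token_capacity(free_vram_gb: float, mem_fraction: float,
                      layers: int, kv_heads: int, head_dim: int,
                      dtype_bytes: int = 2) -> int:
    """Rough KV-cache token capacity: 2 tensors (K and V) per layer per
    KV head, BF16 = 2 bytes per element. Architecture numbers are
    illustrative placeholders."""
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    budget_bytes = free_vram_gb * 1e9 * mem_fraction
    return int(budget_bytes // kv_bytes_per_token)
```

Dropping mem-fraction from 0.88 to 0.85 shrinks this pool proportionally, which is why it trades off directly against raising max-running-requests.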

05

Stability: watchdog & crash monitor

Weight staging from a 700 GB volume can exceed default SGLang watchdog windows.

Production deploy.py sets --watchdog-timeout 1200. After startup, a background thread calls os._exit(1) if the SGLang child exits so Modal replaces the container instead of serving endless 502s.

06

Cold start mitigation

Warmup hits diverse chat paths; optional Modal cron can ping /health during business hours.

After /health succeeds, _warmup() issues four diverse chat completions to capture CUDA graphs. Optionally enable ENABLE_KEEPALIVE_CRON and set DEPLOYED_URL.

deploy.py
# deploy.py (excerpt) - optional business-hours keep-alive
if ENABLE_KEEPALIVE_CRON:
    @app.function(schedule=modal.Cron("*/14 6-22 * * MON-FRI"))
    def keep_warm():
        requests.get(f"{DEPLOYED_URL}/health", timeout=120)
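The four diverse completions _warmup() issues might be shaped like this sketch. Payload contents are illustrative; the no-thinking toggle via `chat_template_kwargs` is a common SGLang convention but an assumption for this model:

```python
def build_warmup_payloads(model: str = "glm-5.1") -> list[dict]:
    """Illustrative payloads for the four warmup paths named in the
    cold-start table: chat, no-thinking, long context, and tools.
    Shapes follow the OpenAI-compatible /v1/chat/completions schema."""
    long_prompt = "background " * 2000  # exercise a long-context CUDA graph
    return [
        {"model": model, "messages": [{"role": "user", "content": "Hello"}]},
        {"model": model, "messages": [{"role": "user", "content": "2+2?"}],
         "chat_template_kwargs": {"enable_thinking": False}},
        {"model": model, "messages": [{"role": "user", "content": long_prompt}]},
        {"model": model, "messages": [{"role": "user", "content": "Weather in SF?"}],
         "tools": [{"type": "function", "function": {
             "name": "get_weather",
             "parameters": {"type": "object",
                            "properties": {"city": {"type": "string"}}}}}]},
    ]
```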
07

Cold start phases (reminder)

  • Modal provisions 8×B200 (~30s)
    Allocation + container spin-up

  • Weight load from Volume (3–5 min)
    ~700 GB at volume throughput

  • DeepGEMM kernel load (instant)
    When pre-compiled cache is present

  • CUDA graph capture (2–3 min)
    Warmup requests after /health OK

  • Warmup, 4 requests (1–2 min)
    Chat, no-thinking, long context, tools

08

Cost strategies

  • Scale-to-zero (~8h/day traffic)
    Monthly (illustrative): ~$12,000
    Cold profile: 6–10 min after ~15 min idle
    Best for: dev, internal tools, predictable daytime usage

  • Business-hours keep-alive cron
    Monthly (illustrative): ~$18,000
    Cold profile: none during hours; overnight/weekend cold possible
    Best for: production APIs with daytime SLA

  • Always-on (min_containers=1)
    Monthly (illustrative): ~$36,000
    Cold profile: none
    Best for: 24/7 low-latency SaaS
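The illustrative monthly figures are mutually consistent with a flat per-GPU-hour rate. This sketch back-solves that rate from the table; the $6.25/GPU-hr figure is an inference from the round numbers, not a quoted Modal price:

```python
def monthly_cost(gpu_hr_rate: float, gpus: int, hours_per_day: float,
                 days: int = 30) -> float:
    """Illustrative spend model: flat rate × GPUs × warm hours."""
    return gpu_hr_rate * gpus * hours_per_day * days

RATE = 6.25  # assumed $/GPU-hr, inferred from the table
# 8 h/day  -> ~$12,000; ~12 h/day (business hours) -> ~$18,000;
# 24 h/day (always-on) -> ~$36,000
```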

09

Tuning triangle

  • Latency
    Our setting: EAGLE + max-running-requests 48
    Effect: sub-300 ms TTFT target under moderate load; ~8 ms TPOT class

  • Throughput
    Our setting: mem-fraction 0.88 + BF16 KV
    Effect: high aggregate tok/s across slots

  • Stability
    Our setting: watchdog 1200 + crash monitor + volume reload
    Effect: clean recycle on failure; consistent volume view

10

Upstream mitigations (summary)

  • SGLang #22359 - EAGLE + FP8 KV crash
    Mitigation: BF16 KV cache (omit FP8 KV dtype)

  • SGLang #21291 - flashmla_kv decode accuracy on B200
    Mitigation: TRT-LLM NSA backends (decode + prefill)

  • SGLang #17526 - FP8 KV slower than BF16
    Mitigation: BF16 KV cache

  • SGLang #19796 - EAGLE NaN on radix (sm120)
    Mitigation: none needed; B200 is sm100, not affected

Operations: compile triage, checklists, diagnostics, and upgrades.

11

Compilation & runtime warnings

Most warnings during modal run deploy.py::compile_deepgemm are benign. Use Find on this page to match log lines.

  • #1 informational: FastAPI ORJSONResponse deprecation (SGLang internal)
    Ignore - upstream; no functional impact

  • #2 informational: generation flags like top_p reported invalid during compile warmup
    Ignore - compile-mode artifact

  • #3 informational: "Unexpected error during package walk: cutlass.cute.experimental"
    Ignore - FlashInfer autotuner noise; autotuning still completes

  • #4 informational: torch.Tensor return type deprecated (flashinfer.jit)
    Monitor - works today; upgrade when SGLang bumps FlashInfer

  • #5 informational: leaked semaphore / shared_memory on multi-process shutdown
    Ignore - normal Python cleanup noise after TP workers exit

  • #6 informational: "Gloo Rank 0 connected to 0 peer ranks"
    Ignore - NCCL handles GPU comm; Gloo is for local control groups

  • #7 monitor: "DeepGEMM enabled but scale_fmt of checkpoint is not ue8m0"
    Watch output quality; typical for E4M3 FP8 on Blackwell

  • #8 informational: KV cache dtype set to fp8_e4m3 during compile_deepgemm on SM100
    OK for the compile step - serving uses BF16 KV per deploy config

  • #9 informational: FP8 KV cache with no scaling factors - defaulting to 1.0
    OK during compile; serving avoids FP8 KV

  • #10 informational: force NSA prefill to sparse MLA (MHA_ONE_SHOT disabled) on Blackwell
    Expected - TRT-LLM sparse MLA path for GLM-5.1 on B200
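If you triage these logs often, a substring map can pre-classify lines before manual review. This is an illustrative sketch, not part of the reference deploy.py; the patterns paraphrase the warnings above:

```python
# Substring of the log line -> disposition, mirroring the triage list.
TRIAGE = {
    "ORJSONResponse": "ignore: upstream FastAPI deprecation",
    "cutlass.cute.experimental": "ignore: FlashInfer autotuner noise",
    "scale_fmt of checkpoint is not ue8m0": "monitor: watch output quality",
    "Leaked semaphore": "ignore: multi-process shutdown noise",
    "connected to 0 peer ranks": "ignore: Gloo control group; NCCL does GPU comm",
}

def triage(line: str) -> str:
    """Return the disposition for a known warning, else flag for review."""
    for pattern, action in TRIAGE.items():
        if pattern in line:
            return action
    return "unclassified: inspect manually"
```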

12

Deployment review checklist

Independent audit items; several are already addressed in the reference deploy.py.

  • Critical: serve() vs startup() ordering with @modal.web_server
    Risk: traffic routed before the port listens
    Fix: align with Modal's large-model pattern (subprocess in serve or experimental http_server)

  • Critical: no subprocess stdout/stderr capture on crash
    Risk: silent failures in Modal logs
    Fix: pipe stdout/stderr and stream in a background thread (reference deploy.py adds log streaming)

  • Significant: region= on @app.cls may be invalid
    Risk: deployment region not guaranteed
    Fix: use supported regional APIs per Modal docs

  • Significant: default watchdog too short for the 700 GB load
    Risk: intermittent startup kills mid-load
    Fix: set --watchdog-timeout 1200 in _build_sglang_cmd (present in reference deploy.py)

  • Significant: no crash detection while serving
    Risk: stale container after SGLang exit
    Fix: crash-monitor thread → os._exit(1) (reference deploy.py)

  • Significant: fragile download idempotency (directory file count)
    Risk: partial downloads mistaken as complete
    Fix: check a sentinel / weight shards explicitly

  • Significant: missing volume.reload() on server startup
    Risk: stale volume view across containers
    Fix: reload model + DeepGEMM volumes in the startup path

  • Optimization: compile_deepgemm uses 8×B200
    Risk: higher $/hr during the one-time compile
    Fix: accept for SM-specific kernels, or explore a supported cheaper GPU if shapes allow

  • Optimization: consider modal.experimental.http_server
    Risk: latency / lifecycle handling for huge models
    Fix: evaluate vs @modal.web_server for your Modal SDK version

  • Optimization: radix cache overhead for single-shot API traffic
    Risk: KV memory headroom
    Fix: consider --disable-radix-cache if the workload is mostly single-turn

13

Upstream bug mitigations

Each lever ties back to the flag tables on Configuration & Flags.

  • SGLang #22359: EAGLE + FP8 KV crash
    Mitigation: BF16 KV cache
    Lever: omit --kv-cache-dtype fp8

  • SGLang #21291: flashmla_kv decode accuracy on B200
    Mitigation: TRT-LLM NSA backends (decode + prefill)
    Lever: --nsa-*-backend trtllm

  • SGLang #17526: FP8 KV slower than BF16
    Mitigation: BF16 KV cache
    Lever: omit --kv-cache-dtype fp8

  • SGLang #19796: EAGLE NaN on radix (sm120)
    Mitigation: none needed; B200 is sm100
    Lever: N/A (sm100)

14

Runtime diagnostics

After deploy, these endpoints should respond from the same host that serves chat completions.

diagnostics.sh
# Readiness (200 only after weights + warmup path completes)
curl -f https://<your-app>.modal.run/health

# Prometheus text exposition
curl https://<your-app>.modal.run/metrics

# Modal platform logs
modal app logs glm-5.1-production
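The readiness curl can be wrapped in a retry loop mirroring the "[health] passed on attempt N" log line from the triage table below. The `fetch` callable is injected (e.g. a requests.get wrapper returning the status code) so the sketch is testable without a live endpoint:

```python
import time

def wait_for_health(fetch, max_attempts: int = 60, interval_s: float = 10.0) -> int:
    """Poll /health until it returns 200. Returns the attempt number
    that succeeded, or raises after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            if fetch() == 200:
                print(f"[health] passed on attempt {attempt}")
                return attempt
        except Exception:
            pass  # server not listening yet
        time.sleep(interval_s)
    raise TimeoutError("health check never passed")
```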
15

Log triage

  • [health] passed on attempt 1
    Meaning: fast path - hot volumes / graphs
    Action: none

  • [health] passed on attempt 15+
    Meaning: slow weight load or graph capture
    Action: if near the 900 s timeout, check volume I/O; consider a longer startup_timeout

  • [monitor] FATAL: SGLang died (code -9)
    Meaning: OOM killer
    Action: lower --mem-fraction-static (e.g. 0.88 → 0.85)

  • [monitor] FATAL: SGLang died (code 137)
    Meaning: SIGKILL / platform kill
    Action: review Modal timeout, manual stops, scaledown

  • DeepGEMM / compile success
    Meaning: kernels cached for future cold starts
    Action: none - verify the marker file exists
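The two FATAL exit codes can be mapped mechanically. A small helper mirroring the triage rows above (the returned messages are paraphrases, not actual deploy.py output):

```python
def explain_exit(code: int) -> str:
    """Map the exit code from '[monitor] FATAL: SGLang died (code N)'
    to a likely cause, per the log-triage table."""
    if code == -9:
        # Popen reports signal deaths as negatives: -9 == SIGKILL,
        # most often the kernel OOM killer inside the container
        return "OOM killer: lower --mem-fraction-static (e.g. 0.88 -> 0.85)"
    if code == 137:
        # 128 + 9: SIGKILL as reported by a shell / platform supervisor
        return "platform kill: review Modal timeout, manual stops, scaledown"
    if code == 0:
        return "clean exit: likely intentional shutdown"
    return f"unexpected exit {code}: inspect streamed SGLang logs"
```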

16

Cold start remediation

  • Business-hours keep-alive: set ENABLE_KEEPALIVE_CRON = True, fill DEPLOYED_URL, redeploy.
  • Longer idle hold: raise scaledown_window beyond 900s if bursts are wider than 15 minutes.
  • Always warm: set min_containers=1 to eliminate cold starts at the cost of baseline GPU spend (see cost strategies above).
17

Upgrading SGLang or switching GPUs

  1. Bump SGLANG_IMAGE or GPU_TYPE in deploy.py. Align the image tag with the SGLang release you validated.
  2. Invalidate the DeepGEMM cache on a GPU architecture change. Remove .compiled-GLM-5.1-FP8 from glm51-deepgemm-cache, then re-run compile_deepgemm on the new SKU.
  3. Tighten --mem-fraction-static if VRAM per GPU drops. Example: B200 192 GB → H200 141 GB; try 0.88 → 0.83 after testing KV headroom.
upgrade.sh
# Example: drop DeepGEMM marker to force recompile on new SM
modal volume rm glm51-deepgemm-cache .compiled-GLM-5.1-FP8
modal run deploy.py::compile_deepgemm