> **Note:** The canonical experience is the interactive HTML tab: [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations). This file is a text mirror for search engines and AI tools.

# GLM-5.1 FP8 - Tune & Operate

Tuning goal: **maximize throughput** on 8×B200 while keeping **TTFT under ~300 ms** for interactive workloads. This page also covers **compile-time warnings**, **runtime diagnostics**, and **cold-start remediation**, making it the operational complement to the deployment pipeline.

---

## 1. Modal & SGLang parameters {#params}

Replica counts and `@modal.concurrent` interact with SGLang batching; adjust them together, not in isolation.

### @app.cls scaling

| Parameter | Default | Description |
|---|---|---|
| `min_containers` | `0` | Scale to zero when idle |
| `max_containers` | `3` | Max concurrent replicas (3 × 48 = 144 in-flight decode slots) |
| `scaledown_window` | `900` | Seconds idle before Modal scales down |

### @modal.concurrent

| Parameter | Default | Description |
|---|---|---|
| `max_inputs` | `20` | Modal-level queue per container before spill to new replica |

### _build_sglang_cmd()

| Parameter | Default | Description |
|---|---|---|
| `--mem-fraction-static` | `0.88` | Fraction of GPU memory reserved for static allocation (weights + KV cache pool); the remainder is left for activations and CUDA graphs |
| `--max-running-requests` | `48` | Cap on concurrent decodes; beyond this the EAGLE verification queue backs up and TTFT rises |
| `--max-prefill-tokens` | `32768` | Prefill batch safety cap |
| `--max-total-tokens` | `65536` | Size of the KV token pool shared by all running requests (input + output); the per-request limit is `--context-length` |
| `--watchdog-timeout` | `1200` | 20 min — must exceed multi-minute weight load |
| `--enable-metrics` | `True` | Expose /metrics (Prometheus) |
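
Because the Modal and SGLang limits stack, it can help to compute how many requests can actually reach the decoder. The sketch below uses the defaults from the tables above. It assumes each Modal input corresponds to one in-flight SGLang request, which may not hold for every routing setup, and `effective_slots` is an illustrative name rather than part of deploy.py.

```python
def effective_slots(max_containers: int, max_inputs: int, max_running_requests: int) -> dict:
    """Estimate fleet-wide concurrency from the Modal and SGLang limits."""
    # A replica cannot decode more requests than Modal admits to it,
    # nor more than SGLang is willing to run at once.
    per_replica = min(max_inputs, max_running_requests)
    if max_inputs < max_running_requests:
        bottleneck = "modal max_inputs"
    else:
        bottleneck = "sglang max-running-requests"
    return {
        "per_replica": per_replica,
        "fleet": per_replica * max_containers,
        "bottleneck": bottleneck,
    }


defaults = effective_slots(max_containers=3, max_inputs=20, max_running_requests=48)
print(defaults)
```

Under that assumption, the default `max_inputs` of 20 binds before the SGLang cap of 48, so the 144-slot figure would only be reachable with `max_inputs` raised to at least 48.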

---

## 2. EAGLE v2 speculative decoding {#eagle}

GLM-5.1 ships a multi-token prediction head. Draft quality is high enough that effective decode throughput rises sharply without a second model process.

| Metric | Autoregressive | EAGLE v2 | Delta |
|---|---|---|---|
| TPOT | ~20 ms | ~7.7 ms | ~2.6× faster |
| Aggregate decode tok/s | ~1,750 | ~4,600+ | ~2.6× |

> **Warning:** Trade-off: under very high concurrency, EAGLE verification can lengthen TTFT. The `--max-running-requests 48` cap is the guardrail.

```bash
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
```

---

## 3. Why BF16 KV cache with FP8 weights {#kv-bf16}

Weights stay FP8 for capacity and math throughput, but the KV cache remains BF16 (deliberate, not an oversight).

- **Stability:** EAGLE + FP8 KV has a known crash path on Blackwell (SGLang #22359).
- **Accuracy:** Some flash MLA KV paths regressed quality on B200 (#21291); TRT-LLM NSA backends pair cleanly with BF16 KV.
- **Speed:** FP8 KV can be slower than BF16 once quant/dequant overhead is included (#17526).

Practically: **omit** `--kv-cache-dtype fp8_e4m3` so SGLang keeps BF16 KV defaults. Full flag matrix: [Configuration & Flags](https://www.quantml.org/guides/glm-5-1-fp8/configuration).
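
A small guard can catch an FP8 KV flag that sneaks back into the launch command. This is a sketch, not code from the reference deploy.py: it assumes the command is built as a list of strings, and the helper names are hypothetical.

```python
from typing import List, Optional


def kv_cache_dtype(cmd: List[str]) -> Optional[str]:
    """Return the value passed to --kv-cache-dtype, or None if the flag is absent."""
    for i, arg in enumerate(cmd):
        if arg == "--kv-cache-dtype":
            return cmd[i + 1] if i + 1 < len(cmd) else ""
        if arg.startswith("--kv-cache-dtype="):
            return arg.split("=", 1)[1]
    return None


def assert_bf16_kv(cmd: List[str]) -> None:
    """Fail fast if the launch command opts into an FP8 KV cache."""
    dtype = kv_cache_dtype(cmd)
    if dtype is not None and dtype.startswith("fp8"):
        raise ValueError(f"FP8 KV cache requested ({dtype}); omit --kv-cache-dtype")
```

Calling `assert_bf16_kv(cmd)` just before spawning the server turns a silent misconfiguration into an immediate startup error.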

---

## 4. Concurrency & memory {#concurrency}

If you see OOM under load, drop `--mem-fraction-static` toward 0.85 before raising `--max-running-requests`.

`--mem-fraction-static 0.88` is an aggressive setting: it leaves little GPU memory outside the static pool (weights plus KV cache). For batch-heavy workloads you can experiment above 48 running requests, but watch TTFT percentiles when you do.
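
To see roughly what a change in the fraction buys, the arithmetic can be sketched as follows. It assumes `--mem-fraction-static` covers weights plus the KV pool (as SGLang's documentation describes), 192 GB per B200, and the ~700 GB checkpoint sharded evenly across 8 GPUs. Real numbers will differ with fragmentation and runtime buffers.

```python
def kv_pool_gb(vram_gb: float, mem_fraction_static: float, weights_gb_per_gpu: float) -> float:
    """Rough per-GPU KV pool size: static budget minus resident weights."""
    return vram_gb * mem_fraction_static - weights_gb_per_gpu


VRAM_GB = 192                  # B200, per GPU
WEIGHTS_PER_GPU = 700 / 8      # assumes even sharding of a ~700 GB checkpoint

for fraction in (0.88, 0.85):
    pool = kv_pool_gb(VRAM_GB, fraction, WEIGHTS_PER_GPU)
    dynamic = VRAM_GB * (1 - fraction)   # memory left outside the static pool
    print(f"fraction={fraction}: KV pool ~{pool:.1f} GB, dynamic headroom ~{dynamic:.1f} GB")
```

On these assumptions, moving from 0.88 to 0.85 shifts a little under 6 GB per GPU from the KV pool to dynamic headroom, which is why it is the first lever to try after an OOM.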

---

## 5. Stability: watchdog & crash monitor {#stability}

Weight staging from a 700 GB volume can exceed default SGLang watchdog windows.

Production `deploy.py` sets `--watchdog-timeout 1200`. After startup, a background thread calls `os._exit(1)` if the SGLang child exits so Modal replaces the container instead of serving endless 502s.
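
The crash monitor described above can be sketched in a few lines. This is an illustrative reconstruction, not the reference implementation: the function name and the injectable `on_death` hook (useful for testing) are assumptions, while the log line follows the format shown in the log triage table.

```python
import os
import subprocess
import sys
import threading


def monitor_process(proc: subprocess.Popen, on_death=os._exit) -> threading.Thread:
    """Watch a child process and terminate the container when it exits."""

    def _watch() -> None:
        code = proc.wait()  # blocks until the child exits
        print(f"[monitor] FATAL: SGLang died (code {code})", flush=True)
        # os._exit skips cleanup handlers, so the container dies immediately
        # and Modal can schedule a replacement.
        on_death(1)

    thread = threading.Thread(target=_watch, name="sglang-monitor", daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    # Demo with a child that exits at once; a real deployment passes the SGLang Popen.
    child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])
    monitor_process(child, on_death=lambda code: print(f"would exit with {code}")).join()
```

Using `os._exit` rather than `sys.exit` matters here: `sys.exit` raised from a background thread only ends that thread, leaving the container alive.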

---

## 6. Cold start mitigation {#warmup-keepalive}

Warmup hits diverse chat paths; optional Modal cron can ping /health during business hours.

After `/health` succeeds, `_warmup()` issues four diverse chat completions to capture CUDA graphs. Optionally enable `ENABLE_KEEPALIVE_CRON` and set `DEPLOYED_URL`.

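```python
# deploy.py (excerpt) - optional business-hours keep-alive
if ENABLE_KEEPALIVE_CRON:
    # Every 14 minutes, 06:00-22:59, Mon-Fri (Modal evaluates cron in UTC by default).
    # The 14-minute interval stays just inside the 900 s scaledown window.
    @app.function(schedule=modal.Cron("*/14 6-22 * * MON-FRI"))
    def keep_warm():
        import requests  # imported here so it resolves inside the container image

        requests.get(f"{DEPLOYED_URL}/health", timeout=120)
```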
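
The health-then-warmup ordering can be sketched as a polling loop followed by four request bodies. Everything here is illustrative: the function names, the attempt count and delay (chosen to total 900 s), and the payload contents are assumptions, and `chat_template_kwargs` is used on the assumption that the server accepts it for disabling thinking.

```python
import time
from typing import Callable, Dict, List


def wait_for_health(probe: Callable[[], bool], attempts: int = 90, delay_s: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll `probe` until it returns True; return the attempt number that passed."""
    for attempt in range(1, attempts + 1):
        if probe():
            print(f"[health] passed on attempt {attempt}")
            return attempt
        sleep(delay_s)
    raise TimeoutError(f"/health did not pass after {attempts} attempts")


def warmup_payloads(model: str = "glm-5.1") -> List[Dict]:
    """Four request bodies covering chat, no-thinking, long context, and tools."""

    def user(text: str) -> List[Dict]:
        return [{"role": "user", "content": text}]

    tool = {"type": "function", "function": {
        "name": "get_time", "description": "Return the current time.",
        "parameters": {"type": "object", "properties": {}}}}
    return [
        {"model": model, "messages": user("Hi"), "max_tokens": 16},
        {"model": model, "messages": user("Hi"), "max_tokens": 16,
         "chat_template_kwargs": {"enable_thinking": False}},
        {"model": model, "messages": user("lorem ipsum " * 2000), "max_tokens": 16},
        {"model": model, "messages": user("What time is it?"), "max_tokens": 32,
         "tools": [tool]},
    ]
```

In a deployment, `probe` would wrap an HTTP GET to `/health`, and each payload would be POSTed to `/v1/chat/completions` once the loop returns.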

---

## 7. Cold start phases (reminder) {#cold-phases}

| Phase | Duration | Notes |
|---|---|---|
| Modal provisions 8×B200 | ~30s | Allocation + container spin-up |
| Weight load from Volume | 3–5 min | ~700 GB at volume throughput |
| DeepGEMM kernel load | Instant | When pre-compiled cache is present |
| CUDA graph capture | 2–3 min | Warmup requests after /health OK |
| Warmup (4 requests) | 1–2 min | Chat, no-thinking, long context, tools |
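
Summing the phases gives a quick sanity check against the startup timeout. The sketch below takes the table's ranges at face value and compares the worst case with a 900 s startup timeout, the value mentioned in the log triage section.

```python
# (phase, min_seconds, max_seconds) taken from the table above
PHASES = [
    ("provision 8xB200", 30, 30),
    ("weight load", 180, 300),
    ("DeepGEMM kernel load", 0, 0),
    ("CUDA graph capture", 120, 180),
    ("warmup requests", 60, 120),
]

best = sum(low for _, low, _ in PHASES)
worst = sum(high for _, _, high in PHASES)
print(f"cold start: {best / 60:.1f}-{worst / 60:.1f} min")
print(f"margin under a 900 s startup timeout: {900 - worst} s")
```

The worst case leaves a few minutes of margin, which a slow volume read or a missing DeepGEMM cache can easily consume.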

---

## 8. Cost strategies {#costs}

| Strategy | Monthly (illustrative) | Cold profile | Best for |
|---|---|---|---|
| Scale-to-zero (~8h/day traffic) | ~$12,000 | 6–10 min after ~15 min idle | Dev, internal tools, predictable daytime usage |
| Business-hours keep-alive cron | ~$18,000 | None during hours; overnight/weekend cold possible | Production APIs with daytime SLA |
| Always-on (min_containers=1) | ~$36,000 | None | 24/7 low-latency SaaS |
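
The three figures are mutually consistent if you back out an hourly rate from the always-on row. The sketch below does that arithmetic. The 730-hour month and the reading of the cron window as 17 hours on 5 days are assumptions, and idle tails and cold-start time are ignored.

```python
HOURS_PER_MONTH = 730          # average month (8760 h / 12)
ALWAYS_ON_MONTHLY = 36_000     # illustrative figure from the table

# Implied hourly rate for one 8xB200 replica.
hourly = ALWAYS_ON_MONTHLY / HOURS_PER_MONTH


def monthly_cost(hours_per_day: float, days_per_week: float) -> float:
    """Monthly spend for one replica that is up for the given hours."""
    hours = hours_per_day * days_per_week * 52 / 12
    return hours * hourly


scale_to_zero = monthly_cost(8, 7)      # ~8 h/day of traffic
business_hours = monthly_cost(17, 5)    # cron window 06:00-22:59, Mon-Fri

print(f"implied rate: ${hourly:.2f}/h")
print(f"scale-to-zero: ${scale_to_zero:,.0f}  business hours: ${business_hours:,.0f}")
```

Swapping in your own hourly rate and traffic window gives a first-order estimate before committing to a strategy.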

---

## 9. Tuning triangle {#triangle}

| Dimension | Our setting | Effect |
|---|---|---|
| Latency | EAGLE + max-running-requests 48 | Sub-300 ms TTFT target under moderate load; ~8 ms TPOT class |
| Throughput | mem-fraction 0.88 + BF16 KV | High aggregate tok/s across slots |
| Stability | watchdog 1200 + crash monitor + volume reload | Clean recycle on failure; consistent volume view |

---

## 10. Performance baselines {#perf-baselines}

Expected performance on 8×B200 with EAGLE enabled. Use these as reference points when benchmarking your deployment.

| Metric | 8×B200 Value | Condition | Notes |
|---|---|---|---|
| Time to First Token (TTFT) | ~246 ms | Warm, low concurrency, pre-captured graphs | Inflates under high concurrency due to EAGLE verify queue |
| Time Per Output Token (TPOT) | ~7.7 ms | EAGLE, low concurrency | ~20 ms without EAGLE (2.6× slower) |
| Decode throughput (per user) | 30-75 tok/s | Varies by concurrency and output length | Higher at low concurrency, lower when batched |
| Aggregate throughput | ~4,600+ tok/s | All slots active across 8 GPUs | Combined across --max-running-requests slots |
| EAGLE accept length | ~3.5 tokens | Typical drafting acceptance | Consistent across hardware types |
| Max concurrent requests | 48 per replica | --max-running-requests 48 | Cap prevents TTFT inflation at high load |
| Cold start total | 6-10 min | Pre-compiled DeepGEMM, from scale-to-zero | 15+ min without pre-compiled kernels |
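
The TPOT and accept-length rows can be cross-checked against each other. The sketch below assumes each draft-and-verify step emits `accept_length` tokens on average, so the time per step is TPOT times accept length. That is a simplification of how speculative decoding is accounted for, not a measurement.

```python
def implied_verify_step_ms(tpot_ms: float, accept_length: float) -> float:
    """Time per draft-and-verify step if each step emits `accept_length` tokens."""
    return tpot_ms * accept_length


baseline_tpot = 20.0   # ms, autoregressive
eagle_tpot = 7.7       # ms, with EAGLE
accept_length = 3.5    # tokens emitted per step

step_ms = implied_verify_step_ms(eagle_tpot, accept_length)
speedup = baseline_tpot / eagle_tpot
# Extra cost of one speculative step relative to one plain decode step.
overhead = step_ms / baseline_tpot - 1

print(f"step ~{step_ms:.1f} ms, speedup ~{speedup:.1f}x, step overhead ~{overhead:.0%}")
```

Read this way, a speculative step costs about a third more than a plain decode step but yields 3.5 tokens, which is where the ~2.6× figure comes from. If your measured accept length drops, expect the speedup to shrink proportionally.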

---

## 11. Upstream mitigations (summary) {#upstream}

| Bug | Impact | Mitigation |
|---|---|---|
| SGLang #22359 | EAGLE + FP8 KV crash | BF16 KV cache (omit FP8 KV dtype) |
| SGLang #21291 | flashmla_kv decode accuracy on B200 | TRT-LLM NSA backends (decode + prefill) |
| SGLang #17526 | FP8 KV slower than BF16 | BF16 KV cache |
| SGLang #19796 | EAGLE NaN on radix (sm120) | B200 is sm100 — not affected |

---

## 12. Compilation & runtime warnings {#warnings-compile}

Most warnings during `modal run deploy.py::compile_deepgemm` are benign. Use Find on this page to match log lines.

| # | Severity | Message | Action |
|---|---|---|---|
| 1 | Informational | FastAPI ORJSONResponse deprecation (SGLang internal) | Ignore — upstream; no functional impact |
| 2 | Informational | Generation flags like top_p reported invalid during compile warmup | Ignore — compile-mode artifact |
| 3 | Informational | Unexpected error during package walk: cutlass.cute.experimental | Ignore — FlashInfer autotuner noise; autotuning still completes |
| 4 | Informational | torch.Tensor return type deprecated (flashinfer.jit) | Monitor — works today; upgrade when SGLang bumps FlashInfer |
| 5 | Informational | Leaked semaphore / shared_memory on multi-process shutdown | Ignore — normal Python cleanup noise after TP workers exit |
| 6 | Informational | Gloo Rank 0 connected to 0 peer ranks | Ignore — NCCL used for GPU comm; Gloo for local control groups |
| 7 | Monitor | DeepGEMM enabled but scale_fmt of checkpoint is not ue8m0 | Watch output quality; typical for E4M3 FP8 on Blackwell |
| 8 | Informational | KV cache dtype set to fp8_e4m3 during compile_deep_gemm on SM10 | OK for compile step — serving uses BF16 KV per deploy config |
| 9 | Informational | FP8 KV cache with no scaling factors — defaulting to 1.0 | OK during compile; serving avoids FP8 KV |
| 10 | Informational | Force NSA prefill to sparse MLA (MHA_ONE_SHOT disabled) on Blackwell | Expected — TRT-LLM sparse MLA path for GLM-5.1 on B200 |
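
When scanning long compile logs, the table can be turned into a small filter. This is a sketch: the substrings are taken from the paraphrased messages above and may need adjusting to the exact log text in your SGLang version, and `triage` is a hypothetical helper.

```python
from typing import Optional

# Substring -> recommended action, condensed from the table above.
KNOWN_WARNINGS = {
    "ORJSONResponse": "ignore",
    "cutlass.cute.experimental": "ignore",
    "leaked semaphore": "ignore",
    "connected to 0 peer ranks": "ignore",
    "scale_fmt of checkpoint is not ue8m0": "monitor",
    "MHA_ONE_SHOT": "expected",
}


def triage(line: str) -> Optional[str]:
    """Return the action for a known warning, or None for an unrecognized line."""
    lowered = line.lower()
    for needle, action in KNOWN_WARNINGS.items():
        if needle.lower() in lowered:
            return action
    return None  # not in the table: read it properly
```

Lines that return `None` are the ones worth reading; everything else has already been classified.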

---

## 13. Deployment review checklist {#fixes-review}

Independent audit items; several are already addressed in the reference deploy.py.

| Severity | Issue | Impact | Mitigation / fix |
|---|---|---|---|
| **Critical** | serve() vs startup() ordering with @modal.web_server | Race: traffic routed before port listens | Align with Modal large-model pattern (subprocess in serve or experimental http_server) |
| **Critical** | No subprocess stdout/stderr capture on crash | Silent failures in Modal logs | Pipe stdout/stderr and stream in a background thread (deploy.py adds log streaming) |
| **Significant** | region= on @app.cls may be invalid | Deployment region not guaranteed | Use supported regional APIs per Modal docs |
| **Significant** | Default watchdog too short for 700GB load | Intermittent startup kills mid load | Set --watchdog-timeout 1200 in _build_sglang_cmd (present in reference deploy.py) |
| **Significant** | No crash detection while serving | Stale container after SGLang exit | Crash monitor thread → os._exit(1) (reference deploy.py) |
| **Significant** | Fragile download idempotency (directory file count) | Partial downloads mistaken as complete | Check sentinel / weight shards explicitly |
| **Significant** | Missing volume.reload() on server startup | Stale volume view across containers | Reload model + DeepGEMM volumes in startup path |
| **Optimization** | compile_deepgemm uses 8×B200 | Higher $/hr during one-time compile | Accept for SM-specific kernels; or explore supported cheaper GPU if shapes allow |
| **Optimization** | Consider modal.experimental.http_server | Latency / lifecycle handling for huge models | Evaluate vs @modal.web_server for your Modal SDK version |
| **Optimization** | Radix cache overhead for single-shot API traffic | KV memory headroom | Consider --disable-radix-cache if workload is mostly single-turn |

---

## 14. Upstream bug mitigations {#upstream-table}

Config levers tie back to the flag tables on [Configuration](https://www.quantml.org/guides/glm-5-1-fp8/configuration).

| Issue | Symptom | Mitigation | Config lever |
|---|---|---|---|
| SGLang #22359 | EAGLE + FP8 KV crash | BF16 KV cache (omit FP8 KV dtype) | Omit --kv-cache-dtype fp8 |
| SGLang #21291 | flashmla_kv decode accuracy on B200 | TRT-LLM NSA backends (decode + prefill) | --nsa-*-backend trtllm |
| SGLang #17526 | FP8 KV slower than BF16 | BF16 KV cache | Omit --kv-cache-dtype fp8 |
| SGLang #19796 | EAGLE NaN on radix (sm120) | B200 is sm100 — not affected | N/A (sm100) |

---

## 15. Runtime diagnostics {#runtime-diagnostics}

After deploy, these endpoints should respond from the same host that serves chat completions.

```bash
# Readiness (200 only after weights + warmup path completes)
curl -f https://<your-app>.modal.run/health

# Prometheus text exposition
curl https://<your-app>.modal.run/metrics

# Modal platform logs
modal app logs glm-5.1-production
```

---

## 16. Health check sequence {#health-sequence}

Step-by-step verification that your deployment is working correctly.

### Step 1: Check Modal app status

```bash
modal app list | grep glm-5.1-production
```

**Expected:** Shows the app in the `deployed` state (the endpoint URL is printed by `modal deploy`)  
**If fails:** Run `modal deploy deploy.py` to create the app

### Step 2: Verify health endpoint

```bash
curl -f https://<your-app>.modal.run/health
```

**Expected:** HTTP 200 (may take 6-10 min on cold start)  
**If fails:** Check `modal app logs` for startup errors

### Step 3: Test basic inference

```bash
curl -X POST https://<your-app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'
```

**Expected:** JSON response with assistant message  
**If fails:** Check API_KEY secret is set correctly

### Step 4: Verify Prometheus metrics

```bash
curl https://<your-app>.modal.run/metrics | head -20
```

**Expected:** Prometheus text format with sglang_* metrics  
**If fails:** Ensure --enable-metrics is in launch command
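
To go beyond eyeballing the output, the metrics text can be parsed into a dictionary. This is a minimal sketch that assumes samples carry no trailing timestamp. It accepts both `sglang:` and `sglang_` prefixes because naming can differ between versions, and the series name in the sample is only illustrative.

```python
from typing import Dict, Tuple


def parse_metrics(text: str, prefixes: Tuple[str, ...] = ("sglang:", "sglang_")) -> Dict[str, float]:
    """Parse Prometheus text exposition into {series: value} for SGLang series."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        if series.startswith(prefixes):
            try:
                values[series] = float(value)
            except ValueError:
                continue   # ignore malformed samples
    return values


sample = (
    "# HELP sglang:num_running_reqs Running requests\n"
    "# TYPE sglang:num_running_reqs gauge\n"
    'sglang:num_running_reqs{model_name="glm-5.1"} 12.0\n'
    "python_gc_collections_total 5\n"
)
print(parse_metrics(sample))
```

An empty result from a live endpoint is a quick signal that the server is up but metrics are not enabled.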

---

## 17. Log triage {#log-triage}

| Pattern | Meaning | Action |
|---|---|---|
| `[health] passed on attempt 1` | Fast path — hot volumes / graphs | None |
| `[health] passed on attempt 15+` | Slow weight load or graph capture | If near 900s timeout, check volume I/O; consider longer startup_timeout |
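| `[monitor] FATAL: SGLang died (code -9)` | SIGKILL as reported by Python's `subprocess` (signal 9); most often the OOM killer | Lower --mem-fraction-static (e.g. 0.88 → 0.85) |
| `[monitor] FATAL: SGLang died (code 137)` | SIGKILL as reported through a shell (128 + 9); OOM or platform kill | Review Modal timeout, manual stops, scaledown |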
| `DeepGEMM / compile success` | Kernels cached for future cold starts | None — verify marker file exists |
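
Both exit codes in the table encode a signal number, and a small helper makes that explicit. The sketch assumes a POSIX host. `describe_exit` is a hypothetical name, and telling an OOM kill from a platform kill still requires the surrounding logs.

```python
import signal


def describe_exit(code: int) -> str:
    """Translate a child exit status into a readable cause."""
    if code < 0:
        signum = -code          # Python's subprocess reports "killed by signal N" as -N
    elif code > 128:
        signum = code - 128     # shells report the same event as 128 + N
    else:
        signum = None
    if signum is not None:
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            pass                # not a valid signal number on this platform
    return "clean exit" if code == 0 else f"exited with status {code}"


for code in (-9, 137, 1, 0):
    print(code, "->", describe_exit(code))
```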

---

## 18. Troubleshooting scenarios {#troubleshooting}

Common problems and their solutions. Use Ctrl/Cmd+F to search for your specific symptom.

### Container returns 502 Bad Gateway after running fine for hours

**Likely cause:** SGLang subprocess crashed (OOM, CUDA error, NCCL timeout) but Modal container stayed alive

**Diagnosis:** Check `modal app logs` for `[monitor] FATAL: SGLang died` messages. Exit codes -9 and 137 both mean the process received SIGKILL, which under load usually points to the OOM killer.

**Resolution:** The crash monitor should trigger `os._exit(1)` so Modal replaces the container. If the monitor is missing, add the `_monitor_process` thread to deploy.py.

**Prevention tip:** Lower `--mem-fraction-static` from 0.88 to 0.85 if OOM is recurring. Check for memory leaks in long sessions.

### Cold start takes 15+ minutes instead of expected 6-10 minutes

**Likely cause:** DeepGEMM kernels are being JIT-compiled at startup instead of loading from cache

**Diagnosis:** Run `modal run deploy.py::verify_setup` and check whether the `.compiled-GLM-5.1-FP8` marker exists in the DeepGEMM volume.

**Resolution:** Run `modal run deploy.py::compile_deepgemm` once on B200 to pre-compile kernels. Ensure `dg_volume.commit()` is called after compilation.

**Prevention tip:** Never change `GPU_TYPE` without recompiling DeepGEMM. The marker file includes the GPU type for verification.

### FileNotFoundError: Model weights missing at /model-cache/GLM-5.1-FP8

**Likely cause:** Volume eventual consistency — server container started before download commit propagated

**Diagnosis:** Check whether `download_model` completed successfully and called `model_volume.commit()`.

**Resolution:** Add `model_volume.reload()` at the start of the `Server.setup()` method. This forces a metadata refresh from the central store.

**Prevention tip:** Always reload both volumes at startup. The reference deploy.py does this in `@modal.enter()`.

### TTFT spikes to 2-3 seconds under load (was ~250ms at low concurrency)

**Likely cause:** EAGLE speculative decoding verification queue is backing up

**Diagnosis:** Check whether concurrent requests exceed `--max-running-requests` (default 48). Look at `/metrics` for queue depth.

**Resolution:** Either raise `max_containers` from 3 to 4+, or reduce `--max-running-requests` to 32-40 for lower TTFT variance.

**Prevention tip:** EAGLE trades some TTFT for massive TPOT improvement. For latency-critical apps, cap concurrency lower.

### Generation quality seems worse than expected / garbled output

**Likely cause:** Using FP8 KV cache (crashes/accuracy issues) or wrong decode backend

**Diagnosis:** Check whether `--kv-cache-dtype fp8_e4m3` was accidentally set. Verify `--nsa-decode-backend trtllm` is present.

**Resolution:** Remove any FP8 KV cache flags. Ensure TRT-LLM backends are used for NSA decode and prefill.

**Prevention tip:** The reference `_build_sglang_cmd` intentionally omits the KV cache dtype so SGLang uses its BF16 default.

### Warmup requests fail with 500 errors or timeouts

**Likely cause:** CUDA graph capture taking longer than warmup timeout, or SGLang crashed during first inference

**Diagnosis:** Check logs for CUDA graph capture messages and timing. Verify subprocess is still running.

**Resolution:** Increase warmup timeout from 300s to 600s for first-time graph capture. Ensure warmup happens after /health passes.

**Prevention tip:** The 4 diverse warmup requests trigger different CUDA graphs. First cold start is always slowest.

### Tool calls return malformed JSON or wrong function names

**Likely cause:** Missing or wrong tool parser flag for GLM-5.1

**Diagnosis:** Verify `--tool-call-parser glm47` is in the launch command. Check that the request includes a properly formatted `tools` array.

**Resolution:** Add `--tool-call-parser glm47` to `_build_sglang_cmd`. Ensure tools conform to the OpenAI function calling schema.

### reasoning_content is empty even though thinking should be enabled

**Likely cause:** Missing reasoning parser or request explicitly disabled thinking

**Diagnosis:** Check for `--reasoning-parser glm45` in the launch command. Check whether the request set `enable_thinking: false`.

**Resolution:** Add `--reasoning-parser glm45` to `_build_sglang_cmd`. Thinking is enabled by default if the parser is present.
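
Several of the scenarios above come down to a missing or wrong launch flag, so a lint over the command list can catch them before deploy. This is a sketch: the flag values are the ones named in this section, the command is assumed to be a list of strings, and `lint_launch_cmd` is a hypothetical helper.

```python
from typing import Dict, List

# Flags this guide expects, with the values named in the scenarios above.
REQUIRED_FLAGS: Dict[str, str] = {
    "--nsa-decode-backend": "trtllm",
    "--tool-call-parser": "glm47",
    "--reasoning-parser": "glm45",
}


def lint_launch_cmd(cmd: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    given: Dict[str, str] = {}
    for i, arg in enumerate(cmd):
        if not arg.startswith("--"):
            continue
        name, eq, value = arg.partition("=")   # supports --flag=value
        if not eq:                             # and --flag value
            nxt = cmd[i + 1] if i + 1 < len(cmd) else ""
            value = "" if nxt.startswith("--") else nxt
        given[name] = value
    return [f"{flag} should be {want!r}, got {given.get(flag)!r}"
            for flag, want in REQUIRED_FLAGS.items() if given.get(flag) != want]
```

Running this against the output of `_build_sglang_cmd` in CI or at container start keeps a flag regression from surfacing as garbled tool calls in production.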

---

## 19. Cold start remediation {#cold-remediation}

- **Business-hours keep-alive:** set `ENABLE_KEEPALIVE_CRON = True`, fill `DEPLOYED_URL`, redeploy.
- **Longer idle hold:** raise `scaledown_window` beyond 900s if bursts are wider than 15 minutes.
- **Always warm:** set `min_containers=1` to eliminate cold starts at the cost of baseline GPU spend (see cost strategies above).

---

## 20. Upgrading SGLang or switching GPUs {#upgrade}

1. **Bump `SGLANG_IMAGE` or `GPU_TYPE` in deploy.py:** align the image tag with the SGLang release you validated.
2. **Invalidate the DeepGEMM cache on GPU architecture change:** remove `.compiled-GLM-5.1-FP8` from `glm51-deepgemm-cache`, then re-run `compile_deepgemm` on the new SKU.
3. **Tighten `--mem-fraction-static` if VRAM per GPU drops:** for example, when moving from B200 (192 GB) to H200 (141 GB), try 0.88 → 0.83 after testing KV headroom.

```bash
# Example: drop DeepGEMM marker to force recompile on new SM
modal volume rm glm51-deepgemm-cache .compiled-GLM-5.1-FP8
modal run deploy.py::compile_deepgemm
```

---

## Related sections

- [Overview & Architecture](https://www.quantml.org/guides/glm-5-1-fp8)
- [Deployment Pipeline](https://www.quantml.org/guides/glm-5-1-fp8/deployment)
- [Configuration & Flags](https://www.quantml.org/guides/glm-5-1-fp8/configuration)
- [Code Walkthrough](https://www.quantml.org/guides/glm-5-1-fp8/code)
