GLM-5.1 FP8 on Modal
Deployment Pipeline
One-time setup tasks, volumes, and verification steps.
Production deployment of GLM-5.1-FP8 on Modal. We prioritize cost-efficiency and inference performance over minimal setup. The pipeline has two phases: one-time setup (weights + DeepGEMM on CPU/GPU) and production deployment (API lifecycle, scale-to-zero).
- Hardware target: 8× NVIDIA B200 (Blackwell)
- Engine: SGLang v0.5.10+ with DeepGEMM
- Scale: pay-per-use (min_containers = 0 when idle)
Cold starts, keep-warm crons, and log triage live in Tune & Operate.
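The targets above map onto a Modal app definition roughly like the following configuration sketch. Function and mount-path names (`serve`, `/models`, `/dg-cache`) are illustrative, not necessarily what the actual deploy.py uses:

```python
import modal

app = modal.App("glm-5.1-production")

weights = modal.Volume.from_name("glm51-model-weights")
dg_cache = modal.Volume.from_name("glm51-deepgemm-cache")

@app.function(
    gpu="B200:8",                      # 8× Blackwell, as targeted above
    min_containers=0,                  # scale to zero when idle
    volumes={"/models": weights, "/dg-cache": dg_cache},
    secrets=[modal.Secret.from_name("glm51-api-key")],
)
@modal.web_server(8000)                # SGLang listens on port 8000
def serve():
    ...  # launch SGLang against /models, with the DeepGEMM cache at /dg-cache
```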
Architecture overview
Decoupled storage: weights and compiled kernels live in Modal Volumes so inference containers do not re-download on every cold start.
┌──────────────────────────────────────────────────────────────────┐
│                           Modal Cloud                            │
│                                                                  │
│  [Volume: glm51-model-weights]                                   │
│    └─ ~700 GB Safetensors (FP8)                                  │
│    └─ Source: HuggingFace (XetHub Acceleration)                  │
│                                                                  │
│  [Volume: glm51-deepgemm-cache]                                  │
│    └─ Pre-compiled CUDA Kernels (SM100/B200 Specific)            │
│    └─ Source: JIT Compilation Step                               │
│                                                                  │
│  [Container: SGLang Server] ───────────────────────────────────┐ │
│  │                                                             │ │
│  │  8× B200 GPUs (1.5TB VRAM Total)                            │ │
│  │  ├── Loads weights from Volume (Lazy/Hot)                   │ │
│  │  ├── Loads kernels from Volume (Instant)                    │ │
│  │  └── Runs Inference (EAGLE Speculative Decoding)            │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                              │ port 8000                         │
└──────────────────────────────┬───────────────────────────────────┘
                               ▼
                        Client Requests

Pipeline steps
Jump to a step below. Each block lists GPU, cost, duration, bullets, and copy-paste commands.
Step 1: CLI, secrets, and repo
Credentials are never baked into images; Modal injects secrets at runtime.
- huggingface-secret - required for snapshot_download and Xet acceleration.
- glm51-api-key - API_KEY env passed into SGLang via _build_sglang_cmd.
- Keep deploy.py in a clean directory; run commands from that folder.
```shell
# Modal CLI
pip install modal
modal setup

# Hugging Face (weights + Xet)
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx

# API key for /v1/chat/completions
modal secret create glm51-api-key API_KEY=your-production-key
```
Step 2: Download weights (CPU)
CPU-optimized download (~700 GB) into Volume glm51-model-weights. Uses huggingface-hub[hf_xet] for XetHub acceleration (content-defined chunking, often 2–3× faster than plain HTTP for large files).
- Idempotent: if model.safetensors.index.json exists, the run assumes the download is complete.
- Retries: up to 3 attempts, with escalating backoff between them (30s, 60s, 90s), to ride out transient network errors.
- After the volume commit, serving containers mount the same path and never re-pull from the Hub at runtime (HF_HUB_OFFLINE=1 is set in serve).
```shell
modal run deploy.py::download_model
```
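The idempotency and retry behavior described above can be sketched in plain Python. The helper name, mount path, and `fetch` callback are illustrative; the real logic lives in deploy.py's download_model:

```python
import time
from pathlib import Path

MODEL_DIR = Path("/models/GLM-5.1-FP8")            # illustrative mount path
INDEX_FILE = MODEL_DIR / "model.safetensors.index.json"

def download_with_retries(fetch, index_file=INDEX_FILE,
                          max_attempts=3, base_backoff=30, sleep=time.sleep):
    """Skip if already complete; otherwise retry with escalating backoff (30s, 60s, 90s)."""
    if Path(index_file).exists():
        return "already-downloaded"                # idempotent: index file marks completion
    for attempt in range(1, max_attempts + 1):
        try:
            fetch()                                # e.g. huggingface_hub.snapshot_download(...)
            return "downloaded"
        except OSError:
            if attempt == max_attempts:
                raise                              # out of attempts: surface the error
            sleep(base_backoff * attempt)          # 30s, then 60s, then 90s
```

Checking for `model.safetensors.index.json` first is what makes re-running the step a cheap no-op.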
Step 3: Compile DeepGEMM (8×B200)
DeepGEMM JIT-compiles FP8 GEMMs for GLM-5.1’s matrix shapes on Blackwell (sm100). Kernels are cached under /dg-cache; the marker file .compiled-GLM-5.1-FP8 lets repeat runs skip the work.
- Hardware lock: compile on B200 (sm100). Kernels built here will not run on H200, A100, or T4.
- Typical one-time cost: ~$8 for ~10 minutes on 8×B200. Skipping this step adds ~10–15 minutes of JIT compilation to every cold start.
- Requires weights on disk first; implementation reloads volumes before compile and streams logs (text mode, line-buffered).
```shell
modal run deploy.py::compile_deepgemm
```
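The marker-file guard described above amounts to a once-per-volume latch; a minimal sketch (the `run_compile` callback stands in for the actual JIT warm-up, which this snippet does not implement):

```python
from pathlib import Path

CACHE_DIR = Path("/dg-cache")                       # DeepGEMM cache volume mount
MARKER_NAME = ".compiled-GLM-5.1-FP8"

def compile_deepgemm_once(run_compile, cache_dir=CACHE_DIR, marker_name=MARKER_NAME):
    """Run the (expensive) JIT compile only once per cache volume."""
    marker = Path(cache_dir) / marker_name
    if marker.exists():
        return "cached"                             # marker present: skip ~10 min of JIT
    run_compile()                                   # warm-up pass that triggers DeepGEMM JIT
    marker.touch()                                  # record completion for future runs
    return "compiled"
```

Because the marker lives on the volume rather than in the container, any future container that mounts /dg-cache sees it and skips compilation.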
Step 4: Verify volumes
Confirms both volumes are in a consistent state before you expose the API; the checks mirror those in the reference implementation.
- Verifies model.safetensors.index.json exists under the weight path.
- Verifies DeepGEMM marker .compiled-GLM-5.1-FP8 exists under the cache path.
- Also prints shard counts and size hints, and fails fast with an actionable error if anything is missing.
```shell
modal run deploy.py::verify_setup
```
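The two checks above reduce to a few lines of Python. Error wording and the return message are illustrative, not the exact output of deploy.py:

```python
from pathlib import Path

def verify_setup(weights_dir, cache_dir):
    """Fail fast, with actionable messages, if either volume is incomplete."""
    index = Path(weights_dir) / "model.safetensors.index.json"
    marker = Path(cache_dir) / ".compiled-GLM-5.1-FP8"
    errors = []
    if not index.exists():
        errors.append(f"missing {index} - run deploy.py::download_model first")
    if not marker.exists():
        errors.append(f"missing {marker} - run deploy.py::compile_deepgemm first")
    if errors:
        raise RuntimeError("; ".join(errors))
    shards = list(Path(weights_dir).glob("*.safetensors"))   # size/shard hint
    return f"OK: {len(shards)} weight shard(s), DeepGEMM marker present"
```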
Step 5: Deploy & smoke test
With min_containers=0 the API scales to zero; the first request after an idle period pays the cold-start cost.
- OpenAI-compatible: POST /v1/chat/completions with Authorization: Bearer $API_KEY.
- Prometheus: GET /metrics on the same host.
- Optional: modal run deploy.py runs download → compile → verify in one shot.
```shell
modal deploy deploy.py
modal app logs glm-5.1-production

# After URL is live
curl -f https://<your-app>.modal.run/health
```
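Once the URL is live, a chat request can be shaped with nothing but the standard library. A hedged sketch: the host and API key are placeholders, and the model name passed in `"model"` is an assumption about what the server registers:

```python
import json
import urllib.request

def build_chat_request(host, api_key, messages, model="GLM-5.1-FP8"):
    """Build an OpenAI-compatible /v1/chat/completions request (constructed, not sent)."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"https://{host}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",   # value of the glm51-api-key secret
            "Content-Type": "application/json",
        },
        method="POST",
    )

# req = build_chat_request("<your-app>.modal.run", "your-production-key",
#                          [{"role": "user", "content": "Hello"}])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```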
End-to-end commands
CLI setup, secrets, per-function runs, deploy, and health check. Adjust the modal app logs command to match the app name defined in deploy.py.
```shell
# Prerequisites: Docker, Python 3.10+, Modal CLI
pip install modal
modal setup

# Secrets (never baked into images)
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
modal secret create glm51-api-key API_KEY=your-production-key

# One-time pipeline (download → compile → verify)
modal run deploy.py::download_model
modal run deploy.py::compile_deepgemm
modal run deploy.py::verify_setup

# Or run the chained entrypoint if your deploy.py defines it:
# modal run deploy.py

# Production deploy (scale-to-zero: min_containers=0)
modal deploy deploy.py

# Smoke test (replace host after deploy)
curl -f https://<your-app>.modal.run/health
```
References
Further reading on Modal, SGLang, and large-artifact transfer.