GLM-5.1 FP8 on Modal
Deployment Pipeline
One-time setup tasks, volumes, and verification steps.
Deployment pipeline
Production deployment of GLM-5.1-FP8 on Modal. We prioritize cost-efficiency and inference performance over minimal setup. The pipeline has two phases: one-time setup (weights + DeepGEMM on CPU/GPU) and production deployment (API lifecycle, scale-to-zero).
- Hardware target: 8× NVIDIA B200 (Blackwell)
- Engine: SGLang v0.5.10+ with DeepGEMM
- Scale: pay-per-use (min_containers = 0 when idle)
Cold starts, keep-warm crons, and log triage live in Tune & Operate.
Architecture overview
Decoupled storage: weights and compiled kernels live in Modal Volumes so inference containers do not re-download on every cold start.
Volume
glm51-model-weights
- ~700 GB Safetensors (FP8)
- Source: HuggingFace
- XetHub acceleration enabled
Volume
glm51-deepgemm-cache
- pre-compiled CUDA kernels
- SM100/B200 specific
- from JIT compilation step
Container
SGLang server
- 8x B200 GPUs (1.5 TB total VRAM)
- loads weights volume (lazy/hot)
- loads kernel volume (instant)
- runs inference with EAGLE decoding
Volume storage breakdown
Two persistent volumes hold the model artifacts. Both are created automatically on first use.
Volume Name
glm51-model-weights
Size
~700 GB
Contents
FP8 safetensors shards, config.json, tokenizer files
Source
HuggingFace snapshot_download with Xet acceleration
Volume Name
glm51-deepgemm-cache
Size
~1–2 GB
Contents
Pre-compiled SM100 CUDA kernels for GLM-5.1 matrix shapes
Source
sglang.compile_deep_gemm JIT compilation on B200
| Volume Name | Size | Contents | Source |
|---|---|---|---|
| glm51-model-weights | ~700 GB | FP8 safetensors shards, config.json, tokenizer files | HuggingFace snapshot_download with Xet acceleration |
| glm51-deepgemm-cache | ~1–2 GB | Pre-compiled SM100 CUDA kernels for GLM-5.1 matrix shapes | sglang.compile_deep_gemm JIT compilation on B200 |
Why separate volumes? The weight volume is pure I/O (no GPU needed to download), while the DeepGEMM volume requires GPU compilation. Separating them allows cost-efficient CPU-only downloads (~$0.01/hr) followed by one-time GPU compilation (~$8).
Pipeline steps
Jump to a step below. Each block lists GPU, cost, duration, bullets, and copy-paste commands.
Step 1: CLI, secrets, and repo
Credentials never bake into images; Modal injects secrets at runtime.
- huggingface-secret - required for snapshot_download and Xet acceleration.
- glm51-api-key - API_KEY env passed into SGLang via _build_sglang_cmd.
- Keep deploy.py in a clean directory; run commands from that folder.
1# Modal CLI2pip install modal3modal setup45# Hugging Face (weights + Xet)6modal secret create huggingface-secret HF_TOKEN=hf_xxxxx78# API key for /v1/chat/completions9modal secret create glm51-api-key API_KEY=your-production-key
Step 2: Download weights (CPU)
CPU-optimized download (~700 GB) into Volume glm51-model-weights. Uses huggingface-hub[hf_xet] for XetHub acceleration (content-defined chunking, often 2–3× faster than plain HTTP for large files).
- Idempotent: if model.safetensors.index.json exists, the run assumes the download is complete.
- Retries: up to 3 attempts with exponential backoff (30s, 60s, 90s) for transient network errors.
- After commit, serving containers mount the same volume path without re-pulling from the hub at runtime (when HF_HUB_OFFLINE=1 in serve).
1modal run deploy.py::download_model
Step 3: Compile DeepGEMM (8×B200)
DeepGEMM JIT-compiles FP8 GEMMs for GLM-5.1’s matrix shapes on Blackwell (sm100). Caches under /dg-cache; marker file .compiled-GLM-5.1-FP8 skips repeat work.
- Hardware lock: compile on B200 (sm100). Kernels built here will not run on H200, A100, or T4.
- Typical one-time cost ~$8 for ~10 minutes on 8×B200; saves ~10–15 minutes of JIT on every cold start if omitted.
- Requires weights on disk first; implementation reloads volumes before compile and streams logs (text mode, line-buffered).
1modal run deploy.py::compile_deepgemm
Step 4: Verify volumes
Confirms volumes are consistent before you expose the API (matches reference checks).
- Verifies model.safetensors.index.json exists under the weight path.
- Verifies DeepGEMM marker .compiled-GLM-5.1-FP8 exists under the cache path.
- May print shard counts and size hints - fail fast with actionable errors if something is missing.
1modal run deploy.py::verify_setup
Step 5: Deploy & smoke test
min_containers=0 - API scales to zero; first traffic after idle pays cold-start cost.
- OpenAI-compatible: POST /v1/chat/completions with Authorization: Bearer $API_KEY.
- Prometheus: GET /metrics on the same host.
- Optional: modal run deploy.py runs download → compile → verify in one shot.
1modal deploy deploy.py2modal app logs glm-5.1-production34# After URL is live5curl -f https://<your-app>.modal.run/health
End-to-end commands
CLI setup, secrets, per-function runs, deploy, and health check. Adjust modal app logs to your app name in deploy.py.
1# Prerequisites: Docker, Python 3.10+, Modal CLI2pip install modal3modal setup45# Secrets (never baked into images)6modal secret create huggingface-secret HF_TOKEN=hf_xxxxx7modal secret create glm51-api-key API_KEY=your-production-key89# One-time pipeline (download → compile → verify)10modal run deploy.py::download_model11modal run deploy.py::compile_deepgemm12modal run deploy.py::verify_setup1314# Or run the chained entrypoint if your deploy.py defines it:15# modal run deploy.py1617# Production deploy (scale-to-zero: min_containers=0)18modal deploy deploy.py1920# Smoke test (replace host after deploy)21curl -f https://<your-app>.modal.run/health
Technical deep dives: understanding the "why" behind each deployment decision.
Technical deep dives
Answers to common questions about DeepGEMM, FP8 precision, volume consistency, and download acceleration.
DeepGEMM Architecture Lock
Why can't DeepGEMM kernels compiled on one GPU run on another?
GPUs execute instructions via specific ISA (Instruction Set Architecture). DeepGEMM JIT-compiles CUDA kernels targeting specific SM architectures and matrix shapes determined by the model.
- Blackwell (B200) uses sm_100 with 5th-gen Tensor Cores and native FP8 support (E4M3/E5M2 formats) plus new MMA (Matrix Multiply-Accumulate) opcodes.
- Ampere (A100) uses sm_80, Hopper (H100/H200) uses sm_90 — each has different instruction sets.
- A kernel compiled for sm_100 contains binary instructions that older GPUs cannot decode. The CUDA PTX intermediate representation gets compiled to SASS (actual machine code) for a specific target.
- Compilation must happen on the exact hardware used for serving. B200 kernels won't run on H200, and vice versa.
FP8 E4M3 Precision
Why FP8 E4M3 specifically, and what are the trade-offs?
GLM-5.1-FP8 uses NVIDIA's E4M3 format (4 exponent bits, 3 mantissa bits). This provides sufficient dynamic range for LLM weights while halving memory footprint versus BF16.
- Throughput: On Blackwell, FP8 operations deliver 2× throughput compared to FP16/BF16 because Tensor Cores process twice as many FP8 elements per cycle.
- Memory bandwidth: LLM inference is memory-bound. Reducing weight precision from 16-bit to 8-bit doubles effective bandwidth, directly increasing tokens/second.
- Dynamic range: E4M3 has range ±448, sufficient for normalized activations. The alternative E5M2 format has wider range but lower precision.
- Accuracy: Properly scaled FP8 typically shows <1% accuracy degradation on standard benchmarks versus BF16 on well-trained models.
XetHub Download Acceleration
How does huggingface-hub[hf_xet] speed up downloads?
XetHub uses content-defined chunking (similar to Git's pack algorithm) optimized for large ML files, offering 2–3× download speeds over standard HTTP for multi-gigabyte artifacts.
- Standard HTTP downloads transfer entire files sequentially. XetHub chunks files by content boundaries, enabling parallel retrieval and deduplication.
- For a ~700 GB model, this can reduce download time from 90+ minutes to 30–60 minutes depending on network conditions.
- The hf_xet extra installs the optimized transfer client. Set HF_XET_HIGH_PERFORMANCE=1 to enable maximum parallelism.
- Partial downloads can resume from the last completed chunk without restarting from zero.
Modal Volume Consistency
Why do we call volume.reload() at startup?
Modal Volumes are distributed filesystems with eventual consistency. If Container A writes data, Container B might not see it immediately without explicit synchronization.
- When download_model commits to the volume, the data is durably stored but metadata propagation across the distributed system takes time.
- A server container starting milliseconds after the commit might have a stale directory listing that doesn't show the new files.
- volume.reload() forces a metadata refresh from the central store, guaranteeing the container sees the latest committed state.
- Both model_volume.reload() and dg_volume.reload() are called in setup() to prevent FileNotFoundError on cold starts.
References
Further reading on Modal, SGLang, CUDA architecture, and large-artifact transfer.
- Modal - scale-to-zero serverless inferencemodal.com
- SGLang - DeepGEMM & FP8github.com
- Hugging Face - XetHub for large fileshuggingface.co
- NVIDIA - Blackwell architecturenvidia.com
- NVIDIA - Tensor Core MMA (PTX)docs.nvidia.com
- NVIDIA - CUDA Virtual Architecturedocs.nvidia.com
- NVIDIA - FP8 Formats for Deep Learning (arXiv)arxiv.org