GuideProduction

GLM-5.1 FP8 on Modal

Deployment Pipeline

One-time setup tasks, volumes, and verification steps.

Production deployment of GLM-5.1-FP8 on Modal. We prioritize cost-efficiency and inference performance over minimal setup. The pipeline has two phases: one-time setup (weights + DeepGEMM on CPU/GPU) and production deployment (API lifecycle, scale-to-zero).

DockerPython 3.10+Modal CLI (pip install modal)

Hardware target: 8× NVIDIA B200 (Blackwell)
Engine: SGLang v0.5.10+ with DeepGEMM
Scale: pay-per-use (min_containers = 0 when idle)

Cold starts, keep-warm crons, and log triage live in Tune & Operate.

Architecture overview

Decoupled storage: weights and compiled kernels live in Modal Volumes so inference containers do not re-download on every cold start.

Modal Cloud

Volume

glm51-model-weights

~700 GB Safetensors (FP8)
Source: HuggingFace
XetHub acceleration enabled

Volume

glm51-deepgemm-cache

pre-compiled CUDA kernels
SM100/B200 specific
from JIT compilation step

Container

SGLang server

8x B200 GPUs (1.5 TB total VRAM)
loads weights volume (lazy/hot)
loads kernel volume (instant)
runs inference with EAGLE decoding

SGLang port 8000 -> Client requests

Volume storage breakdown

Two persistent volumes hold the model artifacts. Both are created automatically on first use.

Volume Name

glm51-model-weights

Size

~700 GB

Contents

FP8 safetensors shards, config.json, tokenizer files

Source

HuggingFace snapshot_download with Xet acceleration

Volume Name

glm51-deepgemm-cache

Size

~1–2 GB

Contents

Pre-compiled SM100 CUDA kernels for GLM-5.1 matrix shapes

Source

sglang.compile_deep_gemm JIT compilation on B200

Volume Name	Size	Contents	Source
glm51-model-weights	~700 GB	FP8 safetensors shards, config.json, tokenizer files	HuggingFace snapshot_download with Xet acceleration
glm51-deepgemm-cache	~1–2 GB	Pre-compiled SM100 CUDA kernels for GLM-5.1 matrix shapes	sglang.compile_deep_gemm JIT compilation on B200

Why separate volumes? The weight volume is pure I/O (no GPU needed to download), while the DeepGEMM volume requires GPU compilation. Separating them allows cost-efficient CPU-only downloads (~$0.01/hr) followed by one-time GPU compilation (~$8).

Pipeline steps

Jump to a step below. Each block lists GPU, cost, duration, bullets, and copy-paste commands.

Step 1: CLI, secrets, and repo

Credentials never bake into images; Modal injects secrets at runtime.

huggingface-secret - required for snapshot_download and Xet acceleration.
glm51-api-key - API_KEY env passed into SGLang via _build_sglang_cmd.
Keep deploy.py in a clean directory; run commands from that folder.

GPU: None (local)Cost: $0Duration: ~5 min

Commands

setup.sh

1# Modal CLI
2pip install modal
3modal setup
4
5# Hugging Face (weights + Xet)
6modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
7
8# API key for /v1/chat/completions
9modal secret create glm51-api-key API_KEY=your-production-key

Step 2: Download weights (CPU)

CPU-optimized download (~700 GB) into Volume glm51-model-weights. Uses huggingface-hub[hf_xet] for XetHub acceleration (content-defined chunking, often 2–3× faster than plain HTTP for large files).

Idempotent: if model.safetensors.index.json exists, the run assumes the download is complete.
Retries: up to 3 attempts with exponential backoff (30s, 60s, 90s) for transient network errors.
After commit, serving containers mount the same volume path without re-pulling from the hub at runtime (when HF_HUB_OFFLINE=1 in serve).

GPU: None (CPU)Cost: ~$0.01/hrDuration: 30–60 min

Commands

terminal

1modal run deploy.py::download_model

Step 3: Compile DeepGEMM (8×B200)

DeepGEMM JIT-compiles FP8 GEMMs for GLM-5.1’s matrix shapes on Blackwell (sm100). Caches under /dg-cache; marker file .compiled-GLM-5.1-FP8 skips repeat work.

Hardware lock: compile on B200 (sm100). Kernels built here will not run on H200, A100, or T4.
Typical one-time cost ~$8 for ~10 minutes on 8×B200; saves ~10–15 minutes of JIT on every cold start if omitted.
Requires weights on disk first; implementation reloads volumes before compile and streams logs (text mode, line-buffered).

GPU: 8×B200Cost: ~$8 one-timeDuration: 10–15 min

Commands

terminal

1modal run deploy.py::compile_deepgemm

Step 4: Verify volumes

Confirms volumes are consistent before you expose the API (matches reference checks).

Verifies model.safetensors.index.json exists under the weight path.
Verifies DeepGEMM marker .compiled-GLM-5.1-FP8 exists under the cache path.
May print shard counts and size hints - fail fast with actionable errors if something is missing.

GPU: None (CPU)Cost: ~$0.001Duration: Instant

Commands

terminal

1modal run deploy.py::verify_setup

Step 5: Deploy & smoke test

min_containers=0 - API scales to zero; first traffic after idle pays cold-start cost.

OpenAI-compatible: POST /v1/chat/completions with Authorization: Bearer $API_KEY.
Prometheus: GET /metrics on the same host.
Optional: modal run deploy.py runs download → compile → verify in one shot.

GPU: On first requestCost: Pay per GPU-secondDuration: Cold 6–10 min first time

Commands

terminal

1modal deploy deploy.py
2modal app logs glm-5.1-production
3
4# After URL is live
5curl -f https://<your-app>.modal.run/health

End-to-end commands

CLI setup, secrets, per-function runs, deploy, and health check. Adjust modal app logs to your app name in deploy.py.

deployment.sh

1# Prerequisites: Docker, Python 3.10+, Modal CLI
2pip install modal
3modal setup
4
5# Secrets (never baked into images)
6modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
7modal secret create glm51-api-key API_KEY=your-production-key
8
9# One-time pipeline (download → compile → verify)
10modal run deploy.py::download_model
11modal run deploy.py::compile_deepgemm
12modal run deploy.py::verify_setup
13
14# Or run the chained entrypoint if your deploy.py defines it:
15# modal run deploy.py
16
17# Production deploy (scale-to-zero: min_containers=0)
18modal deploy deploy.py
19
20# Smoke test (replace host after deploy)
21curl -f https://<your-app>.modal.run/health

Technical deep dives: understanding the "why" behind each deployment decision.

Technical deep dives

Answers to common questions about DeepGEMM, FP8 precision, volume consistency, and download acceleration.

DeepGEMM Architecture Lock

Why can't DeepGEMM kernels compiled on one GPU run on another?

GPUs execute instructions via specific ISA (Instruction Set Architecture). DeepGEMM JIT-compiles CUDA kernels targeting specific SM architectures and matrix shapes determined by the model.

Blackwell (B200) uses sm_100 with 5th-gen Tensor Cores and native FP8 support (E4M3/E5M2 formats) plus new MMA (Matrix Multiply-Accumulate) opcodes.
Ampere (A100) uses sm_80, Hopper (H100/H200) uses sm_90 — each has different instruction sets.
A kernel compiled for sm_100 contains binary instructions that older GPUs cannot decode. The CUDA PTX intermediate representation gets compiled to SASS (actual machine code) for a specific target.
Compilation must happen on the exact hardware used for serving. B200 kernels won't run on H200, and vice versa.

FP8 E4M3 Precision

Why FP8 E4M3 specifically, and what are the trade-offs?

GLM-5.1-FP8 uses NVIDIA's E4M3 format (4 exponent bits, 3 mantissa bits). This provides sufficient dynamic range for LLM weights while halving memory footprint versus BF16.

Throughput: On Blackwell, FP8 operations deliver 2× throughput compared to FP16/BF16 because Tensor Cores process twice as many FP8 elements per cycle.
Memory bandwidth: LLM inference is memory-bound. Reducing weight precision from 16-bit to 8-bit doubles effective bandwidth, directly increasing tokens/second.
Dynamic range: E4M3 has range ±448, sufficient for normalized activations. The alternative E5M2 format has wider range but lower precision.
Accuracy: Properly scaled FP8 typically shows <1% accuracy degradation on standard benchmarks versus BF16 on well-trained models.

XetHub Download Acceleration

How does huggingface-hub[hf_xet] speed up downloads?

XetHub uses content-defined chunking (similar to Git's pack algorithm) optimized for large ML files, offering 2–3× download speeds over standard HTTP for multi-gigabyte artifacts.

Standard HTTP downloads transfer entire files sequentially. XetHub chunks files by content boundaries, enabling parallel retrieval and deduplication.
For a ~700 GB model, this can reduce download time from 90+ minutes to 30–60 minutes depending on network conditions.
The hf_xet extra installs the optimized transfer client. Set HF_XET_HIGH_PERFORMANCE=1 to enable maximum parallelism.
Partial downloads can resume from the last completed chunk without restarting from zero.

Modal Volume Consistency

Why do we call volume.reload() at startup?

Modal Volumes are distributed filesystems with eventual consistency. If Container A writes data, Container B might not see it immediately without explicit synchronization.

When download_model commits to the volume, the data is durably stored but metadata propagation across the distributed system takes time.
A server container starting milliseconds after the commit might have a stale directory listing that doesn't show the new files.
volume.reload() forces a metadata refresh from the central store, guaranteeing the container sees the latest committed state.
Both model_volume.reload() and dg_volume.reload() are called in setup() to prevent FileNotFoundError on cold starts.

References

Further reading on Modal, SGLang, CUDA architecture, and large-artifact transfer.

Overview & Architecture Configuration & Flags Tune & Operate Code walkthrough

Deployment pipeline

Architecture overview

Volume storage breakdown

Pipeline steps

Step 1: CLI, secrets, and repo

Step 2: Download weights (CPU)

Step 3: Compile DeepGEMM (8×B200)

Step 4: Verify volumes

Step 5: Deploy & smoke test

End-to-end commands

Technical deep dives

DeepGEMM Architecture Lock

FP8 E4M3 Precision

XetHub Download Acceleration

Modal Volume Consistency

References