GuideProduction

GLM-5.1 FP8 on Modal

Deployment Pipeline

One-time setup tasks, volumes, and verification steps.

Deployment pipeline

Production deployment of GLM-5.1-FP8 on Modal. We prioritize cost-efficiency and inference performance over minimal setup. The pipeline has two phases: one-time setup (weights + DeepGEMM on CPU/GPU) and production deployment (API lifecycle, scale-to-zero).

DockerPython 3.10+Modal CLI (pip install modal)
  • Hardware target: 8× NVIDIA B200 (Blackwell)
  • Engine: SGLang v0.5.10+ with DeepGEMM
  • Scale: pay-per-use (min_containers = 0 when idle)

Cold starts, keep-warm crons, and log triage live in Tune & Operate.

01

Architecture overview

Decoupled storage: weights and compiled kernels live in Modal Volumes so inference containers do not re-download on every cold start.

Modal Cloud

Volume

glm51-model-weights

  • ~700 GB Safetensors (FP8)
  • Source: HuggingFace
  • XetHub acceleration enabled

Volume

glm51-deepgemm-cache

  • pre-compiled CUDA kernels
  • SM100/B200 specific
  • from JIT compilation step

Container

SGLang server

  • 8x B200 GPUs (1.5 TB total VRAM)
  • loads weights volume (lazy/hot)
  • loads kernel volume (instant)
  • runs inference with EAGLE decoding
SGLang port 8000 -> Client requests
02

Volume storage breakdown

Two persistent volumes hold the model artifacts. Both are created automatically on first use.

Volume Name

glm51-model-weights

Size

~700 GB

Contents

FP8 safetensors shards, config.json, tokenizer files

Source

HuggingFace snapshot_download with Xet acceleration

Volume Name

glm51-deepgemm-cache

Size

~1–2 GB

Contents

Pre-compiled SM100 CUDA kernels for GLM-5.1 matrix shapes

Source

sglang.compile_deep_gemm JIT compilation on B200

Why separate volumes? The weight volume is pure I/O (no GPU needed to download), while the DeepGEMM volume requires GPU compilation. Separating them allows cost-efficient CPU-only downloads (~$0.01/hr) followed by one-time GPU compilation (~$8).

03

Pipeline steps

Jump to a step below. Each block lists GPU, cost, duration, bullets, and copy-paste commands.

Step 1: CLI, secrets, and repo

Credentials never bake into images; Modal injects secrets at runtime.

  • huggingface-secret - required for snapshot_download and Xet acceleration.
  • glm51-api-key - API_KEY env passed into SGLang via _build_sglang_cmd.
  • Keep deploy.py in a clean directory; run commands from that folder.
GPU: None (local)Cost: $0Duration: ~5 min
Commands
setup.sh
1# Modal CLI
2pip install modal
3modal setup
4
5# Hugging Face (weights + Xet)
6modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
7
8# API key for /v1/chat/completions
9modal secret create glm51-api-key API_KEY=your-production-key

Step 2: Download weights (CPU)

CPU-optimized download (~700 GB) into Volume glm51-model-weights. Uses huggingface-hub[hf_xet] for XetHub acceleration (content-defined chunking, often 2–3× faster than plain HTTP for large files).

  • Idempotent: if model.safetensors.index.json exists, the run assumes the download is complete.
  • Retries: up to 3 attempts with exponential backoff (30s, 60s, 90s) for transient network errors.
  • After commit, serving containers mount the same volume path without re-pulling from the hub at runtime (when HF_HUB_OFFLINE=1 in serve).
GPU: None (CPU)Cost: ~$0.01/hrDuration: 30–60 min
Commands
terminal
1modal run deploy.py::download_model

Step 3: Compile DeepGEMM (8×B200)

DeepGEMM JIT-compiles FP8 GEMMs for GLM-5.1’s matrix shapes on Blackwell (sm100). Caches under /dg-cache; marker file .compiled-GLM-5.1-FP8 skips repeat work.

  • Hardware lock: compile on B200 (sm100). Kernels built here will not run on H200, A100, or T4.
  • Typical one-time cost ~$8 for ~10 minutes on 8×B200; saves ~10–15 minutes of JIT on every cold start if omitted.
  • Requires weights on disk first; implementation reloads volumes before compile and streams logs (text mode, line-buffered).
GPU: 8×B200Cost: ~$8 one-timeDuration: 10–15 min
Commands
terminal
1modal run deploy.py::compile_deepgemm

Step 4: Verify volumes

Confirms volumes are consistent before you expose the API (matches reference checks).

  • Verifies model.safetensors.index.json exists under the weight path.
  • Verifies DeepGEMM marker .compiled-GLM-5.1-FP8 exists under the cache path.
  • May print shard counts and size hints - fail fast with actionable errors if something is missing.
GPU: None (CPU)Cost: ~$0.001Duration: Instant
Commands
terminal
1modal run deploy.py::verify_setup

Step 5: Deploy & smoke test

min_containers=0 - API scales to zero; first traffic after idle pays cold-start cost.

  • OpenAI-compatible: POST /v1/chat/completions with Authorization: Bearer $API_KEY.
  • Prometheus: GET /metrics on the same host.
  • Optional: modal run deploy.py runs download → compile → verify in one shot.
GPU: On first requestCost: Pay per GPU-secondDuration: Cold 6–10 min first time
Commands
terminal
1modal deploy deploy.py
2modal app logs glm-5.1-production
3
4# After URL is live
5curl -f https://<your-app>.modal.run/health
04

End-to-end commands

CLI setup, secrets, per-function runs, deploy, and health check. Adjust modal app logs to your app name in deploy.py.

deployment.sh
1# Prerequisites: Docker, Python 3.10+, Modal CLI
2pip install modal
3modal setup
4
5# Secrets (never baked into images)
6modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
7modal secret create glm51-api-key API_KEY=your-production-key
8
9# One-time pipeline (download → compile → verify)
10modal run deploy.py::download_model
11modal run deploy.py::compile_deepgemm
12modal run deploy.py::verify_setup
13
14# Or run the chained entrypoint if your deploy.py defines it:
15# modal run deploy.py
16
17# Production deploy (scale-to-zero: min_containers=0)
18modal deploy deploy.py
19
20# Smoke test (replace host after deploy)
21curl -f https://<your-app>.modal.run/health

Technical deep dives: understanding the "why" behind each deployment decision.

05

Technical deep dives

Answers to common questions about DeepGEMM, FP8 precision, volume consistency, and download acceleration.

?

DeepGEMM Architecture Lock

Why can't DeepGEMM kernels compiled on one GPU run on another?

GPUs execute instructions via specific ISA (Instruction Set Architecture). DeepGEMM JIT-compiles CUDA kernels targeting specific SM architectures and matrix shapes determined by the model.

  • Blackwell (B200) uses sm_100 with 5th-gen Tensor Cores and native FP8 support (E4M3/E5M2 formats) plus new MMA (Matrix Multiply-Accumulate) opcodes.
  • Ampere (A100) uses sm_80, Hopper (H100/H200) uses sm_90 — each has different instruction sets.
  • A kernel compiled for sm_100 contains binary instructions that older GPUs cannot decode. The CUDA PTX intermediate representation gets compiled to SASS (actual machine code) for a specific target.
  • Compilation must happen on the exact hardware used for serving. B200 kernels won't run on H200, and vice versa.
?

FP8 E4M3 Precision

Why FP8 E4M3 specifically, and what are the trade-offs?

GLM-5.1-FP8 uses NVIDIA's E4M3 format (4 exponent bits, 3 mantissa bits). This provides sufficient dynamic range for LLM weights while halving memory footprint versus BF16.

  • Throughput: On Blackwell, FP8 operations deliver 2× throughput compared to FP16/BF16 because Tensor Cores process twice as many FP8 elements per cycle.
  • Memory bandwidth: LLM inference is memory-bound. Reducing weight precision from 16-bit to 8-bit doubles effective bandwidth, directly increasing tokens/second.
  • Dynamic range: E4M3 has range ±448, sufficient for normalized activations. The alternative E5M2 format has wider range but lower precision.
  • Accuracy: Properly scaled FP8 typically shows <1% accuracy degradation on standard benchmarks versus BF16 on well-trained models.
?

XetHub Download Acceleration

How does huggingface-hub[hf_xet] speed up downloads?

XetHub uses content-defined chunking (similar to Git's pack algorithm) optimized for large ML files, offering 2–3× download speeds over standard HTTP for multi-gigabyte artifacts.

  • Standard HTTP downloads transfer entire files sequentially. XetHub chunks files by content boundaries, enabling parallel retrieval and deduplication.
  • For a ~700 GB model, this can reduce download time from 90+ minutes to 30–60 minutes depending on network conditions.
  • The hf_xet extra installs the optimized transfer client. Set HF_XET_HIGH_PERFORMANCE=1 to enable maximum parallelism.
  • Partial downloads can resume from the last completed chunk without restarting from zero.
?

Modal Volume Consistency

Why do we call volume.reload() at startup?

Modal Volumes are distributed filesystems with eventual consistency. If Container A writes data, Container B might not see it immediately without explicit synchronization.

  • When download_model commits to the volume, the data is durably stored but metadata propagation across the distributed system takes time.
  • A server container starting milliseconds after the commit might have a stale directory listing that doesn't show the new files.
  • volume.reload() forces a metadata refresh from the central store, guaranteeing the container sees the latest committed state.
  • Both model_volume.reload() and dg_volume.reload() are called in setup() to prevent FileNotFoundError on cold starts.
06

References

Further reading on Modal, SGLang, CUDA architecture, and large-artifact transfer.