
GLM-5.1 FP8 on Modal

Deployment Pipeline

One-time setup tasks, volumes, and verification steps.

Production deployment of GLM-5.1-FP8 on Modal. We prioritize cost-efficiency and inference performance over minimal setup. The pipeline has two phases: one-time setup (weights + DeepGEMM on CPU/GPU) and production deployment (API lifecycle, scale-to-zero).

Prerequisites: Docker, Python 3.10+, Modal CLI (pip install modal)
  • Hardware target: 8× NVIDIA B200 (Blackwell)
  • Engine: SGLang v0.5.10+ with DeepGEMM
  • Scale: pay-per-use (min_containers = 0 when idle)

Cold starts, keep-warm crons, and log triage live in Tune & Operate.

01

Architecture overview

Decoupled storage: weights and compiled kernels live in Modal Volumes so inference containers do not re-download on every cold start.

┌──────────────────────────────────────────────────────────────────┐
│ Modal Cloud                                                      │
│                                                                  │
│  [Volume: glm51-model-weights]                                   │
│  └─ ~700 GB Safetensors (FP8)                                    │
│     └─ Source: HuggingFace (XetHub Acceleration)                 │
│                                                                  │
│  [Volume: glm51-deepgemm-cache]                                  │
│  └─ Pre-compiled CUDA Kernels (SM100/B200 Specific)              │
│     └─ Source: JIT Compilation Step                              │
│                                                                  │
│  [Container: SGLang Server]                                      │
│  ├─ 8× B200 GPUs (1.5 TB VRAM Total)                             │
│  ├─ Loads weights from Volume (Lazy/Hot)                         │
│  ├─ Loads kernels from Volume (Instant)                          │
│  └─ Runs Inference (EAGLE Speculative Decoding)                  │
│                           │ port 8000                            │
└───────────────────────────┼──────────────────────────────────────┘
                            ▼
                      Client Requests
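The layout above can be sketched as a deploy.py skeleton. This is illustrative only: the app, volume, and mount-path names come from this guide, while the function body and the exact Modal parameters (gpu string, min_containers, timeout) are assumptions to be checked against the Modal docs.

```python
import modal

app = modal.App("glm-5.1-production")

# Persistent volumes keep ~700 GB of weights and the compiled kernels
# outside the container lifecycle, so cold starts skip the re-download.
weights = modal.Volume.from_name("glm51-model-weights", create_if_missing=True)
dg_cache = modal.Volume.from_name("glm51-deepgemm-cache", create_if_missing=True)

@app.function(
    gpu="B200:8",  # 8× Blackwell (sm100); cached kernels are built for this arch
    volumes={"/model": weights, "/dg-cache": dg_cache},
    secrets=[modal.Secret.from_name("glm51-api-key")],
    min_containers=0,  # scale to zero when idle; first request pays the cold start
    timeout=60 * 60,
)
def serve():
    ...  # launch SGLang on port 8000 against /model, DeepGEMM cache on /dg-cache
```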
02

Pipeline steps

Jump to a step below. Each block lists GPU, cost, duration, bullets, and copy-paste commands.

Step 1: CLI, secrets, and repo

Credentials are never baked into images; Modal injects secrets at runtime.

  • huggingface-secret - required for snapshot_download and Xet acceleration.
  • glm51-api-key - API_KEY env passed into SGLang via _build_sglang_cmd.
  • Keep deploy.py in a clean directory; run commands from that folder.
GPU: None (local) · Cost: $0 · Duration: ~5 min
Commands
setup.sh
# Modal CLI
pip install modal
modal setup

# Hugging Face (weights + Xet)
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx

# API key for /v1/chat/completions
modal secret create glm51-api-key API_KEY=your-production-key

Step 2: Download weights (CPU)

CPU-optimized download (~700 GB) into Volume glm51-model-weights. Uses huggingface-hub[hf_xet] for XetHub acceleration (content-defined chunking, often 2–3× faster than plain HTTP for large files).

  • Idempotent: if model.safetensors.index.json exists, the run assumes the download is complete.
  • Retries: up to 3 attempts with increasing backoff (30s, 60s, 90s) for transient network errors.
  • After commit, serving containers mount the same volume path without re-pulling from the hub at runtime (when HF_HUB_OFFLINE=1 in serve).
GPU: None (CPU) · Cost: ~$0.01/hr · Duration: 30–60 min
Commands
terminal
modal run deploy.py::download_model
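The idempotence and retry behavior described above can be sketched as follows. This is a hypothetical helper, not the guide's implementation; `fetch` stands in for the actual huggingface_hub.snapshot_download call and is injected so the retry logic is easy to test.

```python
import time
from pathlib import Path

def download_weights(model_dir: Path, fetch, max_attempts: int = 3,
                     sleep=time.sleep) -> bool:
    """Idempotent download: returns False if the index marker already exists.

    `fetch` stands in for huggingface_hub.snapshot_download(repo_id,
    local_dir=model_dir) on the real download path.
    """
    if (model_dir / "model.safetensors.index.json").exists():
        return False                      # download already complete; skip
    for attempt in range(1, max_attempts + 1):
        try:
            fetch()
            return True
        except OSError:
            if attempt == max_attempts:
                raise                     # transient-error budget exhausted
            sleep(30 * attempt)           # 30 s, then 60 s, between attempts
    return True
```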

Step 3: Compile DeepGEMM (8×B200)

DeepGEMM JIT-compiles FP8 GEMMs for GLM-5.1’s matrix shapes on Blackwell (sm100). Caches under /dg-cache; marker file .compiled-GLM-5.1-FP8 skips repeat work.

  • Hardware lock: compile on B200 (sm100). Kernels built here will not run on H200, A100, or T4.
  • Typical one-time cost is ~$8 for ~10 minutes on 8×B200; skipping this step adds ~10–15 minutes of JIT compilation to every cold start.
  • Requires weights on disk first; implementation reloads volumes before compile and streams logs (text mode, line-buffered).
GPU: 8×B200 · Cost: ~$8 one-time · Duration: 10–15 min
Commands
terminal
modal run deploy.py::compile_deepgemm
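The marker-file guard described above reduces to a pair of small helpers. This is a sketch with assumed names; only the marker filename .compiled-GLM-5.1-FP8 comes from the guide.

```python
from pathlib import Path

def needs_compile(cache_dir: Path, model: str = "GLM-5.1-FP8") -> bool:
    """True unless a previous run left the completion marker in the cache."""
    return not (cache_dir / f".compiled-{model}").exists()

def mark_compiled(cache_dir: Path, model: str = "GLM-5.1-FP8") -> Path:
    """Drop the marker after a successful JIT pass so later runs skip it."""
    marker = cache_dir / f".compiled-{model}"
    marker.touch()
    return marker
```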

Step 4: Verify volumes

Confirms volumes are consistent before you expose the API (matches reference checks).

  • Verifies model.safetensors.index.json exists under the weight path.
  • Verifies DeepGEMM marker .compiled-GLM-5.1-FP8 exists under the cache path.
  • May print shard counts and size hints; fails fast with an actionable error if something is missing.
GPU: None (CPU) · Cost: ~$0.001 · Duration: Instant
Commands
terminal
modal run deploy.py::verify_setup
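The two checks above can be aggregated into one fail-fast routine. A sketch, not the guide's verify_setup; the marker filenames come from the earlier steps, the error messages are illustrative.

```python
from pathlib import Path

def verify_setup(weights_dir: Path, cache_dir: Path,
                 model: str = "GLM-5.1-FP8") -> list[str]:
    """Return human-readable problems; an empty list means ready to deploy."""
    problems = []
    if not (weights_dir / "model.safetensors.index.json").exists():
        problems.append("weights incomplete: rerun deploy.py::download_model")
    if not (cache_dir / f".compiled-{model}").exists():
        problems.append("kernel cache missing: rerun deploy.py::compile_deepgemm")
    return problems
```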

Step 5: Deploy & smoke test

min_containers=0 means the API scales to zero; the first request after an idle period pays the cold-start cost.

  • OpenAI-compatible: POST /v1/chat/completions with Authorization: Bearer $API_KEY.
  • Prometheus: GET /metrics on the same host.
  • Optional: modal run deploy.py runs download → compile → verify in one shot.
GPU: On first request · Cost: Pay per GPU-second · Duration: Cold start 6–10 min first time
Commands
terminal
modal deploy deploy.py
modal app logs glm-5.1-production

# After URL is live
curl -f https://<your-app>.modal.run/health
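Once the URL is live, the OpenAI-compatible endpoint can be exercised from the standard library alone. A sketch: the served model name "GLM-5.1-FP8" is an assumption (check GET /v1/models), everything else follows the headers listed above.

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the endpoint."""
    body = json.dumps({
        "model": "GLM-5.1-FP8",  # served model name assumed; confirm via /v1/models
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Send it with urllib.request.urlopen(chat_request("https://<your-app>.modal.run", api_key, "hello")) after the deploy completes.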
03

End-to-end commands

CLI setup, secrets, per-function runs, deploy, and health check. Adjust modal app logs to your app name in deploy.py.

deployment.sh
# Prerequisites: Docker, Python 3.10+, Modal CLI
pip install modal
modal setup

# Secrets (never baked into images)
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
modal secret create glm51-api-key API_KEY=your-production-key

# One-time pipeline (download → compile → verify)
modal run deploy.py::download_model
modal run deploy.py::compile_deepgemm
modal run deploy.py::verify_setup

# Or run the chained entrypoint if your deploy.py defines it:
# modal run deploy.py

# Production deploy (scale-to-zero: min_containers=0)
modal deploy deploy.py

# Smoke test (replace host after deploy)
curl -f https://<your-app>.modal.run/health
04

References

Further reading on Modal, SGLang, and large-artifact transfer.