> **Note:** The canonical experience is the interactive HTML tab: [Deployment Pipeline](https://www.quantml.org/guides/glm-5-1-fp8/deployment). This file is a text mirror for search engines and AI tools.

# GLM-5.1 FP8 - Deployment Pipeline

Production deployment of **GLM-5.1-FP8** on Modal. We prioritize **cost-efficiency** and **inference performance** over minimal setup. The pipeline has two phases: **one-time setup** (weights + DeepGEMM on CPU/GPU) and **production deployment** (API lifecycle, scale-to-zero).

**Prerequisites:** Docker, Python 3.10+, Modal CLI (pip install modal)

- **Hardware target:** 8× NVIDIA B200 (Blackwell)
- **Engine:** SGLang v0.5.10+ with DeepGEMM
- **Scale:** pay-per-use (min_containers = 0 when idle)

> **Note:** Cold starts, keep-warm crons, and log triage live in [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations).

---

## 1. Architecture overview {#architecture}

Decoupled storage: weights and compiled kernels live in Modal Volumes so inference containers do not re-download on every cold start.

### Modal Cloud

**Volume: glm51-model-weights**
- ~700 GB Safetensors (FP8)
- Source: HuggingFace
- XetHub acceleration enabled

**Volume: glm51-deepgemm-cache**
- Pre-compiled CUDA kernels
- SM100/B200 specific
- From JIT compilation step

**Container: SGLang server**
- 8× B200 GPUs (1.5 TB VRAM total)
- Loads weights from Volume (lazy/hot)
- Loads kernels from Volume (instant)
- Runs inference with EAGLE decoding

→ SGLang **port 8000** → Client requests

---

## 2. Volume storage breakdown {#volumes}

Two persistent volumes hold the model artifacts. Both are created automatically on first use.

| Volume Name | Size | Contents | Source |
|---|---|---|---|
| `glm51-model-weights` | ~700 GB | FP8 safetensors shards, config.json, tokenizer files | HuggingFace snapshot_download with Xet acceleration |
| `glm51-deepgemm-cache` | ~1–2 GB | Pre-compiled SM100 CUDA kernels for GLM-5.1 matrix shapes | sglang.compile_deep_gemm JIT compilation on B200 |

**Why separate volumes?** The weight volume is pure I/O (no GPU needed to download), while the DeepGEMM volume requires GPU compilation. Separating them allows cost-efficient CPU-only downloads (~$0.01/hr) followed by one-time GPU compilation (~$8).

---

## 3. Pipeline steps {#pipeline-steps}

Jump to a step below. Each block lists GPU, cost, duration, bullets, and copy-paste commands.

### Step 1: CLI, secrets, and repo

- **GPU:** None (local)
- **Cost:** $0
- **Duration:** ~5 min

```bash
# Modal CLI
pip install modal
modal setup

# Hugging Face (weights + Xet)
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx

# API key for /v1/chat/completions
modal secret create glm51-api-key API_KEY=your-production-key
```

**Summary:** Credentials never bake into images; Modal injects secrets at runtime.

- `huggingface-secret` — required for snapshot_download and Xet acceleration.
- `glm51-api-key` — API_KEY env passed into SGLang via _build_sglang_cmd.
- Keep deploy.py in a clean directory; run commands from that folder.

### Step 2: Download weights (CPU)

- **GPU:** None (CPU)
- **Cost:** ~$0.01/hr
- **Duration:** 30–60 min

```bash
modal run deploy.py::download_model
```

**Summary:** CPU-optimized download (~700 GB) into Volume glm51-model-weights. Uses huggingface-hub[hf_xet] for XetHub acceleration (content-defined chunking, often 2–3× faster than plain HTTP for large files).

- **Idempotent:** if model.safetensors.index.json exists, the run assumes the download is complete.
- **Retries:** up to 3 attempts with exponential backoff (30s, 60s, 90s) for transient network errors.
- After commit, serving containers mount the same volume path without re-pulling from the hub at runtime (when HF_HUB_OFFLINE=1 in serve).

### Step 3: Compile DeepGEMM (8×B200)

- **GPU:** 8×B200
- **Cost:** ~$8 one-time
- **Duration:** 10–15 min

```bash
modal run deploy.py::compile_deepgemm
```

**Summary:** DeepGEMM JIT-compiles FP8 GEMMs for GLM-5.1's matrix shapes on Blackwell (sm100). Caches under /dg-cache; marker file .compiled-GLM-5.1-FP8 skips repeat work.

- **Hardware lock:** compile on B200 (sm100). Kernels built here will not run on H200, A100, or T4.
- Typical one-time cost ~$8 for ~10 minutes on 8×B200; saves ~10–15 minutes of JIT on every cold start if omitted.
- Requires weights on disk first; implementation reloads volumes before compile and streams logs (text mode, line-buffered).

### Step 4: Verify volumes

- **GPU:** None (CPU)
- **Cost:** ~$0.001
- **Duration:** Instant

```bash
modal run deploy.py::verify_setup
```

**Summary:** Confirms volumes are consistent before you expose the API (matches reference checks).

- Verifies model.safetensors.index.json exists under the weight path.
- Verifies DeepGEMM marker .compiled-GLM-5.1-FP8 exists under the cache path.
- May print shard counts and size hints — fail fast with actionable errors if something is missing.

### Step 5: Deploy & smoke test

- **GPU:** On first request
- **Cost:** Pay per GPU-second
- **Duration:** Cold 6–10 min first time

```bash
modal deploy deploy.py
modal app logs glm-5.1-production

# After URL is live
curl -f https://<your-app>.modal.run/health
```

**Summary:** min_containers=0 — API scales to zero; first traffic after idle pays cold-start cost.

- **OpenAI-compatible:** POST /v1/chat/completions with Authorization: Bearer $API_KEY.
- **Prometheus:** GET /metrics on the same host.
- **Optional:** `modal run deploy.py` runs download → compile → verify in one shot.

---

## 4. End-to-end commands {#quickstart}

CLI setup, secrets, per-function runs, deploy, and health check. Adjust modal app logs to your app name in deploy.py.

```bash
# Prerequisites: Docker, Python 3.10+, Modal CLI
pip install modal
modal setup

# Secrets (never baked into images)
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
modal secret create glm51-api-key API_KEY=your-production-key

# One-time pipeline (download → compile → verify)
modal run deploy.py::download_model
modal run deploy.py::compile_deepgemm
modal run deploy.py::verify_setup

# Or run the chained entrypoint if your deploy.py defines it:
# modal run deploy.py

# Production deploy (scale-to-zero: min_containers=0)
modal deploy deploy.py

# Smoke test (replace host after deploy)
curl -f https://<your-app>.modal.run/health
```

---

## 5. Technical deep dives {#technical-deep-dives}

Answers to common questions about DeepGEMM, FP8 precision, volume consistency, and download acceleration.

### DeepGEMM Architecture Lock

**Why can't DeepGEMM kernels compiled on one GPU run on another?**

GPUs execute instructions via specific ISA (Instruction Set Architecture). DeepGEMM JIT-compiles CUDA kernels targeting specific SM architectures and matrix shapes determined by the model.

- Blackwell (B200) uses sm_100 with 5th-gen Tensor Cores and native FP8 support (E4M3/E5M2 formats) plus new MMA (Matrix Multiply-Accumulate) opcodes.
- Ampere (A100) uses sm_80, Hopper (H100/H200) uses sm_90 — each has different instruction sets.
- A kernel compiled for sm_100 contains binary instructions that older GPUs cannot decode. The CUDA PTX intermediate representation gets compiled to SASS (actual machine code) for a specific target.
- Compilation must happen on the exact hardware used for serving. B200 kernels won't run on H200, and vice versa.

### FP8 E4M3 Precision

**Why FP8 E4M3 specifically, and what are the trade-offs?**

GLM-5.1-FP8 uses NVIDIA's E4M3 format (4 exponent bits, 3 mantissa bits). This provides sufficient dynamic range for LLM weights while halving memory footprint versus BF16.

- **Throughput:** On Blackwell, FP8 operations deliver 2× throughput compared to FP16/BF16 because Tensor Cores process twice as many FP8 elements per cycle.
- **Memory bandwidth:** LLM inference is memory-bound. Reducing weight precision from 16-bit to 8-bit doubles effective bandwidth, directly increasing tokens/second.
- **Dynamic range:** E4M3 has range ±448, sufficient for normalized activations. The alternative E5M2 format has wider range but lower precision.
- **Accuracy:** Properly scaled FP8 typically shows <1% accuracy degradation on standard benchmarks versus BF16 on well-trained models.

### XetHub Download Acceleration

**How does huggingface-hub[hf_xet] speed up downloads?**

XetHub uses content-defined chunking (similar to Git's pack algorithm) optimized for large ML files, offering 2–3× download speeds over standard HTTP for multi-gigabyte artifacts.

- Standard HTTP downloads transfer entire files sequentially. XetHub chunks files by content boundaries, enabling parallel retrieval and deduplication.
- For a ~700 GB model, this can reduce download time from 90+ minutes to 30–60 minutes depending on network conditions.
- The hf_xet extra installs the optimized transfer client. Set HF_XET_HIGH_PERFORMANCE=1 to enable maximum parallelism.
- Partial downloads can resume from the last completed chunk without restarting from zero.

### Modal Volume Consistency

**Why do we call volume.reload() at startup?**

Modal Volumes are distributed filesystems with eventual consistency. If Container A writes data, Container B might not see it immediately without explicit synchronization.

- When download_model commits to the volume, the data is durably stored but metadata propagation across the distributed system takes time.
- A server container starting milliseconds after the commit might have a stale directory listing that doesn't show the new files.
- `volume.reload()` forces a metadata refresh from the central store, guaranteeing the container sees the latest committed state.
- Both `model_volume.reload()` and `dg_volume.reload()` are called in setup() to prevent FileNotFoundError on cold starts.

---

## 6. References {#references}

Further reading on Modal, SGLang, CUDA architecture, and large-artifact transfer.

1. [Modal - scale-to-zero serverless inference](https://modal.com/docs/guide/serve?utm_source=quantml.org)
2. [SGLang - DeepGEMM & FP8](https://github.com/sgl-project/sglang)
3. [Hugging Face - XetHub for large files](https://huggingface.co/blog/hf-xet)
4. [NVIDIA - Blackwell architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)
5. [NVIDIA - Tensor Core MMA (PTX)](https://docs.nvidia.com/cuda/parallel-thread-execution/#matrix-multiply-accumulate-instructions)
6. [NVIDIA - CUDA Virtual Architecture](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list)
7. [NVIDIA - FP8 Formats for Deep Learning (arXiv)](https://arxiv.org/abs/2209.05433)

---

## Related sections

- [Overview & Architecture](https://www.quantml.org/guides/glm-5-1-fp8)
- [Configuration & Flags](https://www.quantml.org/guides/glm-5-1-fp8/configuration)
- [Tune & Operate](https://www.quantml.org/guides/glm-5-1-fp8/operations)
- [Code Walkthrough](https://www.quantml.org/guides/glm-5-1-fp8/code)
