> **Note:** The canonical experience is the interactive HTML tab: [Modal deployment](https://www.quantml.org/guides/gemma-4-gguf/deployment). This file is a text mirror for search engines and AI tools.

# Gemma 4 GGUF — Modal deployment

This page is the end-to-end blueprint for running **Gemma-4-26B-A4B-it-GGUF** on [Modal](https://modal.com/docs/guide/apps) with [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md), [Modal Volumes](https://modal.com/docs/guide/volumes) for artifacts, and [Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot) so scale-to-zero remains practical. The priorities are predictable operations, reproducible builds, and a restore path that does not replay a multi-minute GPU weight load on every idle→active transition.

**Prerequisites (chips):** Python 3.10+ · Modal CLI (`pip install modal`) · Hugging Face token (weights) · NVIDIA L40S (48GB) baseline or equivalent · Modal memory snapshot support enabled

- **GPU envelope:** plan for roughly 25 GB class VRAM including weights, mmproj, KV, and buffers (see [Stack overview](https://www.quantml.org/guides/gemma-4-gguf) tables).
- **Infra theme:** static artifacts in Volume, ephemeral GPU for inference, snapshots for fast revive.
- **Platform context:** Modal’s [high-performance LLM inference](https://modal.com/docs/guide/high-performance-llm-inference) guide covers autoscaled GPU web endpoints and related primitives.

> **Info:** Log triage, concurrency tuning, and failure playbooks live in [Operate & compare](https://www.quantml.org/guides/gemma-4-gguf/operations). Runtime flags and benchmark targets are in [Runtime tuning](https://www.quantml.org/guides/gemma-4-gguf/configuration).

**Related markdown:** [Overview](https://www.quantml.org/guides/gemma-4-gguf.md) · [Configuration](https://www.quantml.org/guides/gemma-4-gguf/configuration.md) · [Features](https://www.quantml.org/guides/gemma-4-gguf/features.md) · [Operations](https://www.quantml.org/guides/gemma-4-gguf/operations.md)

---

## 1. Architecture {#architecture}

_Volumes hold immutable artifacts; containers mount them and run llama-server under Modal’s autoscaling policies._

Modal splits concerns cleanly: **storage** (GGUF, mmproj, compiled binary) lives in a [Modal Volume](https://modal.com/docs/guide/volumes); **compute** scales with traffic per [apps and scaling](https://modal.com/docs/guide/apps); **snapshot capture** ([Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot)) preserves a post-warm process image so new containers can rehydrate instead of cold-loading tensors from scratch each time. Scale-to-zero (`min_containers=0`) keeps idle cost near zero while snapshots make the next user session feel responsive.

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ Modal Cloud                                                                  │
│                                                                              │
│  [Image model directory: /models]                                           │
│  ├─ gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf ←─ huggingface-hub[hf_xet]            │
│  ├─ mmproj-F32.gguf               ←─ vision / multimodal                     │
│  └─ pulled at image build time                                               │
│                                                                              │
│  [Volume: llama-cpp-builds/<LLAMA_CPP_TAG>]                                 │
│  └─ llama-server + shared libs + interleaved.jinja                          │
│                                                                              │
│  [GPU Container: llama-server subprocess]                                     │
│  │  L40S (48GB baseline) — weights + KV + mmproj + system headroom           │
│  │  ├─ --no-mmap  (required for snapshot restore stability)                  │
│  │  ├─ OpenAI-compatible HTTP (/v1/chat/completions, /metrics, …)            │
│  │  └─ Warmup requests → then @modal.enter(snap=True) captures snapshot      │
│  └──────────────────────────────────────────────────────────────────────────┘
│         │ scale-to-zero: min_containers=0 when idle                          │
└─────────┼────────────────────────────────────────────────────────────────────┘
          ▼ HTTPS
    Clients (API, Cursor /v1/responses bridge, agents)
```

---

## 2. Production hardening patterns {#production-hardening}

_Eight things that bite Modal subprocess deployments in practice—apply these before you share the URL._

### 1. Call `volume.reload()` before serving

[Modal Volumes](https://modal.com/docs/guide/volumes) are eventually consistent across containers. If `build_llama_server` or `download_model` ran in container A and committed, container B (your server) may not see those files without an explicit `volume.reload()` at the top of the startup method. Add it before any path access or subprocess launch.
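
A minimal sketch of the pattern, assuming a Volume named `llama-cpp-builds` mounted on the serving class (names and mount points are illustrative; adapt to your `deploy.py`):

```python
import modal

# Illustrative Volume name; use whatever your deploy.py mounts (e.g. at /llama-build).
build_volume = modal.Volume.from_name("llama-cpp-builds")

@modal.enter(snap=True)
def setup(self):
    build_volume.reload()  # pick up commits made by build/download jobs in other containers
    # ... resolve artifact paths, launch llama-server, poll /health, warm up ...
```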

### 2. Sentinel files, not directory counts

Idempotency checks like `len(os.listdir(path)) > 5` will pass on partial or truncated downloads. Instead, check for the specific artifact you need: the GGUF file path, the compiled binary, or a marker file you write only after a successful commit. A file that must exist for the server to start is also a valid sentinel.
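
A sketch of a sentinel-style check, using the artifact paths from elsewhere in this guide plus a hypothetical `.download_complete` marker written only after a verified download:

```python
from pathlib import Path

GGUF = Path("/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf")
MMPROJ = Path("/models/mmproj-F32.gguf")
MARKER = Path("/models/.download_complete")  # hypothetical marker written after a successful commit

def artifacts_ready() -> bool:
    # Specific files, not directory counts: a partial or truncated download fails this check.
    return GGUF.is_file() and MMPROJ.is_file() and MARKER.is_file()
```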

### 3. Retry large downloads with backoff

A multi-GB download over a 30–60 minute window will see transient network failures. Wrap `snapshot_download` or your equivalent in a retry loop—3 attempts, 30s / 60s / 90s delays is a reasonable baseline. Log each failure and re-raise after exhaustion.
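
A sketch of the retry wrapper, assuming `huggingface_hub.snapshot_download` and the backoff schedule above:

```python
import time

from huggingface_hub import snapshot_download

def download_with_retry(repo_id: str, local_dir: str, attempts: int = 3) -> None:
    for attempt in range(1, attempts + 1):
        try:
            snapshot_download(repo_id=repo_id, local_dir=local_dir)
            return
        except Exception as exc:  # transient network errors, timeouts, truncated transfers
            print(f"[download] attempt {attempt}/{attempts} failed: {exc}")
            if attempt == attempts:
                raise
            time.sleep(30 * attempt)  # 30s after the first failure, 60s after the second
```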

### 4. Stream subprocess logs without causing deadlocks

Never let `stdout=PIPE` go unread. Kernel pipe buffers fill up, the child blocks on write, and the container hangs silently—often right during a slow weight load when you most need visibility. Use `text=True, bufsize=1` and drain lines in a daemon thread. One caveat: `bufsize=1` only line-buffers in text mode; in binary mode it sets a 1-byte buffer, which is catastrophically slow.

### 5. Poll `/health`—never use a fixed sleep

Replace `time.sleep(120)` with a loop that GETs the health endpoint, backs off exponentially, and has a hard timeout with failure logging. When `--no-mmap` forces a full sequential tensor read, startup time varies with GPU availability and memory bandwidth—a fixed sleep either wastes time or silently proceeds on a half-ready server.

### 6. Warm every code path you intend to snapshot

The snapshot captures whatever state warmup leaves behind. A text-only warmup misses vision tower initialization (if mmproj is loaded) and the adaptive thinking template path. Send at least: one short text prompt, one with thinking kwargs, and one with an image payload if multimodal is enabled. Keep token counts modest—the goal is to prime internal state, not to benchmark.
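
A sketch of a warmup routine that touches those paths, assuming an OpenAI-compatible `/v1/chat/completions` on localhost and the `chat_template_kwargs` thinking toggle described later in this guide; payload shapes are illustrative:

```python
import json
import urllib.request

def _post_chat(base: str, payload: dict) -> dict:
    # Add an Authorization header here if llama-server was started with --api-key.
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as r:
        return json.loads(r.read())

def _warmup_chat_completions(base: str) -> None:
    # 1) Plain text: tokenizer + attention + sampling.
    _post_chat(base, {"model": "gemma-4", "max_tokens": 16,
                      "messages": [{"role": "user", "content": "Say OK."}]})
    # 2) Thinking path: primes the adaptive thinking template.
    _post_chat(base, {"model": "gemma-4", "max_tokens": 32,
                      "chat_template_kwargs": {"enable_thinking": True},
                      "messages": [{"role": "user", "content": "What is 2+2?"}]})
    # 3) If mmproj is loaded, also send one request whose content list includes an
    #    {"type": "image_url", ...} part so the vision tower is initialized pre-snapshot.
```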

### 7. Gate the crash monitor until after startup completes

A background thread that calls `os._exit(1)` on `process.poll() is not None` should only activate *after* `snap=True` setup returns. Starting it during the weight load phase can kill the container if the process briefly restarts internally—turning a recoverable situation into a permanent crash loop.
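
One way to implement the gate, with a `threading.Event` that the startup method sets as its last action before returning (a sketch; names are illustrative):

```python
import os
import threading
import time

class CrashMonitor:
    """Exits the container if llama-server dies, but only after startup has completed."""

    def __init__(self, process) -> None:
        self.process = process
        self.serving = threading.Event()   # set right before @modal.enter(snap=True) returns
        threading.Thread(target=self._watch, daemon=True).start()

    def _watch(self) -> None:
        self.serving.wait()                # ignore any churn during weight load / warmup
        while self.process.poll() is None:
            time.sleep(2)
        print("[monitor] llama-server exited; recycling container")
        os._exit(1)
```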

### 8. Verify warmup response bodies, not just HTTP status

HTTP 200 does not mean inference is working. Parse the response body and check that `choices[0].message.content` is non-empty—or `tool_calls` / `reasoning_content` for those paths. An empty-content 200 that silently seals a broken snapshot is far worse than a visible failure during warmup.
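
A sketch of the body check, matching the fields named above:

```python
def assert_real_completion(resp: dict) -> None:
    """Fail warmup if a 200 response carries no usable output."""
    choices = resp.get("choices") or [{}]
    msg = choices[0].get("message", {})
    if not (msg.get("content") or msg.get("tool_calls") or msg.get("reasoning_content")):
        raise RuntimeError(f"empty completion during warmup: {resp}")
```

Call it on every warmup response before the startup method returns, so a broken state never becomes the golden snapshot.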

| Modal scaling knob | Starting value | When to change |
|--------------------|----------------|----------------|
| `min_containers` | 0 | Set to 1 only when latency SLO cannot tolerate any idle wake, accepting always-on cost. |
| `max_containers` | 2–3 | Caps peak GPU spend; raise if queue depth consistently grows under burst load. |
| `scaledown_window` | 900s | Keeps containers alive between bursty requests; lower for strict cost control. |
| `@modal.concurrent(max_inputs=…)` | Match `--parallel` slots | Modal-level queue depth before llama-server sees requests; tune alongside `--parallel`. |
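
In code, these knobs typically sit on the Modal class itself. A sketch under the assumption that the server is an `@app.cls()` class; app name, GPU string, and values are illustrative:

```python
import modal

app = modal.App("gemma4-gguf")

@app.cls(
    gpu="L40S",
    min_containers=0,           # scale to zero when idle
    max_containers=2,           # cap peak GPU spend
    scaledown_window=900,       # keep a warm container ~15 min after the last request
    enable_memory_snapshot=True,
    volumes={"/llama-build": modal.Volume.from_name("llama-cpp-builds")},
    secrets=[modal.Secret.from_name("huggingface-secret"),
             modal.Secret.from_name("gemma4-api-key")],
)
@modal.concurrent(max_inputs=4)  # match llama-server --parallel slots
class GemmaServer:
    ...
```

Keep `max_inputs` in lockstep with `--parallel` so Modal never admits more concurrent requests than llama-server has slots to schedule.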

---

## 3. Prerequisites {#prerequisites}

_Accounts, tokens, and hardware assumptions before you spend GPU time._

| Topic | Requirement |
|-------|---------------|
| Modal | Working CLI, billing enabled, an App name reserved for production vs staging. |
| Secrets | HF token for model pulls; API key secret for authenticating your HTTP surface. |
| Compute | Datacenter GPU with enough VRAM for UD-Q4 weights + projection + KV + slack (L40S-48GB is the baseline in deploy.py). |
| Snapshot compatibility | Modal provides memory snapshots; your job is to obey mmap and process constraints detailed below. |

Modal account setup and apps: [Serve and scale](https://modal.com/docs/guide/apps). Store `HF_TOKEN` and API keys with [Modal Secrets](https://modal.com/docs/guide/secrets).

---

## 4. Cold start vs snapshot economics {#economics}

_Why snapshot restore is load-bearing for scale-to-zero UX._

A naive cold start pays the full price every time: allocate GPU, map or read tens of gigabytes into VRAM, compile shaders where applicable, and only then accept user traffic. For this model class that can land in the **few-minute** range—unacceptable for interactive clients if “idle” is common. Persisting artifacts in a [Modal Volume](https://modal.com/docs/guide/volumes) already removes repeated hub downloads; [memory snapshots](https://modal.com/docs/guide/memory-snapshot) remove the repeated *compute* initialization once the snapshot exists. See also Modal’s [very large models](https://modal.com/docs/examples/very_large_models) example for transfer and mount patterns at this scale.

The trade is intentional: you spend meaningful time once building a checkpoint of a *warmed* [`llama-server`](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) process. After that, restore should be dominated by orchestration and memory remap—not tensor reload from network or disk in the hot path. If `--no-mmap` is missing, restore may fail or diverge ([Modal snapshot semantics](https://modal.com/docs/guide/memory-snapshot)); treat that flag as part of the economic model, not an optimization detail.

---

## 5. Builder vs runtime images {#two-image}

_Separate the compile toolchain from what you ship and snapshot._

**Builder image** includes compilers, CMake or equivalent, CUDA toolkit headers, and whatever you need to compile `llama-server` from [llama.cpp](https://github.com/ggml-org/llama.cpp). It is larger, slower to pull, and may carry a wider security surface. You run it only for compile jobs.

**Runtime image** holds the produced binary plus minimal shared libraries, supervision (often just your Python shim + subprocess), and health/metrics hooks. Smaller images reduce pull time for new workers and shrink the set of packages that must be trusted in production.

The compiled binary then lands on the Volume, versioned independently of the model bytes: bump `LLAMA_CPP_TAG` without re-downloading GGUF, or refresh weights without rebuilding unless the server's ABI expectations change.
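
A minimal sketch of the two-image split in Modal terms, assuming CUDA devel/runtime base images and the Xet-enabled hub client from later in this guide; tags and versions are illustrative, not prescriptive:

```python
import modal

# Illustrative base images and pins; substitute the digests your deploy.py actually uses.
builder_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential")            # compile toolchain only
)

runtime_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-runtime-ubuntu22.04", add_python="3.11")
    .pip_install("huggingface-hub[hf_xet]==0.30.2")            # hub client, no compilers
    .env({"HF_XET_HIGH_PERFORMANCE": "1"})
)
```

Only the compile function references `builder_image`; serving classes reference `runtime_image`, which keeps worker pulls small.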

---

## 6. Gemma 4 thinking-tag patch {#thinking-patch}

_Critical source patch for thinking budget support—without this, budget controls are silently ignored._

> **Warning:** **This patch is required for thinking budget to work.** The Gemma 4 dedicated parser in [llama.cpp (PR #21418)](https://github.com/ggml-org/llama.cpp/pull/21418) omits `thinking_start_tag` and `thinking_end_tag` that the reasoning budget sampler needs. Without this patch, `thinking_budget_tokens` is silently ignored.

**Root cause:** The function `common_chat_params_init_gemma4()` in `common/chat.cpp` sets `data.supports_thinking = true` but never sets `data.thinking_start_tag` or `data.thinking_end_tag`.

The reasoning budget code in `server-common.cpp` has a guard that checks if `thinking_end_tag` is empty—if it is, the entire budget block is skipped. Gemma 4's thinking markers (`<|channel>thought\n` and `<channel|>`) are handled by the PEG grammar but never exposed to the budget system.

**`build_llama_server()` patch** (applied after clone, before cmake):

```python
# Sketch of _patch_gemma4_thinking_tags(), applied to common/chat.cpp
# after git clone and before the cmake configure step.
from pathlib import Path

def _patch_gemma4_thinking_tags(chat_cpp: Path) -> None:
    old = 'data.supports_thinking = true;\n\n    data.preserved_tokens = {'
    new = (
        'data.supports_thinking = true;\n'
        '    data.thinking_start_tag = "<|channel>thought\\n";\n'
        '    data.thinking_end_tag = "<channel|>";\n'
        '\n    data.preserved_tokens = {'
    )
    content = chat_cpp.read_text()
    if old not in content:
        raise RuntimeError("Gemma 4 thinking-tag anchor not found; re-check LLAMA_CPP_TAG")
    chat_cpp.write_text(content.replace(old, new, 1))
```

**Verified results after patch**

| thinking_budget_tokens | Reasoning chars | Tokens | Behavior |
|------------------------|-----------------|--------|----------|
| 0 | 44 | 26 | Only budget transition message |
| 32 | 124 | 59 | Brief thinking + budget cutoff |
| 128 | 359 | 155 | Moderate thinking + budget cutoff |
| unlimited | 1004 | 522 | Full unconstrained thinking |

**Upgrade consideration:** When bumping `LLAMA_CPP_TAG`, verify this patch still applies cleanly. If upstream fixes the Gemma 4 parser to include thinking tags, the patch can be removed.

---

## 7. Xet storage downloads {#xet-storage}

_5–10x faster model downloads via Hugging Face's Xet storage backend._

**The problem:** Model files total ~19 GB (17.1 GB GGUF + 2 GB mmproj). With regular HTTP download via `huggingface-hub`, this takes **5–10 minutes** during image build.

**The solution:** [Unsloth's published GGUF repo](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) uses HF's Xet storage backend. Installing the `hf_xet` extra enables chunked transport that can be **5–10x faster**—the same 19 GB downloads in ~1–2 minutes.

**`xet-storage-setup.py`**

```python
# In your Modal image definition:
.pip_install("huggingface-hub[hf_xet]==0.30.2")
.env({"HF_XET_HIGH_PERFORMANCE": "1"})

# Without hf_xet, you'll see this warning:
# "Xet Storage is enabled for this repo, but the 'hf_xet' 
#  package is not installed. Falling back to regular HTTP download."
```

| Configuration | Download time (~19 GB) | Notes |
|---------------|------------------------|-------|
| Regular HTTP | 5–10 min | Default huggingface-hub behavior |
| Xet Storage + HF_XET_HIGH_PERFORMANCE | ~1–2 min | Chunked transport via hf_xet |

**HF_TOKEN requirement:** Even though the model is public (Apache 2.0), `hf_hub_download` benefits from authentication—it avoids rate limits and is required for Xet storage. Store as a Modal secret: `modal.Secret.from_name("huggingface-secret")`.

**Offline serving:** Set `HF_HUB_OFFLINE=1` in serving containers after the first successful sync to prevent accidental re-downloads at runtime.

---

## 8. Pipeline steps {#pipeline}

_Copy-paste sequence from empty workspace to a health-checked endpoint._

### Step 1: CLI, secrets, and workspace

**Summary:** Install the Modal CLI, authenticate, and create secrets for Hugging Face downloads and your inference API key. Nothing here belongs in source code or Docker layers.

| | |
|--|--|
| GPU | None |
| Cost | $0 |
| Duration | ~5 min |

- `modal setup` links your Modal account and configures credentials locally.
- `huggingface-secret` carries `HF_TOKEN` for `huggingface_hub` snapshot downloads.
- `gemma4-api-key` (or similar) becomes `API_KEY` for authenticating clients against your server.
- Keep `deploy.py` and related assets in a dedicated directory so paths in Modal functions stay predictable.

```bash
pip install "modal[huggingface]" huggingface-hub
modal setup

modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
modal secret create gemma4-api-key API_KEY=your-production-key
```

### Step 2: Builder image: compile llama-server

**Summary:** Compile `llama-server` from a pinned `llama.cpp` commit so production and experiments share one reproducible binary. The builder image holds compilers and headers; only the binary is copied into the lean runtime image and volume.

| | |
|--|--|
| GPU | None (CPU build) or GPU image for CUDA-linked builds |
| Cost | CPU build time |
| Duration | ~10–25 min |

- Pin `LLAMA_CPP_TAG` to a specific git ref; upgrades are deliberate, not accidental drift.
- If you rely on Gemma-specific HTTP features (for example adaptive `enable_thinking`), confirm they exist at that ref or carry a small patch (often `common/chat.cpp`).
- Write the resulting `llama-server` binary into the Modal Volume so every replica mounts the same artifact.
- Treat this step as infrequent: cache success and only rebuild when you change the tag or patch set.

```bash
modal run deploy.py::build_llama_server
# or: modal run deploy.py::build  (if your script bundles compile + install)
```

### Step 3: Download GGUF + mmproj into the volume

**Summary:** Pull `gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` and `mmproj-F32.gguf` into `/models` at image build time. `huggingface_hub[hf_xet]` speeds large artifact transfer via chunked transport.

| | |
|--|--|
| GPU | None |
| Cost | Egress / CPU |
| Duration | ~20–60 min |

- Idempotent scripts should skip work when shard fingerprints or marker files already exist.
- Commit the build volume after compile so subsequent containers can mount `llama-server` and shared libraries from `/llama-build/<tag>`.
- Set `HF_HUB_OFFLINE=1` in serving containers after the first successful sync to prevent accidental re-downloads.

```bash
modal run deploy.py::download_model
```

### Step 4: Optional: verify volume consistency

**Summary:** Fail fast before you expose users: confirm both weight files exist, sizes look sane, and the compiled binary is present.

| | |
|--|--|
| GPU | None |
| Cost | ~$0 |
| Duration | 1–2 min |

- Check GGUF + mmproj paths match what `_build_llama_cmd()` passes to `--model` and `--mmproj`.
- Optionally hash a small header or read GGUF metadata to catch truncated downloads.
- If verification fails, fix data before paying GPU time on a broken tree.

```bash
modal run deploy.py::verify_model  # name may vary in your repo
```

### Step 5: Deploy, warm, snapshot

**Summary:** Run `modal deploy` so Modal schedules the service. On first container start the subprocess loads weights into VRAM, serves a short synthetic warmup, then Modal’s `@modal.enter(snap=True)` path freezes process state.

| | |
|--|--|
| GPU | L40S baseline (or equivalent) |
| Cost | GPU seconds during warmup + snapshot |
| Duration | First cold several min; restore much faster |

- `--no-mmap` must be set before snapshot capture: mmap-backed weight loading has been unstable across restore boundaries.
- Warmup primes caches and exercises code paths you care about (tokenizer, attention, optional vision tower if used).
- Expect snapshot creation to be a meaningful one-time cost; amortize it across many future idle→active transitions.
- `modal deploy deploy.py` publishes the app; watch logs until `/health` passes on the issued URL.

```bash
modal deploy deploy.py
modal app logs <your-app-name>

curl -fsS https://<app>.modal.run/health
```

### Step 6: Smoke test the API surface

**Summary:** Exercise chat completions, and—if you expose it—`/v1/responses` for Cursor-shaped calls. Confirm auth rejects missing keys and Prometheus metrics scrape if enabled.

| | |
|--|--|
| GPU | Negligible |
| Cost | Inference only |
| Duration | ~2 min |

- Send a minimal `POST /v1/chat/completions` with your `API_KEY` header or bearer token, matching how llama-server expects auth.
- If you bridge Responses API, send a representative payload and confirm field mapping to chat completions.
- Optionally `GET /metrics` and confirm token counters move under load.

```bash
curl -fsS https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"ping"}]}'
```

**`deploy-e2e.sh`**

```bash
# Typical end-to-end script; align function names with your deploy.py

pip install modal "huggingface-hub[hf_xet]"
modal setup

modal secret create huggingface-secret HF_TOKEN=$HF_TOKEN
modal secret create gemma4-api-key API_KEY=$API_KEY

modal run deploy.py::build_llama_server
modal run deploy.py::download_model
# modal run deploy.py::verify_model   # if implemented

modal deploy deploy.py
curl -fsS https://<app>.modal.run/health
```

---

## 9. Snapshots and the --no-mmap contract {#criu-mmap}

_What gets frozen, why file-backed mappings break restore, and how to prove the contract holds._

### What snapshotting is doing in this stack

**Checkpoint/Restore In Userspace** serializes a live process tree so a new host can reconstruct it without replaying your full startup script. For inference, that means: threads, register state, virtual memory areas (VMAs), open file descriptors, credentials, and relevant network state. The goal is to hand the next container an image that already has weights resident in the form the runtime expects—not to re-execute disk I/O and GPU initialization from scratch.

Modal’s `@modal.enter(snap=True)` path aligns with [memory snapshot](https://modal.com/docs/guide/memory-snapshot) semantics: you finish *warm* work, then seal state. Snapshotting cannot magically fix inconsistent mappings: if restore cannot faithfully recreate how virtual memory points at files or devices, you get subtle corruption, immediate crash, or worse—silent wrong answers.

### Why `mmap` fights checkpoint/restore

Default weight loading often uses memory-mapped files: pages are demand-paged from the GGUF on disk, shared with the page cache, and may be shared read-only across processes. That is efficient for normal servers, but the mapping is a contract between three parties: **virtual address**, **file identity** (path, inode, offsets), and **kernel bookkeeping**. After checkpoint, a new container may mount the same Volume path but still differ in ways that break the mapping contract: inode generation, bind-mount layout, device minor numbers, or timing of lazy faults. Restore metadata must match precisely; when anything drifts, restore either fails or maps the wrong bytes into the address space.

GPU memory is a separate concern: snapshots capture what the Modal stack and platform integration support. Treat **GPU state + driver version + binary ABI** as part of your release matrix. If you change CUDA driver assumptions or rebuild `llama-server` without forcing a fresh snapshot cycle, expect restore-time failures that look like “random CUDA errors.”

### What `--no-mmap` buys you

In [llama.cpp](https://github.com/ggml-org/llama.cpp), `--no-mmap` forces tensors to be read into ordinary allocated memory rather than file-backed mappings (see the [server README](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)). Initial load can be slower and use more RSS during the read, but the steady-state memory image is far closer to “a big blob of anonymous pages + known file descriptors” that checkpoint tooling can round-trip. You are trading optimal Linux page-cache sharing for **restore determinism**—which is the right trade for snapshot-first scale-to-zero.

### Operational guardrails

- Centralize command construction in one function (`_build_llama_cmd()`) and unit-test that the argv always contains `--no-mmap`.
- Log the final argv at info level on startup (redact secrets) so production incidents show whether a bad deploy dropped the flag.
- After any change to Volume mount paths or filenames, run one full idle→restore cycle before promoting: snapshot bugs often surface only on second boot.
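
A minimal test for the first guardrail above, assuming `_build_llama_cmd()` is importable from your deploy module:

```python
# test_no_mmap.py: run with pytest as part of every release
from deploy import _build_llama_cmd

def test_argv_always_contains_no_mmap():
    assert "--no-mmap" in _build_llama_cmd()
```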

Related Modal pattern: [SGLang with snapshots](https://modal.com/docs/examples/sglang_snapshot) (different engine, same snapshot idea).

| Symptom after idle / restore | Likely cause |
|------------------------------|--------------|
| Restore fails immediately or llama-server exits on first request | Missing `--no-mmap`, binary/GPU driver mismatch vs snapshot, or weight path not identical across boots. |
| Nonsense outputs or sudden NaNs after restore | Rare but catastrophic: treat as memory image corruption—rebuild snapshot with pinned binary and verified weights. |
| Long pause but eventual success | Cold path still doing heavy work; confirm you are hitting restore (Modal logs) vs full re-init. |

### Snapshot size and RAM tradeoffs

Anonymous mappings still occupy space in the checkpoint image. A full GPU-resident model state can make snapshots large; that is expected. The economic win is not “tiny images”—it is **amortizing minutes of load work** across many wakeups. If snapshot creation time or storage cost becomes painful, tune upstream (fewer warm steps, slimmer warmup) in staging before touching production—see [Runtime tuning](https://www.quantml.org/guides/gemma-4-gguf/configuration) for runtime flags.

**`criu-guardrails.py`**

```python
# _build_llama_cmd() must always pass --no-mmap (assert in tests):

def _build_llama_cmd() -> list[str]:
    cmd = [
        "/app/llama-server-bin/llama-server",
        "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
        "--mmproj", "/models/mmproj-F32.gguf",
        "--no-mmap",
        # ... host, port, ctx, parallel, cache types, etc.
    ]
    assert "--no-mmap" in cmd
    return cmd
```

---

## 10. Server lifecycle in code {#lifecycle}

_Subprocess boundaries, logging, health, warmup content, teardown, and where the snapshot commits._

### Why a subprocess (not in-process load)

Run `llama-server` as a **child process** so the Python Modal class stays a thin supervisor: it can stream logs, enforce timeouts on readiness, capture exit codes, and optionally restart without losing the whole container on a single CUDA fault. If the server segfaults, your wrapper can log, emit metrics, and exit cleanly so the platform replaces the worker rather than wedging a shared library in the same address space as Modal’s runtime.

### Logging without deadlocks

If you attach `PIPE` to stdout/stderr, you must continuously drain those pipes from another thread or async reader. Undrained pipes can block the child once the kernel buffer fills, which looks like a mysterious hang right after startup. Prefer line-buffered reads and forward each line to Modal’s logging with a prefix (e.g. `[llama]`). Alternatively, let the server log to files on the Volume if you truly need retention—but for most cases, streaming to Modal logs is enough.

### Readiness: poll `/health`, do not sleep

Replace `time.sleep(120)` with a loop that GETs `http://127.0.0.1:<port>/health` (or your bound address) with exponential backoff and a hard cap. Log every N failures so stalled boots are visible. The health endpoint should flip to success only when the model can actually accept completions—not merely when the HTTP socket listens.

### Warmup: what to send before `snap=True` seals

Warmup should exercise the code paths you care about in production. At minimum: one short chat completion that touches tokenizer + attention + sampling. If you use multimodal weights, include a tiny image payload so vision towers and mmproj paths are initialized. If you rely on adaptive thinking, send one request with `chat_template_kwargs.enable_thinking` so the template and response shape you need are hot. Keep warmup token counts modest—enough to prime, not enough to dominate snapshot time.

### Where the snapshot commits

`@modal.enter(snap=True)` means: when this method returns successfully, Modal may treat the resulting process state as the template for future containers. Therefore **do not return until** health is green and warmup finished. If you return early, you snapshot a half-ready server and pay debugging cost forever. If you need background work after serving starts, schedule it only after you are sure it should be part of the golden image—or accept a second snapshot cycle.

### Shutdown and replacement

On container teardown, send `SIGTERM` to the child, wait with a timeout, then `SIGKILL` if needed. Unclosed GPU contexts can delay exit; log wall-clock time for shutdown. For long-running deployments, pair this with the operational notes in [Operate & compare](https://www.quantml.org/guides/gemma-4-gguf/operations) for recycle and diagnostics.

**`deploy.py` (supervisor sketch)**

```python
import subprocess
import threading
import time
import urllib.request

import modal

# Lives on the @app.cls()-decorated server class; shown flat here for brevity.
@modal.enter(snap=True)
def setup(self):
    self.process = subprocess.Popen(
        _build_llama_cmd(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered where possible
    )

    def _pump_logs():
        assert self.process.stdout is not None
        for line in self.process.stdout:
            print(f"[llama-server] {line.rstrip()}")

    threading.Thread(target=_pump_logs, daemon=True).start()

    base = "http://127.0.0.1:8080"
    _wait_for_health(f"{base}/health", timeout_s=600)
    _warmup_chat_completions(base)  # short prompts; optional image / thinking probes
    # Successful return => snapshot boundary for new workers

def _wait_for_health(url: str, timeout_s: int) -> None:
    deadline = time.time() + timeout_s
    delay = 0.5
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as r:
                if r.status == 200:
                    return
        except OSError:
            pass
        time.sleep(delay)
        delay = min(delay * 1.5, 5.0)
    raise RuntimeError(f"health never ready: {url}")
```

**`deploy.py` (teardown sketch)**

```python
import signal
import subprocess

import modal

@modal.exit()
def teardown(self):
    if getattr(self, "process", None) and self.process.poll() is None:
        self.process.send_signal(signal.SIGTERM)
        try:
            self.process.wait(timeout=30)
        except subprocess.TimeoutExpired:
            self.process.kill()
```

---

## 11. Validation checklist {#validation}

_Phased checks from static artifacts through auth, latency, and correctness—before you share the URL._

### Phase A — Volume and process contract

- On the build host or a one-off `modal run`, print the resolved paths passed to `--model`, `--mmproj`, and the `llama-server` binary. Compare against `ls -la` on the mounted Volume.
- Confirm file sizes are in the expected ballpark (multi-GB GGUF, mmproj on the order of gigabytes). Truncated downloads often pass existence checks but fail at first load.
- Grep runtime logs for the argv or startup banner so `--no-mmap` appears exactly once and is not overridden by a wrapper script.

### Phase B — Authentication and negative tests

- Call `/v1/chat/completions` **without** credentials and expect 401/403 (whatever your server is configured to return). Misconfigured auth often ships as “open by accident.”
- Repeat with an intentionally wrong key. Then with the production key. Document the header shape (`Authorization: Bearer …`) in your runbook so client teams do not guess.

### Phase C — Latency: warm path vs idle wake

- Measure time-to-first-token (TTFT) on a warm container (traffic just happened). Then scale to zero, wait until the platform confirms idle, send one request, and measure TTFT again. The second number is what intermittent users feel—optimize that, not only steady-state.
- Log Modal’s transition timestamps if available; correlate with client-side stopwatch to separate queueing from server work.
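
One way to approximate both TTFT numbers from the client side, a sketch that uses the first streamed byte as a proxy for the first token (endpoint, model name, and key handling are illustrative):

```python
import json
import time
import urllib.request

def measure_ttft(base: str, api_key: str) -> float:
    """Return seconds until the first streamed byte of a chat completion arrives."""
    payload = {"model": "gemma-4", "stream": True, "max_tokens": 8,
               "messages": [{"role": "user", "content": "ping"}]}
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=300) as r:
        r.read(1)  # first streamed byte ~ time to first token
    return time.monotonic() - start

# Run once against a warm container, then again after an idle period longer than
# scaledown_window so the second measurement includes the snapshot restore.
```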

### Phase D — Functional and multimodal correctness

- Text: short deterministic prompt (“Return the word OK only”) to catch garbled output or template failures.
- Multimodal: if `--mmproj` is enabled, send a minimal valid image payload and assert a structured response—vision failures should not be discovered in production traffic.
- Adaptive thinking: one request with template kwargs enabling thinking; confirm your client can parse both answer and reasoning fields if your product depends on them. Details live under [APIs & clients](https://www.quantml.org/guides/gemma-4-gguf/features).

| Signal | Where to look |
|--------|----------------|
| Request volume and error rate | Server logs + HTTP status histogram; spike in 5xx after idle points to restore or OOM. |
| Tokens/sec and queue depth | If Prometheus is enabled on llama-server, scrape `/metrics` after validation load. |
| GPU memory headroom | `nvidia-smi` in one-off debug container or platform metrics; OOM during warmup must be caught before snapshot. |

**`validation-curls.sh`**

```bash
BASE="https://<app>.modal.run"
KEY="$API_KEY"

# Negative test (expect failure)
curl -sS -o /dev/null -w "%{http_code}\n" "$BASE/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"hi"}]}'

# Positive test
curl -sS "$BASE/v1/chat/completions" \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"Say OK."}],"max_tokens":16}'
```

---

## 12. Upgrading llama.cpp {#upgrades}

_Pin tags, rebase patches, rebuild binaries, refresh snapshots, and roll back with confidence._

### Why upgrades are a coordinated release

Bumping `llama-server` changes HTTP behavior, tokenizer handling, CUDA kernels, and sometimes GGUF expectations. Your Volume still holds the same GGUF bytes, but the binary interpreting them changed. Treat every bump of `LLAMA_CPP_TAG` as a **mini release**: compile, run the full validation matrix, then redeploy—never only swap the binary in place on a live snapshot without a new capture cycle.

### Recommended sequence

1. Read upstream release notes or compare commits between your old and new tag; search for breaking changes in `server`, `ggml`, and CUDA backends.
2. Rebase local patches (for example Gemma chat-template / thinking-related edits in `common/chat.cpp`). Conflicts here are common—resolve in a branch, not on the production builder.
3. Run `build_llama_server` (or your equivalent) against the new tag; write the new binary to a staging path or staging Volume first.
4. Boot the server against the **same** GGUF and mmproj you use in prod; run Phase A–D validation from the checklist above.
5. Deploy to staging Modal app, force at least one idle period, and confirm restore still works. Snapshot incompatibilities often appear only after the second boot.
6. Promote: update production Volume path or swap symlink, redeploy, and let a fresh `snap=True` cycle capture the new golden image. Keep the previous binary on disk until you are confident—rollback is then a pointer swap.

### CUDA / driver coupling

If Modal’s base image or GPU driver revision moves in parallel with your upgrade, test together. A new `llama-server` built against one CUDA toolkit and run under another is a frequent source of “works on my builder” failures. Pin builder and runtime image digests in version control when possible.

### Rollback

Keep the previous `LLAMA_CPP_TAG` and matching binary artifact addressable (git tag, Volume path, or object name). Rollback is: redeploy known-good binary, run health + one chat completion, trigger a new snapshot. Do not assume an old snapshot image remains compatible after you have upgraded drivers or changed GPU type.

| Upgrade risk | Mitigation |
|--------------|------------|
| Patch no longer applies cleanly | Cherry-pick upstream fixes first; reduce custom diff to the minimum you truly need. |
| New server rejects old API fields | Diff OpenAPI or server README between tags; run contract tests against `/v1/chat/completions`. |
| Restore works but quality regressed | Separate infra validation from model QA—run your eval harness on the new binary before promotion. |

---

## 13. References {#references}

_Official docs for Modal snapshots, llama.cpp server, and model artifacts. Same URLs as inline links._

1. [Modal: Serve and scale](https://modal.com/docs/guide/apps)
2. [Modal Volumes](https://modal.com/docs/guide/volumes)
3. [Modal: Secrets](https://modal.com/docs/guide/secrets)
4. [Modal: Serve very large models](https://modal.com/docs/examples/very_large_models)
5. [Modal: High-performance LLM inference](https://modal.com/docs/guide/high-performance-llm-inference)
6. [Modal: SGLang with snapshots (related pattern)](https://modal.com/docs/examples/sglang_snapshot)
7. [Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot)
8. [llama.cpp server README](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)
9. [HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF)

---

## Other guide tabs

- [Stack overview](https://www.quantml.org/guides/gemma-4-gguf) · [markdown](https://www.quantml.org/guides/gemma-4-gguf.md)
- [Runtime tuning](https://www.quantml.org/guides/gemma-4-gguf/configuration) · [markdown](https://www.quantml.org/guides/gemma-4-gguf/configuration.md)
- [APIs & clients](https://www.quantml.org/guides/gemma-4-gguf/features) · [markdown](https://www.quantml.org/guides/gemma-4-gguf/features.md)
- [Operate & compare](https://www.quantml.org/guides/gemma-4-gguf/operations) · [markdown](https://www.quantml.org/guides/gemma-4-gguf/operations.md)
