Gemma-4-26B-A4B-it-GGUF on Modal
Modal deployment
Build llama-server, model artifacts, Volumes, GPU memory snapshots, and the --no-mmap contract.
This page is the end-to-end blueprint for running Gemma-4-26B-A4B-it-GGUF on Modal with llama-server, Modal Volumes for artifacts, and Modal memory snapshots so scale-to-zero remains practical. The priorities are predictable operations, reproducible builds, and a restore path that does not replay a multi-minute GPU weight load on every idle→active transition.
- GPU envelope: plan for roughly 25 GB class VRAM including weights, mmproj, KV, and buffers (see overview tables).
- Infra theme: static artifacts in Volume, ephemeral GPU for inference, snapshots for fast revive.
- Platform context: Modal's high-performance LLM inference guide covers autoscaled GPU web endpoints and related primitives.
Log triage, concurrency tuning, and failure playbooks live in Operate & compare. Runtime flags and benchmark targets are in Runtime tuning.
Architecture
Volumes hold immutable artifacts; containers mount them and run llama-server under Modal’s autoscaling policies.
Modal splits concerns cleanly: storage (GGUF, mmproj, compiled binary) lives in a Modal Volume; compute scales with traffic per apps and scaling; snapshot capture (Modal memory snapshots) preserves a post-warm process image so new containers can rehydrate instead of cold-loading tensors from scratch each time. Scale-to-zero (min_containers=0) keeps idle cost near zero while snapshots make the next user session feel responsive.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Modal Cloud │
│ │
│ [Image model directory: /models] │
│ ├─ gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf ←─ huggingface-hub[hf_xet] │
│ ├─ mmproj-F32.gguf ←─ vision / multimodal │
│ └─ pulled at image build time │
│ │
│ [Volume: llama-cpp-builds/<LLAMA_CPP_TAG>] │
│ └─ llama-server + shared libs + interleaved.jinja │
│ │
│ [GPU Container: llama-server subprocess] │
│ │ L40S (48GB baseline) — weights + KV + mmproj + system headroom │
│ │ ├─ --no-mmap (required for snapshot restore stability) │
│ │ ├─ OpenAI-compatible HTTP (/v1/chat/completions, /metrics, …) │
│ │ └─ Warmup requests → then @modal.enter(snap=True) captures snapshot │
│ └──────────────────────────────────────────────────────────────────────────┘
│ │ scale-to-zero: min_containers=0 when idle │
└─────────┼────────────────────────────────────────────────────────────────────┘
▼ HTTPS
          Clients (API, Cursor /v1/responses bridge, agents)

Production hardening patterns
Eight things that bite Modal subprocess deployments in practice—apply these before you share the URL.
1. Call volume.reload() before serving
Modal Volumes are eventually consistent across containers. If build_llama_server or download_model ran in container A and committed, container B (your server) may not see those files without an explicit volume.reload() at the top of the startup method. Add it before any path access or subprocess launch.
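A minimal sketch of that guard, assuming a `modal.Volume` handle (which exposes `reload()`) and hypothetical artifact paths — the helper name is illustrative:

```python
from pathlib import Path


def assert_artifacts_visible(volume, paths: list[Path]) -> None:
    # Volumes are eventually consistent across containers: a commit made by
    # the build/download container may not be visible here until we
    # explicitly refresh volume metadata.
    volume.reload()
    missing = [p for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(f"artifacts not visible after reload: {missing}")
```

Call it at the top of the startup method, before any path access or subprocess launch.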
2. Sentinel files, not directory counts
Idempotency checks like len(os.listdir(path)) > 5 will pass on partial or truncated downloads. Instead, check for the specific artifact you need: the GGUF file path, the compiled binary, or a marker file you write only after a successful commit. A file that must exist for the server to start is also a valid sentinel.
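One way to phrase the sentinel check — the marker-file convention and names here are illustrative, not taken from deploy.py:

```python
from pathlib import Path


def download_complete(model_dir: Path, gguf_name: str, min_bytes: int) -> bool:
    """Idempotency via a specific artifact plus a size floor, not directory counts."""
    gguf = model_dir / gguf_name
    # Written only after a verified, committed download — never before.
    marker = model_dir / ".download_complete"
    return marker.exists() and gguf.exists() and gguf.stat().st_size >= min_bytes
```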
3. Retry large downloads with backoff
A multi-GB download over a 30–60 minute window will see transient network failures. Wrap snapshot_download or your equivalent in a retry loop—3 attempts, 30s / 60s / 90s delays is a reasonable baseline. Log each failure and re-raise after exhaustion.
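A sketch of that retry loop; the zero-argument callable would wrap your `snapshot_download` (or equivalent) call:

```python
import logging
import time


def retry_download(fn, attempts: int = 3, delays=(30, 60, 90)):
    """Run `fn` with fixed backoff delays; log each failure, re-raise after exhaustion."""
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            logging.warning("download attempt %d/%d failed: %s", i + 1, attempts, exc)
            if i == attempts - 1:
                raise
            time.sleep(delays[min(i, len(delays) - 1)])
```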
4. Stream subprocess logs without causing deadlocks
Never let stdout=PIPE go unread. Kernel pipe buffers fill up, the child blocks on write, and the container hangs silently—often right during a slow weight load when you most need visibility. Use text=True, bufsize=1 and drain lines in a daemon thread. One caveat: bufsize=1 only line-buffers in text mode; in binary mode it sets a 1-byte buffer, which is catastrophically slow.
5. Poll /health—never use a fixed sleep
Replace time.sleep(120) with a loop that GETs the health endpoint, backs off exponentially, and has a hard timeout with failure logging. When --no-mmap forces a full sequential tensor read, startup time varies with GPU availability and memory bandwidth—a fixed sleep either wastes time or silently proceeds on a half-ready server.
6. Warm every code path you intend to snapshot
The snapshot captures whatever state warmup leaves behind. A text-only warmup misses vision tower initialization (if mmproj is loaded) and the adaptive thinking template path. Send at least: one short text prompt, one with thinking kwargs, and one with an image payload if multimodal is enabled. Keep token counts modest—the goal is to prime internal state, not to benchmark.
7. Gate the crash monitor until after startup completes
A background thread that calls os._exit(1) on process.poll() is not None should only activate after snap=True setup returns. Starting it during the weight load phase can kill the container if the process briefly restarts internally—turning a recoverable situation into a permanent crash loop.
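A gated monitor sketch; the `startup_done` event and the injectable `on_crash` hook (defaulting to `os._exit(1)`) are illustrative design choices, not names from deploy.py:

```python
import os
import threading
import time


def start_crash_monitor(process, startup_done: threading.Event,
                        on_crash=lambda rc: os._exit(1), poll_s: float = 1.0):
    """Watch the child process, but stay inert until startup has completed."""
    def _watch():
        startup_done.wait()  # gate: do nothing during the weight-load phase
        while process.poll() is None:
            time.sleep(poll_s)
        # Default action hard-exits so the platform replaces the worker.
        on_crash(process.returncode)

    t = threading.Thread(target=_watch, daemon=True)
    t.start()
    return t
```

Set `startup_done` only after the `snap=True` setup method returns.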
8. Verify warmup response bodies, not just HTTP status
HTTP 200 does not mean inference is working. Parse the response body and check that choices[0].message.content is non-empty—or tool_calls / reasoning_content for those paths. An empty-content 200 that silently seals a broken snapshot is far worse than a visible failure during warmup.
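A body check along those lines, assuming the OpenAI-compatible response shape llama-server emits:

```python
def warmup_response_ok(body: dict) -> bool:
    """True only if the completion actually carries content, tool calls, or reasoning."""
    try:
        msg = body["choices"][0]["message"]
    except (KeyError, IndexError, TypeError):
        return False
    return bool(
        msg.get("content") or msg.get("tool_calls") or msg.get("reasoning_content")
    )
```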
| Modal scaling knob | Starting value | When to change |
|---|---|---|
| min_containers | 0 | Set to 1 only when latency SLO cannot tolerate any idle wake, accepting always-on cost. |
| max_containers | 2–3 | Caps peak GPU spend; raise if queue depth consistently grows under burst load. |
| scaledown_window | 900s | Keeps containers alive between bursty requests; lower for strict cost control. |
| @modal.concurrent(max_inputs=…) | Match --parallel slots | Modal-level queue depth before llama-server sees requests; tune alongside --parallel. |
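As a configuration sketch, the knobs above might land on the service class as follows. Parameter names (`min_containers`, `max_containers`, `scaledown_window`, `enable_memory_snapshot`) and the `@modal.concurrent` decorator assume a recent Modal SDK — verify them against your installed version before relying on this:

```python
import modal

app = modal.App("gemma4-llama-server")

@app.cls(
    gpu="L40S",
    volumes={"/llama-build": modal.Volume.from_name("llama-cpp-builds")},
    min_containers=0,          # scale-to-zero; snapshots make the wake tolerable
    max_containers=3,          # cap peak GPU spend
    scaledown_window=900,      # seconds to linger after the last request
    enable_memory_snapshot=True,
)
@modal.concurrent(max_inputs=8)  # match llama-server --parallel slots
class Gemma4Server:
    ...
```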
Prerequisites
Accounts, tokens, and hardware assumptions before you spend GPU time.
| Topic | Requirement |
|---|---|
| Modal | Working CLI, billing enabled, an App name reserved for production vs staging. |
| Secrets | HF token for model pulls; API key secret for authenticating your HTTP surface. |
| Compute | Datacenter GPU with enough VRAM for UD-Q4 weights + projection + KV + slack (L40S-48GB is the baseline in deploy.py). |
| Snapshot compatibility | Modal provides memory snapshots; your job is to obey mmap and process constraints detailed below. |
Modal account setup and apps: Serve and scale. Store HF_TOKEN and API keys with Modal Secrets.
Cold start vs snapshot economics
Why snapshot restore is load-bearing for scale-to-zero UX.
A naive cold start pays the full price every time: allocate GPU, map or read tens of gigabytes into VRAM, compile shaders where applicable, and only then accept user traffic. For this model class that can land in the few-minute range—unacceptable for interactive clients if “idle” is common. Persisting artifacts in a Modal Volume already removes repeated hub downloads; memory snapshots remove the repeated compute initialization once the snapshot exists. See also Modal's very large models example for transfer and mount patterns at this scale.
The trade is intentional: you spend meaningful time once building a checkpoint of a warmed llama-server process. After that, restore should be dominated by orchestration and memory remap—not tensor reload from network or disk in the hot path. If --no-mmap is missing, restore may fail or diverge (Modal snapshot semantics); treat that flag as part of the economic model, not an optimization detail.
Builder vs runtime images
Separate the compile toolchain from what you ship and snapshot.
Builder image includes compilers, CMake or equivalent, CUDA toolkit headers, and whatever you need to compile llama-server from llama.cpp. It is larger, slower to pull, and may carry a wider security surface. You run it only for compile jobs.
Runtime image holds the produced binary plus minimal shared libraries, systemd-style supervision (often just your Python shim + subprocess), and health/metrics hooks. Smaller images reduce pull time for new workers and shrink the set of packages that must be trusted in production.
The compiled binary then lands on the Volume next to weights so both builder churn and model bytes are versioned independently: bump LLAMA_CPP_TAG without re-downloading GGUF, or refresh weights without rebuilding unless the server ABI expectations changed.
Gemma 4 thinking-tag patch
Critical source patch for thinking budget support—without this, budget controls are silently ignored.
The Gemma 4 dedicated parser in llama.cpp (PR #21418) omits the `thinking_start_tag` and `thinking_end_tag` fields that the reasoning budget sampler needs; without this patch, `thinking_budget_tokens` is silently ignored.
Root cause: The function `common_chat_params_init_gemma4()` in `common/chat.cpp` sets `data.supports_thinking = true` but never sets `data.thinking_start_tag` or `data.thinking_end_tag`.
The reasoning budget code in `server-common.cpp` has a guard that checks if `thinking_end_tag` is empty—if it is, the entire budget block is skipped. Gemma 4's thinking markers (`<|channel>thought\n` and `<channel|>`) are handled by the PEG grammar but never exposed to the budget system.
```python
# The patch applied by _patch_gemma4_thinking_tags():

old = 'data.supports_thinking = true;\n\n    data.preserved_tokens = {'
new = (
    'data.supports_thinking = true;\n'
    '    data.thinking_start_tag = "<|channel>thought\\n";\n'
    '    data.thinking_end_tag = "<channel|>";\n'
    '\n    data.preserved_tokens = {'
)

# Applied to common/chat.cpp after git clone, before cmake
```
Verified results after patch
| thinking_budget_tokens | Reasoning chars | Tokens | Behavior |
|---|---|---|---|
| 0 | 44 | 26 | Only budget transition message |
| 32 | 124 | 59 | Brief thinking + budget cutoff |
| 128 | 359 | 155 | Moderate thinking + budget cutoff |
| unlimited | 1004 | 522 | Full unconstrained thinking |
Upgrade consideration: When bumping LLAMA_CPP_TAG, verify this patch still applies cleanly. If upstream fixes the Gemma 4 parser to include thinking tags, the patch can be removed.
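One hedged way to automate that check inside a helper like `_patch_gemma4_thinking_tags()` — the real helper's body may differ; the point is to fail loudly when the anchor string drifts, and to no-op if upstream already sets the tags:

```python
from pathlib import Path


def apply_thinking_tag_patch(chat_cpp: Path, old: str, new: str) -> None:
    """Patch common/chat.cpp, or detect that the patch is obsolete/unapplicable."""
    src = chat_cpp.read_text()
    if "thinking_start_tag" in src and old not in src:
        return  # already patched, or upstream fixed the parser: nothing to do
    if old not in src:
        raise RuntimeError(f"patch anchor not found in {chat_cpp}; rebase the diff")
    chat_cpp.write_text(src.replace(old, new, 1))
```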
Xet storage downloads
5-10x faster model downloads via Hugging Face's Xet storage backend.
The problem: Model files total ~19 GB (17.1 GB GGUF + 2 GB mmproj). With regular HTTP download via huggingface-hub, this takes 5-10 minutes during image build.
The solution: Unsloth's published GGUF repo uses HF's Xet storage backend. Installing the hf_xet extra enables chunked transport that can be 5-10x faster—the same 19 GB downloads in ~1-2 minutes.
```python
# In your Modal image definition:
.pip_install("huggingface-hub[hf_xet]==0.30.2")
.env({"HF_XET_HIGH_PERFORMANCE": "1"})

# Without hf_xet, you'll see this warning:
# "Xet Storage is enabled for this repo, but the 'hf_xet'
#  package is not installed. Falling back to regular HTTP download."
```
| Configuration | Download time (~19 GB) | Notes |
|---|---|---|
| Regular HTTP | 5-10 min | Default huggingface-hub behavior |
| Xet Storage + HF_XET_HIGH_PERFORMANCE | ~1-2 min | Chunked transport via hf_xet |
HF_TOKEN requirement: Even though the model is public (Apache 2.0), hf_hub_download benefits from authentication—it avoids rate limits and is required for Xet storage. Store as a Modal secret: modal.Secret.from_name("huggingface-secret").
Offline serving: Set HF_HUB_OFFLINE=1 in serving containers after the first successful sync to prevent accidental re-downloads at runtime.
Pipeline steps
Copy-paste sequence from empty workspace to a health-checked endpoint.
Step 1: CLI, secrets, and workspace
Install the Modal CLI, authenticate, and create secrets for Hugging Face downloads and your inference API key. Nothing here belongs in source code or Docker layers.
- `modal setup` links your Modal account and configures credentials locally.
- `huggingface-secret` carries `HF_TOKEN` for `huggingface_hub` snapshot downloads.
- `gemma4-api-key` (or similar) becomes `API_KEY` for authenticating clients against your server.
- Keep `deploy.py` and related assets in a dedicated directory so paths in Modal functions stay predictable.
```bash
pip install "modal[huggingface]" huggingface-hub
modal setup

modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
modal secret create gemma4-api-key API_KEY=your-production-key
```
Step 2: Builder image: compile llama-server
Compile `llama-server` from a pinned `llama.cpp` commit so production and experiments share one reproducible binary. The builder image holds compilers and headers; only the binary is copied into the lean runtime image and volume.
- Pin `LLAMA_CPP_TAG` to a specific git ref; upgrades are deliberate, not accidental drift.
- If you rely on Gemma-specific HTTP features (for example adaptive `enable_thinking`), confirm they exist at that ref or carry a small patch (often `common/chat.cpp`).
- Write the resulting `llama-server` binary into the Modal Volume so every replica mounts the same artifact.
- Treat this step as infrequent: cache success and only rebuild when you change the tag or patch set.
```bash
modal run deploy.py::build_llama_server
# or: modal run deploy.py::build (if your script bundles compile + install)
```
Step 3: Download GGUF + mmproj into the volume
Pull `gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` and `mmproj-F32.gguf` into `/models` at image build time. `huggingface_hub[hf_xet]` speeds large artifact transfer via chunked transport.
- Idempotent scripts should skip work when shard fingerprints or marker files already exist.
- Commit the build volume after compile so subsequent containers can mount `llama-server` and shared libraries from `/llama-build/<tag>`.
- Set `HF_HUB_OFFLINE=1` in serving containers after the first successful sync to prevent accidental re-downloads.
```bash
modal run deploy.py::download_model
```
Step 4: Optional: verify volume consistency
Fail fast before you expose users: confirm both weight files exist, sizes look sane, and the compiled binary is present. This mirrors production-grade GLM-style pipelines that gate deploy on structural checks.
- Check GGUF + mmproj paths match what `_build_llama_cmd()` passes to `--model` and `--mmproj`.
- Optionally hash a small header or read GGUF metadata to catch truncated downloads.
- If verification fails, fix data before paying GPU time on a broken tree.
```bash
modal run deploy.py::verify_model  # name may vary in your repo
```
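A structural check along those lines. GGUF files begin with the ASCII magic bytes `GGUF`, so a four-byte read plus a size floor catches most truncated downloads; the threshold value here is an assumption, not a spec constant:

```python
from pathlib import Path


def check_gguf_header(path: Path, min_bytes: int = 1_000_000_000) -> None:
    """Cheap pre-deploy gate: size floor + GGUF magic bytes."""
    size = path.stat().st_size
    if size < min_bytes:
        raise ValueError(f"{path}: only {size} bytes; looks truncated")
    with path.open("rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError(f"{path}: missing GGUF magic bytes")
```

Run it against both the main GGUF and the mmproj file before paying GPU time.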
Step 5: Deploy, warm, snapshot
Run `modal deploy` so Modal schedules the service. On first container start the subprocess loads weights into VRAM, serves a short synthetic warmup, then Modal’s `@modal.enter(snap=True)` path freezes process state. Restores after idle should feel closer to “warm” than “reload 17 GB from disk each time.”
- `--no-mmap` must be set before snapshot capture: mmap-backed weight loading has been unstable across restore boundaries.
- Warmup primes caches and exercises code paths you care about (tokenizer, attention, optional vision tower if used).
- Expect snapshot creation to be a meaningful one-time cost; amortize it across many future idle→active transitions.
- `modal deploy deploy.py` publishes the app; watch logs until `/health` passes on the issued URL.
```bash
modal deploy deploy.py
modal app logs <your-app-name>

curl -fsS https://<app>.modal.run/health
```
Step 6: Smoke test the API surface
Exercise chat completions, and—if you expose it—`/v1/responses` for Cursor-shaped calls. Confirm auth rejects missing keys and Prometheus metrics scrape if enabled.
- Send a minimal `POST /v1/chat/completions` with your `API_KEY` header or bearer token, matching how llama-server expects auth.
- If you bridge Responses API, send a representative payload and confirm field mapping to chat completions.
- Optionally `GET /metrics` and confirm token counters move under load.
```bash
curl -fsS https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"ping"}]}'
```
```bash
# Typical end-to-end script; align function names with your deploy.py

pip install modal "huggingface-hub[hf_xet]"
modal setup

modal secret create huggingface-secret HF_TOKEN=$HF_TOKEN
modal secret create gemma4-api-key API_KEY=$API_KEY

modal run deploy.py::build_llama_server
modal run deploy.py::download_model
# modal run deploy.py::verify_model  # if implemented

modal deploy deploy.py
curl -fsS https://<app>.modal.run/health
```
Snapshots and the --no-mmap contract
What gets frozen, why file-backed mappings break restore, and how to prove the contract holds.
What snapshotting is doing in this stack
Checkpoint/Restore In Userspace serializes a live process tree so a new host can reconstruct it without replaying your full startup script. For inference, that means: threads, register state, virtual memory areas (VMAs), open file descriptors, credentials, and relevant network state. The goal is to hand the next container an image that already has weights resident in the form the runtime expects—not to re-execute disk I/O and GPU initialization from scratch.
Modal's @modal.enter(snap=True) path aligns with memory snapshot semantics: you finish warm work, then seal state. Snapshotting cannot magically fix inconsistent mappings: if restore cannot faithfully recreate how virtual memory points at files or devices, you get subtle corruption, immediate crash, or worse—silent wrong answers.
Why mmap fights checkpoint/restore
Default weight loading often uses memory-mapped files: pages are demand-paged from the GGUF on disk, shared with the page cache, and may be shared read-only across processes. That is efficient for normal servers, but the mapping is a contract between three parties: virtual address, file identity (path, inode, offsets), and kernel bookkeeping. After checkpoint, a new container may mount the same Volume path but still differ in ways that break the mapping contract: inode generation, bind-mount layout, device minor numbers, or timing of lazy faults. Restore metadata must match precisely; when anything drifts, restore either fails or maps the wrong bytes into the address space.
GPU memory is a separate concern: snapshots capture what the Modal stack and platform integration support. Treat GPU state + driver version + binary ABI as part of your release matrix. If you change CUDA driver assumptions or rebuild llama-server without forcing a fresh snapshot cycle, expect restore-time failures that look like “random CUDA errors.”
What --no-mmap buys you
In llama.cpp, --no-mmap forces tensors to be read into ordinary allocated memory rather than file-backed mappings (see the server README). Initial load can be slower and use more RSS during the read, but the steady-state memory image is far closer to “a big blob of anonymous pages + known file descriptors” that checkpoint tooling can round-trip. You are trading optimal Linux page-cache sharing for restore determinism—which is the right trade for snapshot-first scale-to-zero.
Operational guardrails
- Centralize command construction in one function (`_build_llama_cmd()`) and unit-test that the argv always contains `--no-mmap`.
- Log the final argv at info level on startup (redact secrets) so production incidents show whether a bad deploy dropped the flag.
- After any change to Volume mount paths or filenames, run one full idle→restore cycle before promoting: snapshot bugs often surface only on second boot.
Related Modal pattern: SGLang with snapshots (different engine, same snapshot idea).
| Symptom after idle / restore | Likely cause |
|---|---|
| Restore fails immediately or llama-server exits on first request | Missing `--no-mmap`, binary/GPU driver mismatch vs snapshot, or weight path not identical across boots. |
| Nonsense outputs or sudden NaNs after restore | Rare but catastrophic: treat as memory image corruption—rebuild snapshot with pinned binary and verified weights. |
| Long pause but eventual success | Cold path still doing heavy work; confirm you are hitting restore (Modal logs) vs full re-init. |
Snapshot size and RAM tradeoffs
Anonymous mappings still occupy space in the checkpoint image. A full GPU-resident model state can make snapshots large; that is expected. The economic win is not “tiny images”—it is amortizing minutes of load work across many wakeups. If snapshot creation time or storage cost becomes painful, tune upstream (fewer warm steps, slimmer warmup) in staging before touching production—see Runtime tuning for runtime flags.
```python
# _build_llama_cmd() must always pass --no-mmap (assert in tests):

def _build_llama_cmd() -> list[str]:
    cmd = [
        "/app/llama-server-bin/llama-server",
        "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
        "--mmproj", "/models/mmproj-F32.gguf",
        "--no-mmap",
        # ... host, port, ctx, parallel, cache types, etc.
    ]
    assert "--no-mmap" in cmd
    return cmd
```
Server lifecycle in code
Subprocess boundaries, logging, health, warmup content, teardown, and where the snapshot commits.
Why a subprocess (not in-process load)
Run llama-server as a child process so the Python Modal class stays a thin supervisor: it can stream logs, enforce timeouts on readiness, capture exit codes, and optionally restart without losing the whole container on a single CUDA fault. If the server segfaults, your wrapper can log, emit metrics, and exit cleanly so the platform replaces the worker rather than wedging a shared library in the same address space as Modal’s runtime.
Logging without deadlocks
If you attach PIPE to stdout/stderr, you must continuously drain those pipes from another thread or async reader. Undrained pipes can block the child once the kernel buffer fills, which looks like a mysterious hang right after startup. Prefer line-buffered reads and forward each line to Modal’s logging with a prefix (e.g. [llama]). Alternatively, let the server log to files on the Volume if you truly need retention—but for most cases, streaming to Modal logs is enough.
Readiness: poll /health, do not sleep
Replace time.sleep(120) with a loop that GETs http://127.0.0.1:<port>/health (or your bound address) with exponential backoff and a hard cap. Log every N failures so stalled boots are visible. The health endpoint should flip to success only when the model can actually accept completions—not merely when the HTTP socket listens.
Warmup: what to send before snap=True seals
Warmup should exercise the code paths you care about in production. At minimum: one short chat completion that touches tokenizer + attention + sampling. If you use multimodal weights, include a tiny image payload so vision towers and mmproj paths are initialized. If you rely on adaptive thinking, send one request with chat_template_kwargs.enable_thinking so the template and response shape you need are hot. Keep warmup token counts modest—enough to prime, not enough to dominate snapshot time.
Where the snapshot commits
@modal.enter(snap=True) means: when this method returns successfully, Modal may treat the resulting process state as the template for future containers. Therefore do not return until health is green and warmup finished. If you return early, you snapshot a half-ready server and pay debugging cost forever. If you need background work after serving starts, schedule it only after you are sure it should be part of the golden image—or accept a second snapshot cycle.
Shutdown and replacement
On container teardown, send SIGTERM to the child, wait with a timeout, then SIGKILL if needed. Unclosed GPU contexts can delay exit; log wall-clock time for shutdown. For long-running deployments, pair this with the operational notes in Operate & compare for recycle and diagnostics.
```python
import subprocess
import threading
import time
import urllib.request

@modal.enter(snap=True)
def setup(self):
    self.process = subprocess.Popen(
        _build_llama_cmd(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered where possible
    )

    def _pump_logs():
        assert self.process.stdout is not None
        for line in self.process.stdout:
            print(f"[llama-server] {line.rstrip()}")

    threading.Thread(target=_pump_logs, daemon=True).start()

    base = "http://127.0.0.1:8080"
    _wait_for_health(f"{base}/health", timeout_s=600)
    _warmup_chat_completions(base)  # short prompts; optional image / thinking probes
    # Successful return => snapshot boundary for new workers

def _wait_for_health(url: str, timeout_s: int) -> None:
    deadline = time.time() + timeout_s
    delay = 0.5
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as r:
                if r.status == 200:
                    return
        except OSError:
            pass
        time.sleep(delay)
        delay = min(delay * 1.5, 5.0)
    raise RuntimeError(f"health never ready: {url}")
```
```python
import signal
import subprocess

@modal.exit()
def teardown(self):
    if getattr(self, "process", None) and self.process.poll() is None:
        self.process.send_signal(signal.SIGTERM)
        try:
            self.process.wait(timeout=30)
        except subprocess.TimeoutExpired:
            self.process.kill()
```
Validation checklist
Phased checks from static artifacts through auth, latency, and correctness—before you share the URL.
Phase A — Volume and process contract
- On the build host or a one-off `modal run`, print the resolved paths passed to `--model`, `--mmproj`, and the `llama-server` binary. Compare against `ls -la` on the mounted Volume.
- Confirm file sizes are in the expected ballpark (multi-GB GGUF, mmproj on the order of gigabytes). Truncated downloads often pass existence checks but fail at first load.
- Grep runtime logs for the argv or startup banner so `--no-mmap` appears exactly once and is not overridden by a wrapper script.
Phase B — Authentication and negative tests
- Call `/v1/chat/completions` without credentials and expect 401/403 (whatever your server is configured to return). Misconfigured auth often ships as “open by accident.”
- Repeat with an intentionally wrong key. Then with the production key. Document the header shape (`Authorization: Bearer …`) in your runbook so client teams do not guess.
Phase C — Latency: warm path vs idle wake
- Measure time-to-first-token (TTFT) on a warm container (traffic just happened). Then scale to zero, wait until the platform confirms idle, send one request, and measure TTFT again. The second number is what intermittent users feel—optimize that, not only steady-state.
- Log Modal’s transition timestamps if available; correlate with client-side stopwatch to separate queueing from server work.
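The client-side stopwatch reduces to timing the first non-empty streamed delta; a transport-agnostic sketch (the caller feeds it whatever iterator of content chunks its HTTP client produces):

```python
import time
from typing import Iterable


def time_to_first_token(deltas: Iterable[str]) -> float:
    """Seconds from request start until the first non-empty streamed delta."""
    start = time.monotonic()
    for delta in deltas:
        if delta:  # ignore empty keep-alive / role-only chunks
            return time.monotonic() - start
    raise RuntimeError("stream ended without producing content")
```

Run it once against a warm container and once after a confirmed idle period; the gap is the restore cost your intermittent users feel.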
Phase D — Functional and multimodal correctness
- Text: short deterministic prompt (“Return the word OK only”) to catch garbled output or template failures.
- Multimodal: if `--mmproj` is enabled, send a minimal valid image payload and assert a structured response—vision failures should not be discovered in production traffic.
- Adaptive thinking: one request with template kwargs enabling thinking; confirm your client can parse both answer and reasoning fields if your product depends on them. Details live under APIs & clients.
| Signal | Where to look |
|---|---|
| Request volume and error rate | Server logs + HTTP status histogram; spike in 5xx after idle points to restore or OOM. |
| Tokens/sec and queue depth | If Prometheus is enabled on llama-server, scrape /metrics after validation load. |
| GPU memory headroom | nvidia-smi in one-off debug container or platform metrics; OOM during warmup must be caught before snapshot. |
```bash
BASE="https://<app>.modal.run"
KEY="$API_KEY"

# Negative test (expect failure)
curl -sS -o /dev/null -w "%{http_code}\n" "$BASE/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"hi"}]}'

# Positive test
curl -sS "$BASE/v1/chat/completions" \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"Say OK."}],"max_tokens":16}'
```
Upgrading llama.cpp
Pin tags, rebase patches, rebuild binaries, refresh snapshots, and roll back with confidence.
Why upgrades are a coordinated release
Bumping llama-server changes HTTP behavior, tokenizer handling, CUDA kernels, and sometimes GGUF expectations. Your Volume still holds the same GGUF bytes, but the binary interpreting them changed. Treat every bump of LLAMA_CPP_TAG as a mini release: compile, run the full validation matrix, then redeploy—never only swap the binary in place on a live snapshot without a new capture cycle.
Recommended sequence
- Read upstream release notes or compare commits between your old and new tag; search for breaking changes in `server`, `ggml`, and CUDA backends.
- Rebase local patches (for example Gemma chat-template / thinking-related edits in `common/chat.cpp`). Conflicts here are common—resolve in a branch, not on the production builder.
- Run `build_llama_server` (or your equivalent) against the new tag; write the new binary to a staging path or staging Volume first.
- Boot the server against the same GGUF and mmproj you use in prod; run Phase A–D validation from the checklist above.
- Deploy to staging Modal app, force at least one idle period, and confirm restore still works. Snapshot incompatibilities often appear only after the second boot.
- Promote: update production Volume path or swap symlink, redeploy, and let a fresh `snap=True` cycle capture the new golden image. Keep the previous binary on disk until you are confident—rollback is then a pointer swap.
CUDA / driver coupling
If Modal’s base image or GPU driver revision moves in parallel with your upgrade, test together. A new llama-server built against one CUDA toolkit and run under another is a frequent source of “works on my builder” failures. Pin builder and runtime image digests in version control when possible.
Rollback
Keep the previous LLAMA_CPP_TAG and matching binary artifact addressable (git tag, Volume path, or object name). Rollback is: redeploy known-good binary, run health + one chat completion, trigger a new snapshot. Do not assume an old snapshot image remains compatible after you have upgraded drivers or changed GPU type.
| Upgrade risk | Mitigation |
|---|---|
| Patch no longer applies cleanly | Cherry-pick upstream fixes first; reduce custom diff to the minimum you truly need. |
| New server rejects old API fields | Diff OpenAPI or server README between tags; run contract tests against /v1/chat/completions. |
| Restore works but quality regressed | Separate infra validation from model QA—run your eval harness on the new binary before promotion. |
References
Official docs for Modal snapshots, llama.cpp server, and model artifacts.
Same URLs as the inline links above; open in a new tab for full context.
- Modal: Serve and scale (modal.com)
- Modal Volumes (modal.com)
- Modal: Secrets (modal.com)
- Modal: Serve very large models (modal.com)
- Modal: High-performance LLM inference (modal.com)
- Modal: SGLang with snapshots, related pattern (modal.com)
- Modal memory snapshots (modal.com)
- llama.cpp server README (github.com)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)