Gemma-4-26B-A4B-it-GGUF on Modal
Modal deployment
Build llama-server, model artifacts, Volumes, GPU memory snapshots, and the --no-mmap contract.
This page is the end-to-end blueprint for running Gemma-4-26B-A4B-it-GGUF on Modal with llama-server, Modal Volumes for artifacts, and Modal memory snapshots so scale-to-zero remains practical. The priorities are predictable operations, reproducible builds, and a restore path that does not replay a multi-minute GPU weight load on every idle→active transition.
- GPU envelope: plan for roughly 25 GB class VRAM including weights, mmproj, KV, and buffers (see overview tables).
- Infra theme: static artifacts in Volume, ephemeral GPU for inference, snapshots for fast revive.
- Platform context: Modal's high-performance LLM inference guide covers autoscaled GPU web endpoints and related primitives.
Log triage, concurrency tuning, and failure playbooks live in Operate & compare. Runtime flags and benchmark targets are in Runtime tuning.
Architecture
Volumes hold immutable artifacts; containers mount them and run llama-server under Modal’s autoscaling policies.
Modal splits concerns cleanly: storage (GGUF, mmproj, compiled binary) lives in a Modal Volume; compute scales with traffic per apps and scaling; snapshot capture (Modal memory snapshots) preserves a post-warm process image so new containers can rehydrate instead of cold-loading tensors from scratch each time. Scale-to-zero (min_containers=0) keeps idle cost near zero while snapshots make the next user session feel responsive.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Modal Cloud │
│ │
│ [Image model directory: /models] │
│ ├─ gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf ←─ huggingface-hub[hf_xet] │
│ ├─ mmproj-F32.gguf ←─ vision / multimodal │
│ └─ pulled at image build time │
│ │
│ [Volume: llama-cpp-builds/<LLAMA_CPP_TAG>] │
│ └─ llama-server + shared libs + interleaved.jinja │
│ │
│ [GPU Container: llama-server subprocess] │
│ │ L40S (48GB baseline) — weights + KV + mmproj + system headroom │
│ │ ├─ --no-mmap (required for snapshot restore stability) │
│ │ ├─ OpenAI-compatible HTTP (/v1/chat/completions, /metrics, …) │
│ │ └─ Warmup requests → then @modal.enter(snap=True) captures snapshot │
│ └──────────────────────────────────────────────────────────────────────────┘
│ │ scale-to-zero: min_containers=0 when idle │
└─────────┼────────────────────────────────────────────────────────────────────┘
▼ HTTPS
          Clients (API, Cursor /v1/responses bridge, agents)

Production hardening patterns
Eight things that bite Modal subprocess deployments in practice—apply these before you share the URL.
1. Call volume.reload() before serving
Modal Volumes are eventually consistent across containers. If build_llama_server or download_model ran in container A and committed, container B (your server) may not see those files without an explicit volume.reload() at the top of the startup method. Add it before any path access or subprocess launch.
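A minimal sketch of that guard, assuming a `modal.Volume` handle (which exposes `reload()`) and hypothetical artifact paths — the helper name is illustrative:

```python
from pathlib import Path


def assert_artifacts_visible(volume, paths: list[Path]) -> None:
    # Volumes are eventually consistent across containers: a commit made by
    # the build/download container may not be visible here until we
    # explicitly refresh volume metadata.
    volume.reload()
    missing = [p for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(f"artifacts not visible after reload: {missing}")
```

Call it at the top of the startup method, before any path access or subprocess launch.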
2. Sentinel files, not directory counts
Idempotency checks like len(os.listdir(path)) > 5 will pass on partial or truncated downloads. Instead, check for the specific artifact you need: the GGUF file path, the compiled binary, or a marker file you write only after a successful commit. A file that must exist for the server to start is also a valid sentinel.
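One way to phrase the sentinel check — the marker-file convention and names here are illustrative, not taken from deploy.py:

```python
from pathlib import Path


def download_complete(model_dir: Path, gguf_name: str, min_bytes: int) -> bool:
    """Idempotency via a specific artifact plus a size floor, not directory counts."""
    gguf = model_dir / gguf_name
    # Written only after a verified, committed download — never before.
    marker = model_dir / ".download_complete"
    return marker.exists() and gguf.exists() and gguf.stat().st_size >= min_bytes
```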
3. Retry large downloads with backoff
A multi-GB download over a 30–60 minute window will see transient network failures. Wrap snapshot_download or your equivalent in a retry loop—3 attempts, 30s / 60s / 90s delays is a reasonable baseline. Log each failure and re-raise after exhaustion.
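A sketch of that retry loop; the zero-argument callable would wrap your `snapshot_download` (or equivalent) call:

```python
import logging
import time


def retry_download(fn, attempts: int = 3, delays=(30, 60, 90)):
    """Run `fn` with fixed backoff delays; log each failure, re-raise after exhaustion."""
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            logging.warning("download attempt %d/%d failed: %s", i + 1, attempts, exc)
            if i == attempts - 1:
                raise
            time.sleep(delays[min(i, len(delays) - 1)])
```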
4. Stream subprocess logs without causing deadlocks
Never let stdout=PIPE go unread. Kernel pipe buffers fill up, the child blocks on write, and the container hangs silently—often right during a slow weight load when you most need visibility. Use text=True, bufsize=1 and drain lines in a daemon thread. One caveat: bufsize=1 only line-buffers in text mode; in binary mode it sets a 1-byte buffer, which is catastrophically slow.
5. Poll /health—never use a fixed sleep
Replace time.sleep(120) with a loop that GETs the health endpoint, backs off exponentially, and has a hard timeout with failure logging. When --no-mmap forces a full sequential tensor read, startup time varies with GPU availability and memory bandwidth—a fixed sleep either wastes time or silently proceeds on a half-ready server.
6. Warm every code path you intend to snapshot
The snapshot captures whatever state warmup leaves behind. A text-only warmup misses vision tower initialization (if mmproj is loaded) and the adaptive thinking template path. Send at least: one short text prompt, one with thinking kwargs, and one with an image payload if multimodal is enabled. Keep token counts modest—the goal is to prime internal state, not to benchmark.
7. Gate the crash monitor until after startup completes
A background thread that calls os._exit(1) on process.poll() is not None should only activate after snap=True setup returns. Starting it during the weight load phase can kill the container if the process briefly restarts internally—turning a recoverable situation into a permanent crash loop.
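A gated monitor sketch; the `startup_done` event and the injectable `on_crash` hook (defaulting to `os._exit(1)`) are illustrative design choices, not names from deploy.py:

```python
import os
import threading
import time


def start_crash_monitor(process, startup_done: threading.Event,
                        on_crash=lambda rc: os._exit(1), poll_s: float = 1.0):
    """Watch the child process, but stay inert until startup has completed."""
    def _watch():
        startup_done.wait()  # gate: do nothing during the weight-load phase
        while process.poll() is None:
            time.sleep(poll_s)
        # Default action hard-exits so the platform replaces the worker.
        on_crash(process.returncode)

    t = threading.Thread(target=_watch, daemon=True)
    t.start()
    return t
```

Set `startup_done` only after the `snap=True` setup method returns.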
8. Verify warmup response bodies, not just HTTP status
HTTP 200 does not mean inference is working. Parse the response body and check that choices[0].message.content is non-empty—or tool_calls / reasoning_content for those paths. An empty-content 200 that silently seals a broken snapshot is far worse than a visible failure during warmup.
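A body check along those lines, assuming the OpenAI-compatible response shape llama-server emits:

```python
def warmup_response_ok(body: dict) -> bool:
    """True only if the completion actually carries content, tool calls, or reasoning."""
    try:
        msg = body["choices"][0]["message"]
    except (KeyError, IndexError, TypeError):
        return False
    return bool(
        msg.get("content") or msg.get("tool_calls") or msg.get("reasoning_content")
    )
```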
| Modal scaling knob | Starting value | When to change |
|---|---|---|
| min_containers | 0 | Set to 1 only when latency SLO cannot tolerate any idle wake, accepting always-on cost. |
| max_containers | 2–3 | Caps peak GPU spend; raise if queue depth consistently grows under burst load. |
| scaledown_window | 900s | Keeps containers alive between bursty requests; lower for strict cost control. |
| @modal.concurrent(max_inputs=…) | Match --parallel slots | Modal-level queue depth before llama-server sees requests; tune alongside --parallel. |
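As a configuration sketch, the knobs above might land on the service class as follows. Parameter names (`min_containers`, `max_containers`, `scaledown_window`, `enable_memory_snapshot`) and the `@modal.concurrent` decorator assume a recent Modal SDK — verify them against your installed version before relying on this:

```python
import modal

app = modal.App("gemma4-llama-server")

@app.cls(
    gpu="L40S",
    volumes={"/llama-build": modal.Volume.from_name("llama-cpp-builds")},
    min_containers=0,          # scale-to-zero; snapshots make the wake tolerable
    max_containers=3,          # cap peak GPU spend
    scaledown_window=900,      # seconds to linger after the last request
    enable_memory_snapshot=True,
)
@modal.concurrent(max_inputs=8)  # match llama-server --parallel slots
class Gemma4Server:
    ...
```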
Prerequisites
Accounts, tokens, and hardware assumptions before you spend GPU time.
| Topic | Requirement |
|---|---|
| Modal | Working CLI, billing enabled, an App name reserved for production vs staging. |
| Secrets | HF token for model pulls; API key secret for authenticating your HTTP surface. |
| Compute | Datacenter GPU with enough VRAM for UD-Q4 weights + projection + KV + slack (L40S-48GB is the baseline in deploy.py). |
| Snapshot compatibility | Modal provides memory snapshots; your job is to obey mmap and process constraints detailed below. |
Modal account setup and apps: Serve and scale. Store HF_TOKEN and API keys with Modal Secrets.
Cold start vs snapshot economics
Why snapshot restore is load-bearing for scale-to-zero UX.
A naive cold start pays the full price every time: allocate GPU, map or read tens of gigabytes into VRAM, compile shaders where applicable, and only then accept user traffic. For this model class that can land in the few-minute range—unacceptable for interactive clients if “idle” is common. Persisting artifacts in a Modal Volume already removes repeated hub downloads; memory snapshots remove the repeated compute initialization once the snapshot exists. See also Modal's very large models example for transfer and mount patterns at this scale.
The trade is intentional: you spend meaningful time once building a checkpoint of a warmed llama-server process. After that, restore should be dominated by orchestration and memory remap—not tensor reload from network or disk in the hot path. If --no-mmap is missing, restore may fail or diverge (Modal snapshot semantics); treat that flag as part of the economic model, not an optimization detail.
Builder vs runtime images
Separate the compile toolchain from what you ship and snapshot.
Builder image includes compilers, CMake or equivalent, CUDA toolkit headers, and whatever you need to compile llama-server from llama.cpp. It is larger, slower to pull, and may carry a wider security surface. You run it only for compile jobs.
Runtime image holds the produced binary plus minimal shared libraries, systemd-style supervision (often just your Python shim + subprocess), and health/metrics hooks. Smaller images reduce pull time for new workers and shrink the set of packages that must be trusted in production.
The compiled binary then lands on the Volume next to weights so both builder churn and model bytes are versioned independently: bump LLAMA_CPP_TAG without re-downloading GGUF, or refresh weights without rebuilding unless the server ABI expectations changed.
Gemma 4 thinking-tag patch
Critical source patch for thinking budget support—without this, budget controls are silently ignored.
The Gemma 4 dedicated parser in llama.cpp (PR #21418) omits the `thinking_start_tag` and `thinking_end_tag` fields that the reasoning budget sampler needs; without this patch, `thinking_budget_tokens` is silently ignored.
Root cause: The function `common_chat_params_init_gemma4()` in `common/chat.cpp` sets `data.supports_thinking = true` but never sets `data.thinking_start_tag` or `data.thinking_end_tag`.
The reasoning budget code in `server-common.cpp` has a guard that checks if `thinking_end_tag` is empty—if it is, the entire budget block is skipped. Gemma 4's thinking markers (`<|channel>thought\n` and `<channel|>`) are handled by the PEG grammar but never exposed to the budget system.
```python
# The patch applied by _patch_gemma4_thinking_tags():

old = 'data.supports_thinking = true;\n\n    data.preserved_tokens = {'
new = (
    'data.supports_thinking = true;\n'
    '    data.thinking_start_tag = "<|channel>thought\\n";\n'
    '    data.thinking_end_tag = "<channel|>";\n'
    '\n    data.preserved_tokens = {'
)

# Applied to common/chat.cpp after git clone, before cmake
```
Verified results after patch
| thinking_budget_tokens | Reasoning chars | Tokens | Behavior |
|---|---|---|---|
| 0 | 44 | 26 | Only budget transition message |
| 32 | 124 | 59 | Brief thinking + budget cutoff |
| 128 | 359 | 155 | Moderate thinking + budget cutoff |
| unlimited | 1004 | 522 | Full unconstrained thinking |
Upgrade consideration: When bumping LLAMA_CPP_TAG, verify this patch still applies cleanly. If upstream fixes the Gemma 4 parser to include thinking tags, the patch can be removed.
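One hedged way to automate that check inside a helper like `_patch_gemma4_thinking_tags()` — the real helper's body may differ; the point is to fail loudly when the anchor string drifts, and to no-op if upstream already sets the tags:

```python
from pathlib import Path


def apply_thinking_tag_patch(chat_cpp: Path, old: str, new: str) -> None:
    """Patch common/chat.cpp, or detect that the patch is obsolete/unapplicable."""
    src = chat_cpp.read_text()
    if "thinking_start_tag" in src and old not in src:
        return  # already patched, or upstream fixed the parser: nothing to do
    if old not in src:
        raise RuntimeError(f"patch anchor not found in {chat_cpp}; rebase the diff")
    chat_cpp.write_text(src.replace(old, new, 1))
```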
Xet storage downloads
5-10x faster model downloads via Hugging Face's Xet storage backend.
The problem: Model files total ~19 GB (17.1 GB GGUF + 2 GB mmproj). With regular HTTP download via huggingface-hub, this takes 5-10 minutes during image build.
The solution: Unsloth's published GGUF repo uses HF's Xet storage backend. Installing the hf_xet extra enables chunked transport that can be 5-10x faster—the same 19 GB downloads in ~1-2 minutes.
```python
# In your Modal image definition:
.pip_install("huggingface-hub[hf_xet]==0.30.2")
.env({"HF_XET_HIGH_PERFORMANCE": "1"})

# Without hf_xet, you'll see this warning:
# "Xet Storage is enabled for this repo, but the 'hf_xet'
#  package is not installed. Falling back to regular HTTP download."
```
| Configuration | Download time (~19 GB) | Notes |
|---|---|---|
| Regular HTTP | 5-10 min | Default huggingface-hub behavior |
| Xet Storage + HF_XET_HIGH_PERFORMANCE | ~1-2 min | Chunked transport via hf_xet |
HF_TOKEN requirement: Even though the model is public (Apache 2.0), hf_hub_download benefits from authentication—it avoids rate limits and is required for Xet storage. Store as a Modal secret: modal.Secret.from_name("huggingface-secret").
Offline serving: Set HF_HUB_OFFLINE=1 in serving containers after the first successful sync to prevent accidental re-downloads at runtime.
Pipeline steps
Copy-paste sequence from empty workspace to a health-checked endpoint.
Step 1: CLI, secrets, and workspace
Install the Modal CLI, authenticate, and create secrets for Hugging Face downloads and your inference API key. Nothing here belongs in source code or Docker layers.
- `modal setup` links your Modal account and configures credentials locally.
- `huggingface-secret` carries `HF_TOKEN` for `huggingface_hub` snapshot downloads.
- `gemma4-api-key` (or similar) becomes `API_KEY` for authenticating clients against your server.
- Keep `deploy.py` and related assets in a dedicated directory so paths in Modal functions stay predictable.
```bash
pip install "modal[huggingface]" huggingface-hub
modal setup

modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
modal secret create gemma4-api-key API_KEY=your-production-key
```
Step 2: Builder image: compile llama-server
Compile `llama-server` from a pinned `llama.cpp` commit so production and experiments share one reproducible binary. The builder image holds compilers and headers; only the binary is copied into the lean runtime image and volume.
- Pin `LLAMA_CPP_TAG` to a specific git ref; upgrades are deliberate, not accidental drift.
- If you rely on Gemma-specific HTTP features (for example adaptive `enable_thinking`), confirm they exist at that ref or carry a small patch (often `common/chat.cpp`).
- Write the resulting `llama-server` binary into the Modal Volume so every replica mounts the same artifact.
- Treat this step as infrequent: cache success and only rebuild when you change the tag or patch set.
```bash
modal run deploy.py::build_llama_server
# or: modal run deploy.py::build (if your script bundles compile + install)
```
Step 3: Download GGUF + mmproj into the volume
Pull `gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` and `mmproj-F32.gguf` into `/models` at image build time. `huggingface_hub[hf_xet]` speeds large artifact transfer via chunked transport.
- Idempotent scripts should skip work when shard fingerprints or marker files already exist.
- Commit the build volume after compile so subsequent containers can mount `llama-server` and shared libraries from `/llama-build/<tag>`.
- Set `HF_HUB_OFFLINE=1` in serving containers after the first successful sync to prevent accidental re-downloads.
```bash
modal run deploy.py::download_model
```
Step 4: Optional: verify volume consistency
Fail fast before you expose users: confirm both weight files exist, sizes look sane, and the compiled binary is present. This mirrors production-grade GLM-style pipelines that gate deploy on structural checks.
- Check GGUF + mmproj paths match what `_build_llama_cmd()` passes to `--model` and `--mmproj`.
- Optionally hash a small header or read GGUF metadata to catch truncated downloads.
- If verification fails, fix data before paying GPU time on a broken tree.
```bash
modal run deploy.py::verify_model  # name may vary in your repo
```
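A structural check along those lines. GGUF files begin with the ASCII magic bytes `GGUF`, so a four-byte read plus a size floor catches most truncated downloads; the threshold value here is an assumption, not a spec constant:

```python
from pathlib import Path


def check_gguf_header(path: Path, min_bytes: int = 1_000_000_000) -> None:
    """Cheap pre-deploy gate: size floor + GGUF magic bytes."""
    size = path.stat().st_size
    if size < min_bytes:
        raise ValueError(f"{path}: only {size} bytes; looks truncated")
    with path.open("rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError(f"{path}: missing GGUF magic bytes")
```

Run it against both the main GGUF and the mmproj file before paying GPU time.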
Step 5: Deploy, warm, snapshot
Run `modal deploy` so Modal schedules the service. On first container start the subprocess loads weights into VRAM, serves a short synthetic warmup, then Modal’s `@modal.enter(snap=True)` path freezes process state. Restores after idle should feel closer to “warm” than “reload 17 GB from disk each time.”
- `--no-mmap` must be set before snapshot capture: mmap-backed weight loading has been unstable across restore boundaries.
- Warmup primes caches and exercises code paths you care about (tokenizer, attention, optional vision tower if used).
- Expect snapshot creation to be a meaningful one-time cost; amortize it across many future idle→active transitions.
- `modal deploy deploy.py` publishes the app; watch logs until `/health` passes on the issued URL.
```bash
modal deploy deploy.py
modal app logs <your-app-name>

curl -fsS https://<app>.modal.run/health
```
Step 6: Smoke test the API surface
Exercise chat completions, and—if you expose it—`/v1/responses` for Cursor-shaped calls. Confirm auth rejects missing keys and Prometheus metrics scrape if enabled.
- Send a minimal `POST /v1/chat/completions` with your `API_KEY` header or bearer token, matching how llama-server expects auth.
- If you bridge Responses API, send a representative payload and confirm field mapping to chat completions.
- Optionally `GET /metrics` and confirm token counters move under load.
```bash
curl -fsS https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"ping"}]}'
```
```bash
# Typical end-to-end script; align function names with your deploy.py

pip install modal "huggingface-hub[hf_xet]"
modal setup

modal secret create huggingface-secret HF_TOKEN=$HF_TOKEN
modal secret create gemma4-api-key API_KEY=$API_KEY

modal run deploy.py::build_llama_server
modal run deploy.py::download_model
# modal run deploy.py::verify_model  # if implemented

modal deploy deploy.py
curl -fsS https://<app>.modal.run/health
```
Snapshots and the --no-mmap contract
What gets frozen, why file-backed mappings break restore, and how to prove the contract holds.
What snapshotting is doing in this stack
Checkpoint/Restore In Userspace serializes a live process tree so a new host can reconstruct it without replaying your full startup script. For inference, that means: threads, register state, virtual memory areas (VMAs), open file descriptors, credentials, and relevant network state. The goal is to hand the next container an image that already has weights resident in the form the runtime expects—not to re-execute disk I/O and GPU initialization from scratch.
Modal's @modal.enter(snap=True) path aligns with memory snapshot semantics: you finish warm work, then seal state. Snapshotting cannot magically fix inconsistent mappings: if restore cannot faithfully recreate how virtual memory points at files or devices, you get subtle corruption, immediate crash, or worse—silent wrong answers.
Why mmap fights checkpoint/restore
Default weight loading often uses memory-mapped files: pages are demand-paged from the GGUF on disk, shared with the page cache, and may be shared read-only across processes. That is efficient for normal servers, but the mapping is a contract between three parties: virtual address, file identity (path, inode, offsets), and kernel bookkeeping. After checkpoint, a new container may mount the same Volume path but still differ in ways that break the mapping contract: inode generation, bind-mount layout, device minor numbers, or timing of lazy faults. Restore metadata must match precisely; when anything drifts, restore either fails or maps the wrong bytes into the address space.
GPU memory is a separate concern: snapshots capture what the Modal stack and platform integration support. Treat GPU state + driver version + binary ABI as part of your release matrix. If you change CUDA driver assumptions or rebuild llama-server without forcing a fresh snapshot cycle, expect restore-time failures that look like “random CUDA errors.”
What --no-mmap buys you
In llama.cpp, --no-mmap forces tensors to be read into ordinary allocated memory rather than file-backed mappings (see the server README). Initial load can be slower and use more RSS during the read, but the steady-state memory image is far closer to “a big blob of anonymous pages + known file descriptors” that checkpoint tooling can round-trip. You are trading optimal Linux page-cache sharing for restore determinism—which is the right trade for snapshot-first scale-to-zero.
Operational guardrails
- Centralize command construction in one function (`_build_llama_cmd()`) and unit-test that the argv always contains `--no-mmap`.
- Log the final argv at info level on startup (redact secrets) so production incidents show whether a bad deploy dropped the flag.
- After any change to Volume mount paths or filenames, run one full idle→restore cycle before promoting: snapshot bugs often surface only on second boot.
Related Modal pattern: SGLang with snapshots (different engine, same snapshot idea).
| Symptom after idle / restore | Likely cause |
|---|---|
| Restore fails immediately or llama-server exits on first request | Missing `--no-mmap`, binary/GPU driver mismatch vs snapshot, or weight path not identical across boots. |
| Nonsense outputs or sudden NaNs after restore | Rare but catastrophic: treat as memory image corruption—rebuild snapshot with pinned binary and verified weights. |
| Long pause but eventual success | Cold path still doing heavy work; confirm you are hitting restore (Modal logs) vs full re-init. |
Snapshot size and RAM tradeoffs
Anonymous mappings still occupy space in the checkpoint image. A full GPU-resident model state can make snapshots large; that is expected. The economic win is not “tiny images”—it is amortizing minutes of load work across many wakeups. If snapshot creation time or storage cost becomes painful, tune upstream (fewer warm steps, slimmer warmup) in staging before touching production—see Runtime tuning for runtime flags.
```python
# _build_llama_cmd() must always pass --no-mmap (assert in tests):

def _build_llama_cmd() -> list[str]:
    cmd = [
        "/app/llama-server-bin/llama-server",
        "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
        "--mmproj", "/models/mmproj-F32.gguf",
        "--no-mmap",
        # ... host, port, ctx, parallel, cache types, etc.
    ]
    assert "--no-mmap" in cmd
    return cmd
```
Server lifecycle in code
Subprocess boundaries, logging, health, warmup content, teardown, and where the snapshot commits.
Why a subprocess (not in-process load)
Run llama-server as a child process so the Python Modal class stays a thin supervisor: it can stream logs, enforce timeouts on readiness, capture exit codes, and optionally restart without losing the whole container on a single CUDA fault. If the server segfaults, your wrapper can log, emit metrics, and exit cleanly so the platform replaces the worker rather than wedging a shared library in the same address space as Modal’s runtime.
Logging without deadlocks
If you attach PIPE to stdout/stderr, you must continuously drain those pipes from another thread or async reader. Undrained pipes can block the child once the kernel buffer fills, which looks like a mysterious hang right after startup. Prefer line-buffered reads and forward each line to Modal’s logging with a prefix (e.g. [llama]). Alternatively, let the server log to files on the Volume if you truly need retention—but for most cases, streaming to Modal logs is enough.
Readiness: poll /health, do not sleep
Replace time.sleep(120) with a loop that GETs http://127.0.0.1:<port>/health (or your bound address) with exponential backoff and a hard cap. Log every N failures so stalled boots are visible. The health endpoint should flip to success only when the model can actually accept completions—not merely when the HTTP socket listens.
Warmup: what to send before snap=True seals
Warmup should exercise the code paths you care about in production. At minimum: one short chat completion that touches tokenizer + attention + sampling. If you use multimodal weights, include a tiny image payload so vision towers and mmproj paths are initialized. If you rely on adaptive thinking, send one request with chat_template_kwargs.enable_thinking so the template and response shape you need are hot. Keep warmup token counts modest—enough to prime, not enough to dominate snapshot time.
Where the snapshot commits
@modal.enter(snap=True) means: when this method returns successfully, Modal may treat the resulting process state as the template for future containers. Therefore do not return until health is green and warmup finished. If you return early, you snapshot a half-ready server and pay debugging cost forever. If you need background work after serving starts, schedule it only after you are sure it should be part of the golden image—or accept a second snapshot cycle.
Shutdown and replacement
On container teardown, send SIGTERM to the child, wait with a timeout, then SIGKILL if needed. Unclosed GPU contexts can delay exit; log wall-clock time for shutdown. For long-running deployments, pair this with the operational notes in Operate & compare for recycle and diagnostics.
```python
import subprocess
import threading
import time
import urllib.request

@modal.enter(snap=True)
def setup(self):
    self.process = subprocess.Popen(
        _build_llama_cmd(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered where possible
    )

    def _pump_logs():
        assert self.process.stdout is not None
        for line in self.process.stdout:
            print(f"[llama-server] {line.rstrip()}")

    threading.Thread(target=_pump_logs, daemon=True).start()

    base = "http://127.0.0.1:8080"
    _wait_for_health(f"{base}/health", timeout_s=600)
    _warmup_chat_completions(base)  # short prompts; optional image / thinking probes
    # Successful return => snapshot boundary for new workers

def _wait_for_health(url: str, timeout_s: int) -> None:
    deadline = time.time() + timeout_s
    delay = 0.5
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as r:
                if r.status == 200:
                    return
        except OSError:
            pass
        time.sleep(delay)
        delay = min(delay * 1.5, 5.0)
    raise RuntimeError(f"health never ready: {url}")
```
```python
import signal
import subprocess

@modal.exit()
def teardown(self):
    if getattr(self, "process", None) and self.process.poll() is None:
        self.process.send_signal(signal.SIGTERM)
        try:
            self.process.wait(timeout=30)
        except subprocess.TimeoutExpired:
            self.process.kill()
```
Validation checklist
Phased checks from static artifacts through auth, latency, and correctness—before you share the URL.
Phase A — Volume and process contract
- On the build host or a one-off `modal run`, print the resolved paths passed to `--model`, `--mmproj`, and the `llama-server` binary. Compare against `ls -la` on the mounted Volume.
- Confirm file sizes are in the expected ballpark (multi-GB GGUF, mmproj on the order of gigabytes). Truncated downloads often pass existence checks but fail at first load.
- Grep runtime logs for the argv or startup banner so `--no-mmap` appears exactly once and is not overridden by a wrapper script.
Phase B — Authentication and negative tests
- Call `/v1/chat/completions` without credentials and expect 401/403 (whatever your server is configured to return). Misconfigured auth often ships as “open by accident.”
- Repeat with an intentionally wrong key. Then with the production key. Document the header shape (`Authorization: Bearer …`) in your runbook so client teams do not guess.
Phase C — Latency: warm path vs idle wake
- Measure time-to-first-token (TTFT) on a warm container (traffic just happened). Then scale to zero, wait until the platform confirms idle, send one request, and measure TTFT again. The second number is what intermittent users feel—optimize that, not only steady-state.
- Log Modal’s transition timestamps if available; correlate with client-side stopwatch to separate queueing from server work.
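The client-side stopwatch reduces to timing the first non-empty streamed delta; a transport-agnostic sketch (the caller feeds it whatever iterator of content chunks its HTTP client produces):

```python
import time
from typing import Iterable


def time_to_first_token(deltas: Iterable[str]) -> float:
    """Seconds from request start until the first non-empty streamed delta."""
    start = time.monotonic()
    for delta in deltas:
        if delta:  # ignore empty keep-alive / role-only chunks
            return time.monotonic() - start
    raise RuntimeError("stream ended without producing content")
```

Run it once against a warm container and once after a confirmed idle period; the gap is the restore cost your intermittent users feel.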
Phase D — Functional and multimodal correctness
- Text: short deterministic prompt (“Return the word OK only”) to catch garbled output or template failures.
- Multimodal: if `--mmproj` is enabled, send a minimal valid image payload and assert a structured response—vision failures should not be discovered in production traffic.
- Adaptive thinking: one request with template kwargs enabling thinking; confirm your client can parse both answer and reasoning fields if your product depends on them. Details live under APIs & clients.
| Signal | Where to look |
|---|---|
| Request volume and error rate | Server logs + HTTP status histogram; spike in 5xx after idle points to restore or OOM. |
| Tokens/sec and queue depth | If Prometheus is enabled on llama-server, scrape /metrics after validation load. |
| GPU memory headroom | nvidia-smi in one-off debug container or platform metrics; OOM during warmup must be caught before snapshot. |
```bash
BASE="https://<app>.modal.run"
KEY="$API_KEY"

# Negative test (expect failure)
curl -sS -o /dev/null -w "%{http_code}\n" "$BASE/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"hi"}]}'

# Positive test
curl -sS "$BASE/v1/chat/completions" \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4","messages":[{"role":"user","content":"Say OK."}],"max_tokens":16}'
```
Upgrading llama.cpp
Pin tags, rebase patches, rebuild binaries, refresh snapshots, and roll back with confidence.
Why upgrades are a coordinated release
Bumping llama-server changes HTTP behavior, tokenizer handling, CUDA kernels, and sometimes GGUF expectations. Your Volume still holds the same GGUF bytes, but the binary interpreting them changed. Treat every bump of LLAMA_CPP_TAG as a mini release: compile, run the full validation matrix, then redeploy—never only swap the binary in place on a live snapshot without a new capture cycle.
Recommended sequence
- Read upstream release notes or compare commits between your old and new tag; search for breaking changes in `server`, `ggml`, and CUDA backends.
- Rebase local patches (for example Gemma chat-template / thinking-related edits in `common/chat.cpp`). Conflicts here are common—resolve in a branch, not on the production builder.
- Run `build_llama_server` (or your equivalent) against the new tag; write the new binary to a staging path or staging Volume first.
- Boot the server against the same GGUF and mmproj you use in prod; run Phase A–D validation from the checklist above.
- Deploy to staging Modal app, force at least one idle period, and confirm restore still works. Snapshot incompatibilities often appear only after the second boot.
- Promote: update production Volume path or swap symlink, redeploy, and let a fresh `snap=True` cycle capture the new golden image. Keep the previous binary on disk until you are confident—rollback is then a pointer swap.
CUDA / driver coupling
If Modal’s base image or GPU driver revision moves in parallel with your upgrade, test together. A new llama-server built against one CUDA toolkit and run under another is a frequent source of “works on my builder” failures. Pin builder and runtime image digests in version control when possible.
Rollback
Keep the previous LLAMA_CPP_TAG and matching binary artifact addressable (git tag, Volume path, or object name). Rollback is: redeploy known-good binary, run health + one chat completion, trigger a new snapshot. Do not assume an old snapshot image remains compatible after you have upgraded drivers or changed GPU type.
| Upgrade risk | Mitigation |
|---|---|
| Patch no longer applies cleanly | Cherry-pick upstream fixes first; reduce custom diff to the minimum you truly need. |
| New server rejects old API fields | Diff OpenAPI or server README between tags; run contract tests against /v1/chat/completions. |
| Restore works but quality regressed | Separate infra validation from model QA—run your eval harness on the new binary before promotion. |
References
Official docs for Modal snapshots, llama.cpp server, and model artifacts.
Same URLs as the inline links above; open in a new tab for full context.
- Modal: Serve and scale (modal.com)
- Modal Volumes (modal.com)
- Modal: Secrets (modal.com)
- Modal: Serve very large models (modal.com)
- Modal: High-performance LLM inference (modal.com)
- Modal: SGLang with snapshots, related pattern (modal.com)
- Modal memory snapshots (modal.com)
- llama.cpp server README (github.com)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)