Gemma-4-26B-A4B-it-GGUF on Modal
Operate & compare
Runbooks, known issues, monitoring, upgrades, and when this stack is the right fit.
Complete operational guide covering every issue encountered during deployment, root cause analysis, fixes, monitoring, upgrade procedures, and the 20 key lessons learned from production. Cross-check infra behavior with Modal memory snapshots and llama.cpp release notes when upgrading.
Complete issue runbook
Every issue encountered during deployment with root cause and fix.
| # | Symptom | Root cause | Fix |
|---|---|---|---|
| 1 | unknown model architecture: 'gemma4' | GHCR Docker image pinned to build 8202; Gemma 4 support added in b8665 | Build llama-server from source at b8678+, cache in Modal Volume |
| 2 | Thinking tokens not visible (~0 thinking tokens) | llama-cpp-python's create_chat_completion didn't pass enable_thinking to template | Switched to llama-server subprocess which handles this natively |
| 3 | 401 Unauthorized during local testing | GEMMA4_API_KEY env var not set in local shell | Source .env file before running scripts: source .env |
| 4 | --flash-attn flag syntax error | Newer builds require explicit value, not bare flag | Changed from --flash-attn to --flash-attn on |
| 5 | AsyncUsageWarning in local entrypoint | Async function calling sync Modal method | Use .aio() variants in async functions |
| 6 | Build from source showing old build number | Modal image layer caching returned stale binary | Move build to Modal Volume (decouples from image cache), or use --force |
| 7 | Missing hf_xet for fast downloads | huggingface-hub installed without hf_xet extra | Install huggingface-hub[hf_xet] + set HF_XET_HIGH_PERFORMANCE=1 |
| 8 | thinking_budget_tokens silently ignored | Gemma 4 parser (PR #21418) omits thinking_start_tag/thinking_end_tag that budget sampler needs | Patch common/chat.cpp to add <|channel>thought\n and <channel|> before compilation |
| 9 | Cursor IDE Agent mode tools not working | b8678's /v1/responses had incomplete tool-calling round-trips | Update LLAMA_CPP_TAG to master, rebuild, redeploy |
Issue #8 is critical: Without the thinking-tag patch, budget controls are silently ignored for Gemma 4 (Gemma 4 parser PR). The budget block in server-common.cpp is skipped entirely because thinking_end_tag is empty. Always verify budget behavior after builds.
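Because this failure mode is silent, it is worth an automated check after every rebuild. Below is a minimal sketch, assuming the server accepts a thinking_budget_tokens field in the request body (the parameter discussed above); GEMMA4_BASE_URL is an illustrative environment variable, and the prompt and thresholds are placeholders.

```python
# Hedged post-build check: verify that thinking_budget_tokens actually changes output length.
# Assumes the server accepts thinking_budget_tokens in the request body; GEMMA4_BASE_URL is
# an illustrative env var, GEMMA4_API_KEY matches the one used elsewhere in this guide.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["GEMMA4_BASE_URL"],  # e.g. https://<app>.modal.run/v1
    api_key=os.environ["GEMMA4_API_KEY"],
)

def completion_tokens(budget: int) -> int:
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
        max_tokens=2048,
        extra_body={"thinking_budget_tokens": budget},
    )
    return resp.usage.completion_tokens

low, high = completion_tokens(32), completion_tokens(1024)
# If the thinking-tag patch is missing, the budget is ignored and both runs look similar.
print(f"budget=32 -> {low} tokens, budget=1024 -> {high} tokens")
assert low < high, "thinking_budget_tokens appears to be ignored; check the chat.cpp patch"
```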
Architecture evolution issues
Why the deployment went through three phases before the final design.
| Phase | Approach | Why it failed / limitations |
|---|---|---|
| Phase 1 | GHCR prebuilt image (ghcr.io/ggml-org/llama.cpp:server-cuda) | Pinned to build 8202, predates Gemma 4 support (b8665) |
| Phase 2 | llama-cpp-python in-process inference | No multimodal, no native tools, thinking tokens invisible, proxy overhead |
| Phase 3 (Final) | llama-server subprocess, build from source, Modal Volume cache | All features work: vision, tools, thinking, /v1/responses, zero-proxy |
The llama-server binary problem
There is no reliable source of a "latest stable" CUDA-enabled llama-server binary for Linux. The options:
- GHCR Docker image: Often outdated (was pinned to 8202 when we needed 8665+)
- GitHub Releases: No Linux CUDA binaries for most builds
- PyPI (llama-cpp-python): Missing vision, tools, thinking
- Build from source: The only reliable option; cache the binary in a Modal Volume
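A minimal sketch of the build-and-cache approach follows. The app and volume names, base image, tag, and cmake flags are illustrative and differ from the real deploy.py; the shape of the idea is what matters: compile once, write the binary into a Volume, reuse it on every deploy.

```python
# Sketch: compile llama-server once per tag and cache the binary in a Modal Volume.
# App/volume names, base image, tag, and cmake flags are illustrative, not the real deploy.py.
import modal

app = modal.App("llama-server-builder")
cache = modal.Volume.from_name("llama-cpp-cache", create_if_missing=True)

build_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential", "libcurl4-openssl-dev")
)

@app.function(image=build_image, volumes={"/cache": cache}, timeout=3600)
def build_llama_server(tag: str = "master", force: bool = False) -> str:
    import pathlib
    import subprocess

    out = pathlib.Path(f"/cache/llama-server-{tag}")
    if out.exists() and not force:
        return str(out)  # reuse the cached binary instead of recompiling

    subprocess.run(["git", "clone", "--depth", "1", "--branch", tag,
                    "https://github.com/ggml-org/llama.cpp", "/src"], check=True)
    subprocess.run(["cmake", "-B", "/src/build", "-S", "/src", "-DGGML_CUDA=ON"], check=True)
    subprocess.run(["cmake", "--build", "/src/build", "--target", "llama-server", "-j"], check=True)

    out.write_bytes(pathlib.Path("/src/build/bin/llama-server").read_bytes())
    out.chmod(0o755)
    cache.commit()  # persist so later deploys skip the compile entirely
    return str(out)
```

Keeping the binary in a Volume rather than an image layer is what sidesteps the stale-cache behavior described in issue #6 above.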
Tool-calling operations
| Mode | Observed behavior | Root cause | Defensive pattern |
|---|---|---|---|
| tool_choice=auto | Generally reliable | Default server/model path | Preferred mode for production agents. |
| tool_choice=required | Can emit empty tool payload | Grammar activates but output can't be parsed | Validate tool_calls server-side and retry with stricter system message. |
| Specific named function | Falls back to auto silently | Server parses tool_choice as string only; object form ignored | Send only one tool in tools array to force it. |
| tool_choice=none | May leak raw tokens | Parser suppresses extraction but not generation | Reject response if tool_calls present and reissue. |
Keep tool execution idempotent and log every retry chain with request IDs so operator debugging is deterministic. Cross-check client expectations with the OpenAI tool calling contract.
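A hedged sketch of the defensive pattern for tool_choice=required: validate the returned tool_calls on your side and retry once with a stricter system message when the payload is empty or unparsable. The tool schema, messages, and retry policy are illustrative.

```python
# Sketch: server-side validation of tool_calls with one stricter retry.
# Tool schema, system messages, and retry policy are illustrative.
import json

from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def call_with_validation(user_msg: str, request_id: str):
    system = "You are a helpful assistant."
    for attempt in range(2):
        resp = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user_msg}],
            tools=TOOLS,
            tool_choice="required",
        )
        calls = resp.choices[0].message.tool_calls or []
        try:
            # Empty payloads or unparsable arguments count as failures.
            if calls and all(json.loads(c.function.arguments) for c in calls):
                return calls
        except json.JSONDecodeError:
            pass
        print(f"[{request_id}] attempt {attempt}: invalid tool_calls, retrying")
        system += " You MUST respond with exactly one well-formed tool call."
    raise RuntimeError(f"[{request_id}] no valid tool call after retries")
```

Logging the request ID on every retry, as above, keeps the retry chain reconstructible during incident review.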
Error handling
Common errors and how to handle them in client code.
| Status | Cause | Fix |
|---|---|---|
| 401 | Missing or invalid API key | Set Authorization: Bearer <key> header |
| 408 | Request timeout (slow generation) | Increase client timeout; reduce max_tokens |
| 503 | Server starting up (cold start) | Retry after 5-15 seconds |
| Empty content | Model spent all tokens on thinking | Increase max_tokens; disable thinking for simple queries |
```python
import time

from openai import OpenAI, APITimeoutError

client = OpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    timeout=120.0,  # generous timeout for cold starts
)

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=64,
        )
        break
    except APITimeoutError:
        if attempt < 2:
            time.sleep(5)
            continue
        raise
```
Monitoring and diagnostics
- Track warm-path and post-idle restore TTFT separately—these are different user experiences.
- Scrape /metrics for token counters, queue behavior, and error spikes (Prometheus concepts).
- Monitor GPU memory headroom to catch OOM before snapshot capture.
- Alert on repeated restore-health failures after idle windows.
- Log the final argv at info level (redact secrets) so incidents show whether flags were correct.
| Signal | Where to look |
|---|---|
| Request volume and error rate | Server logs + HTTP status histogram; spike in 5xx after idle points to restore or OOM |
| Tokens/sec and queue depth | Scrape /metrics for token counters |
| GPU memory headroom | nvidia-smi in debug container or platform metrics |
| Restore health failures | Modal logs after idle periods |
```bash
# Health check
curl -f https://<app>.modal.run/health

# Prometheus metrics
curl https://<app>.modal.run/metrics

# Modal logs
modal app logs <app-name>

# List available models (verify deployment)
curl https://<app>.modal.run/v1/models \
  -H "Authorization: Bearer $API_KEY"
```
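The curl call above dumps the raw Prometheus exposition; for quick ad-hoc diagnostics the same endpoint can be filtered programmatically. A minimal sketch follows; the metric names in WATCHED are placeholders, so check your build's actual /metrics output before wiring this into a dashboard.

```python
# Sketch: scrape /metrics and print a few counters for ad-hoc diagnostics.
# Metric names are placeholders; inspect your build's /metrics output for the real ones.
import urllib.request

METRICS_URL = "https://<app>.modal.run/metrics"
WATCHED = ("prompt_tokens_total", "tokens_predicted_total", "requests_deferred")  # illustrative

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    text = resp.read().decode()

for line in text.splitlines():
    if line.startswith("#"):
        continue  # skip HELP/TYPE comment lines
    name, _, value = line.partition(" ")
    if any(watched in name for watched in WATCHED):
        print(f"{name} = {value}")
```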
Upgrade and re-snapshot playbook
Why upgrades are a coordinated release: Bumping llama-server changes HTTP behavior, tokenizer handling, CUDA kernels, and sometimes GGUF expectations. Treat every bump as a mini release: compile, run full validation, then redeploy.
- Read upstream release notes; search for breaking changes in server, ggml, and CUDA backends
- Rebase local patches (e.g., Gemma thinking tags in common/chat.cpp)
- Run modal run deploy.py::build_llama_server --force against the new tag
- Boot the server against the same GGUF and mmproj; run full validation
- Deploy to a staging Modal app, force an idle period, confirm restore works
- Promote: update production, redeploy, let a new snap=True cycle capture the golden image
- Keep the previous binary addressable for rollback
| Upgrade risk | Mitigation |
|---|---|
| Patch no longer applies cleanly | Cherry-pick upstream fixes first; reduce custom diff to minimum |
| New server rejects old API fields | Diff OpenAPI or README between tags; run contract tests |
| Restore works but quality regressed | Separate infra validation from model QA—run eval harness before promotion |
| CUDA/driver coupling | Test new binary with same driver version; pin builder/runtime image digests |
```bash
# Step 1: Rebuild with new tag
modal run deploy.py::build_llama_server --force

# Step 2: Redeploy
modal deploy deploy.py

# Step 3: Verify
modal app logs <app-name>
curl -fsS https://<app>.modal.run/health

# Step 4: Smoke test
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4-26b-a4b","messages":[{"role":"user","content":"Hello"}],"max_tokens":16}'
```
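Beyond the single smoke test in step 4, a small contract test against the staging app helps catch the "new server rejects old API fields" risk before promotion. A hedged sketch; the fields exercised are the ones this deployment relies on, and STAGING_BASE_URL is an illustrative environment variable.

```python
# Sketch: minimal contract test to run against the staging app before promotion.
# Verifies that the request fields this deployment depends on are still accepted.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["STAGING_BASE_URL"],  # illustrative staging endpoint
    api_key=os.environ["GEMMA4_API_KEY"],
)

def test_chat_completion_contract():
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "Say OK."}],
        max_tokens=8,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # field clients rely on
    )
    assert resp.choices[0].message.content is not None
    assert resp.usage is not None and resp.usage.completion_tokens > 0

if __name__ == "__main__":
    test_chat_completion_contract()
    print("contract test passed")
```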
Key takeaways
20 lessons learned from deploying Gemma 4 on Modal.
On model deployment
- Always verify the inference engine version matches your model. llama.cpp adds architecture support in specific builds.
- Build caching is essential. Compiling from source takes ~3 min. Cache in a persistent Modal Volume.
- Prefer the native server over Python bindings. llama-server gets features before llama-cpp-python.
- Zero-proxy is faster. Modal's @modal.web_server routes directly to llama-server.
On Gemma 4 specifically
- Thinking is adaptive. The model may not think for easy problems (Google AI thinking doc). Don't assume thinking will always happen.
- Use the interleaved template for agentic tasks. It preserves reasoning between tool calls.
- enable_thinking is per-request via chat_template_kwargs. Toggle without restarting the server (see the sketch after this list).
- MoE = fast decode. Despite 26B params, only 3.8B fire per token. Throughput is closer to a 4B model.
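For the enable_thinking point above, a minimal sketch of per-request toggling, assuming llama-server forwards chat_template_kwargs from the request body to the chat template as this deployment relies on; the prompts are illustrative.

```python
# Sketch: toggle Gemma 4 thinking per request via chat_template_kwargs.
# Assumes llama-server forwards this field to the chat template, as used in this deployment.
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

def ask(prompt: str, thinking: bool) -> str:
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
    return resp.choices[0].message.content

print(ask("What is 2 + 2?", thinking=False))                     # simple query: skip thinking
print(ask("Plan a 3-step debugging strategy.", thinking=True))   # harder query: allow thinking
```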
On Modal
- GPU memory snapshots are transformative. 60-120s cold starts → 5-15s (Modal snapshot guide).
- --no-mmap for CRIU compatibility. Memory-mapped files break checkpoint/restore (llama-server README).
- Warmup before snapshot. CUDA kernel JIT happens on first inference. Capture compiled state (see the sketch after this list).
- Volumes persist across deploys. Cache build artifacts and weights.
- Two-image strategy saves cost. Devel image for building, runtime image for serving.
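A minimal sketch tying the snapshot points together, assuming a Modal class-based server. Class, volume, and GPU names, paths, and the warmup prompt are illustrative and differ from the real deploy.py; enabling GPU memory capture itself needs the additional option described in the Modal snapshot guide, which is omitted here.

```python
# Sketch: start llama-server and warm it up inside the snapshotted entrypoint, so the
# captured state already includes compiled CUDA kernels. All names and paths are illustrative.
import json
import subprocess
import time
import urllib.request

import modal

app = modal.App("gemma4-server-sketch")
weights = modal.Volume.from_name("gemma4-weights")

@app.cls(gpu="L40S", enable_memory_snapshot=True, volumes={"/weights": weights})
class Server:
    @modal.enter(snap=True)  # everything done here is baked into the snapshot
    def start_and_warm(self):
        self.proc = subprocess.Popen([
            "/weights/llama-server", "-m", "/weights/model.gguf",
            "--no-mmap",  # memory-mapped weights break checkpoint/restore
            "--host", "0.0.0.0", "--port", "8080",
        ])
        while True:  # wait for /health before sending the warmup request
            try:
                urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2)
                break
            except OSError:
                time.sleep(1)
        body = json.dumps({
            "model": "gemma-4-26b-a4b",
            "messages": [{"role": "user", "content": "warmup"}],
            "max_tokens": 8,
        }).encode()
        req = urllib.request.Request(
            "http://127.0.0.1:8080/v1/chat/completions",
            data=body, headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=120)  # first inference triggers CUDA kernel JIT

    @modal.web_server(8080)
    def serve(self):
        pass  # requests go straight to llama-server (zero-proxy)
```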
On debugging
- Silent failures are the worst kind. thinking_budget_tokens was silently ignored. Write tests that verify parameters have effects.
- Read the full error message. Most deployment errors are version mismatches.
- Source patches are a valid strategy. Document clearly so they can be removed when upstream catches up.
- Check Modal's image cache. Use --force if changes aren't taking effect.
- Test with modal run before modal deploy. Ephemeral apps for quick testing.
On IDE and client compatibility
- The Responses API is the new frontier. Cursor and Codex CLI use /v1/responses, not /v1/chat/completions (llama.cpp Responses PR); see the sketch after this list.
- Build from master for IDE compatibility. Tagged builds lag on API compatibility. The gap between b8678 and master fixed Cursor Agent mode.
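A minimal sketch of exercising /v1/responses directly, assuming the deployed build includes the Responses API support referenced above; the prompt is illustrative.

```python
# Sketch: hit /v1/responses the way Cursor and Codex CLI do, via the OpenAI Responses client.
# Assumes the deployed llama-server build includes Responses API support.
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

resp = client.responses.create(
    model="gemma-4-26b-a4b",
    input="List three uses of a Modal Volume.",
)
print(resp.output_text)
```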
Model and deployment comparison
| Stack | Primary strength | Typical trade-off |
|---|---|---|
| Gemma-4 GGUF + llama.cpp + Modal snapshots | Portable, cost-aware interactive serving, scale-to-zero | Less deterministic tool_choice semantics |
| GLM-5.1 FP8 + SGLang | High-throughput multi-GPU online serving | Higher infrastructure complexity and spend |
| Llama-3 dense + vLLM | Broad ecosystem, mature infra tooling | Less parameter efficiency vs MoE at similar quality |
| OpenAI / Anthropic APIs | Zero ops, best tool_choice compliance | Vendor lock-in, no scale-to-zero, higher per-token cost |
When to choose this stack
Good fit
- Internal APIs and IDE assistants
- Bursty workloads where scale-to-zero matters
- Interactive use cases needing fast decode
- Projects requiring Apache 2.0 licensing
- Vision + text combined workflows
- Cost-sensitive deployments
Not ideal for
- Strict OpenAI tool_choice determinism requirements
- Maximum aggregate throughput across multi-GPU fleets
- Audio input requirements (use E2B/E4B variants)
- Production systems requiring SLA guarantees
Pair this tab with the Runtime tuning tab for performance tuning and the APIs & clients tab for client contracts.
External references
Sources cited inline on this tab.
- Modal: Serve and scale (modal.com)
- Modal Volumes (modal.com)
- Modal memory snapshots (modal.com)
- Modal: High-performance LLM inference (modal.com)
- Modal Web Server (modal.com)
- llama.cpp GitHub (github.com)
- llama.cpp server README (github.com)
- llama.cpp Gemma 4 Parser (PR #21418) (github.com)
- llama.cpp Responses API (PR #18486) (github.com)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)
- Gemma 4 Thinking Capabilities (ai.google.dev)
- Prometheus: Introduction (prometheus.io)
- OpenAI API: Create chat completion (platform.openai.com)