
Gemma-4-26B-A4B-it-GGUF on Modal

Operate & compare

Runbooks, known issues, monitoring, upgrades, and when this stack is the right fit.


Complete operational guide covering every issue encountered during deployment, root cause analysis, fixes, monitoring, upgrade procedures, and the 20 key lessons learned from production. Cross-check infra behavior with Modal memory snapshots and llama.cpp release notes when upgrading.

01

Complete issue runbook

Every issue encountered during deployment with root cause and fix.

# | Symptom | Root cause | Fix
1 | unknown model architecture: 'gemma4' | GHCR Docker image pinned to build 8202; Gemma 4 support added in b8665 | Build llama-server from source at b8678+, cache in Modal Volume
2 | Thinking tokens not visible (~0 thinking tokens) | llama-cpp-python's create_chat_completion didn't pass enable_thinking to the template | Switched to llama-server subprocess, which handles this natively
3 | 401 Unauthorized during local testing | GEMMA4_API_KEY env var not set in local shell | Source the .env file before running scripts: source .env
4 | --flash-attn flag syntax error | Newer builds require an explicit value, not a bare flag | Changed --flash-attn to --flash-attn on
5 | AsyncUsageWarning in local entrypoint | Async function calling a sync Modal method | Use .aio() variants in async functions
6 | Build from source showing old build number | Modal image layer caching returned a stale binary | Move the build to a Modal Volume (decouples from image cache), or use --force
7 | Missing hf_xet for fast downloads | huggingface-hub installed without the hf_xet extra | Install huggingface-hub[hf_xet] and set HF_XET_HIGH_PERFORMANCE=1
8 | thinking_budget_tokens silently ignored | Gemma 4 parser (PR #21418) omits the thinking_start_tag/thinking_end_tag that the budget sampler needs | Patch common/chat.cpp to add <|channel>thought\n and <channel|> before compilation
9 | Cursor IDE Agent mode tools not working | b8678's /v1/responses had incomplete tool-calling round-trips | Update LLAMA_CPP_TAG to master, rebuild, redeploy

Issue #8 is critical: Without the thinking-tag patch, budget controls are silently ignored for Gemma 4 (Gemma 4 parser PR). The budget block in server-common.cpp is skipped entirely because thinking_end_tag is empty. Always verify budget behavior after builds.
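
A quick post-build check can catch this regression: compare reasoning length with and without a budget. The sketch below assumes an OpenAI-compatible client, that thinking_budget_tokens is accepted as an extra body field, and that parsed reasoning comes back as message.reasoning_content; adjust the names to match your build.

verify-thinking-budget.py

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key=os.environ["GEMMA4_API_KEY"],
)

def reasoning_length(budget: int | None) -> int:
    # Pass the budget as an extra body field (assumption: the server reads it there).
    extra = {"thinking_budget_tokens": budget} if budget is not None else {}
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
        max_tokens=512,
        extra_body=extra,
    )
    # Assumption: parsed reasoning is exposed as reasoning_content on the message.
    reasoning = getattr(resp.choices[0].message, "reasoning_content", None) or ""
    return len(reasoning.split())  # rough word count is enough for a sanity check

unbudgeted = reasoning_length(None)
budgeted = reasoning_length(64)
# If the thinking-tag patch is missing, the budget is silently ignored and both
# runs produce similar-length reasoning.
assert budgeted < unbudgeted, "thinking_budget_tokens appears to have no effect"
print(f"ok: reasoning shrank from {unbudgeted} to {budgeted} words under a 64-token budget")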

02

Architecture evolution issues

Why the deployment went through three phases before the final design.

Phase | Approach | Why it failed / limitations
Phase 1 | GHCR prebuilt image (ghcr.io/ggml-org/llama.cpp:server-cuda) | Pinned to build 8202, predates Gemma 4 support (b8665)
Phase 2 | llama-cpp-python in-process inference | No multimodal, no native tools, thinking tokens invisible, proxy overhead
Phase 3 (final) | llama-server subprocess, build from source, Modal Volume cache | All features work: vision, tools, thinking, /v1/responses, zero-proxy

The llama-server binary problem

There is no reliable source of a "latest stable" CUDA-enabled llama-server binary for Linux. The options:

  • GHCR Docker image: Often outdated (was pinned to 8202 when we needed 8665+)
  • GitHub Releases: No Linux CUDA binaries for most builds
  • PyPI (llama-cpp-python): Missing vision, tools, thinking
  • Build from source: The only reliable option—cache in a Modal Volume
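
A minimal sketch of that build-and-cache step, assuming Modal's Python SDK; the app and volume names, base image, and CMake flags are illustrative rather than the exact deploy.py contents.

build-llama-server-sketch.py

import pathlib
import shutil
import subprocess
import modal

app = modal.App("llama-server-builder")
build_volume = modal.Volume.from_name("llama-cpp-builds", create_if_missing=True)

builder_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential", "libcurl4-openssl-dev")
)

LLAMA_CPP_TAG = "master"  # or a pinned tag such as b8678

@app.function(image=builder_image, volumes={"/builds": build_volume}, cpu=8, timeout=3600)
def build_llama_server(force: bool = False) -> str:
    out = pathlib.Path(f"/builds/{LLAMA_CPP_TAG}/llama-server")
    if out.exists() and not force:
        return str(out)  # cached binary: later deploys skip the compile entirely
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", LLAMA_CPP_TAG,
         "https://github.com/ggml-org/llama.cpp", "/tmp/llama.cpp"],
        check=True,
    )
    # (apply the Gemma thinking-tag patch to common/chat.cpp here, if you use it)
    subprocess.run(
        ["cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp", "-DGGML_CUDA=ON"],
        check=True,
    )
    subprocess.run(
        ["cmake", "--build", "/tmp/llama.cpp/build", "--target", "llama-server", "-j"],
        check=True,
    )
    out.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2("/tmp/llama.cpp/build/bin/llama-server", out)
    build_volume.commit()  # persist the binary in the Volume so it survives image rebuilds
    return str(out)

Invoked the same way as in the upgrade playbook below (modal run ...::build_llama_server --force); the production entry point lives in deploy.py.
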
03

Tool-calling operations

Mode | Observed behavior | Root cause | Defensive pattern
tool_choice=auto | Generally reliable | Default server/model path | Preferred mode for production agents.
tool_choice=required | Can emit empty tool payload | Grammar activates but output can't be parsed | Validate tool_calls server-side and retry with a stricter system message.
Specific named function | Falls back to auto silently | Server parses tool_choice as string only; object form ignored | Send only one tool in the tools array to force it.
tool_choice=none | May leak raw tokens | Parser suppresses extraction but not generation | Reject the response if tool_calls are present and reissue.

Keep tool execution idempotent and log every retry chain with request IDs so operator debugging is deterministic. Cross-check client expectations with the OpenAI tool calling contract.
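
A sketch of the tool_choice=required defensive pattern, assuming a standard OpenAI client; the tool schema, retry count, and system message are illustrative.

tool-call-guard.py

import json
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def call_with_tool_guard(messages, max_retries=2):
    """Require a parseable tool call; retry with a stricter system message if absent."""
    for attempt in range(max_retries + 1):
        resp = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=messages,
            tools=TOOLS,
            tool_choice="required",
        )
        calls = resp.choices[0].message.tool_calls or []
        if calls:
            try:
                json.loads(calls[0].function.arguments)  # reject empty/unparseable payloads
                return resp
            except (TypeError, json.JSONDecodeError):
                pass
        # Tighten instructions and retry; log each attempt with a request ID for the audit trail.
        messages = [{"role": "system",
                     "content": "You must respond with exactly one get_weather tool call "
                                "containing valid JSON arguments."}] + messages
    raise RuntimeError("no valid tool call after retries")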

04

Error handling

Common errors and how to handle them in client code.

Status | Cause | Fix
401 | Missing or invalid API key | Set the Authorization: Bearer <key> header
408 | Request timeout (slow generation) | Increase client timeout; reduce max_tokens
503 | Server starting up (cold start) | Retry after 5-15 seconds
Empty content | Model spent all tokens on thinking | Increase max_tokens; disable thinking for simple queries
cold-start-handling.py
import time
from openai import OpenAI, APITimeoutError

client = OpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    timeout=120.0,  # generous timeout for cold starts
)

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=64,
        )
        break
    except APITimeoutError:
        if attempt < 2:
            time.sleep(5)
            continue
        raise
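
For the empty-content case, a common fallback is to retry with thinking disabled. A minimal sketch, assuming the server honors enable_thinking passed through chat_template_kwargs (see the Gemma 4 notes in the takeaways below):

empty-content-fallback.py

from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY", timeout=120.0)

def ask(prompt: str, max_tokens: int = 256) -> str:
    # Try with thinking on; if content comes back empty (all tokens spent on
    # thinking), retry once with thinking disabled via chat_template_kwargs.
    for enable_thinking in (True, False):
        resp = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            extra_body={"chat_template_kwargs": {"enable_thinking": enable_thinking}},
        )
        content = resp.choices[0].message.content or ""
        if content.strip():
            return content
    raise RuntimeError("empty content even with thinking disabled; raise max_tokens")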
05

Monitoring and diagnostics

  • Track warm-path and post-idle restore TTFT separately—these are different user experiences.
  • Scrape /metrics for token counters, queue behavior, and error spikes (Prometheus concepts).
  • Monitor GPU memory headroom to catch OOM before snapshot capture.
  • Alert on repeated restore-health failures after idle windows.
  • Log the final argv at info level (redact secrets) so incidents show whether flags were correct.
Signal | Where to look
Request volume and error rate | Server logs + HTTP status histogram; a spike in 5xx after idle points to restore or OOM
Tokens/sec and queue depth | Scrape /metrics for token counters
GPU memory headroom | nvidia-smi in a debug container or platform metrics
Restore health failures | Modal logs after idle periods
ops.sh
# Health check
curl -f https://<app>.modal.run/health

# Prometheus metrics
curl https://<app>.modal.run/metrics

# Modal logs
modal app logs <app-name>

# List available models (verify deployment)
curl https://<app>.modal.run/v1/models \
  -H "Authorization: Bearer $API_KEY"
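
A small sketch for the /metrics scrape; llama-server's exact metric names vary by build, so this filters by keyword rather than hardcoding them.

metrics-probe.py

import urllib.request

METRICS_URL = "https://<app>.modal.run/metrics"

def scrape(keywords=("token", "request", "queue")) -> dict[str, float]:
    text = urllib.request.urlopen(METRICS_URL, timeout=10).read().decode()
    samples = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments
        # Assumes the usual "name value" exposition format without timestamps.
        name, _, value = line.rpartition(" ")
        if any(k in name for k in keywords):
            try:
                samples[name] = float(value)
            except ValueError:
                pass
    return samples

if __name__ == "__main__":
    for name, value in sorted(scrape().items()):
        print(f"{name} = {value}")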
06

Upgrade and re-snapshot playbook

Why upgrades are a coordinated release: Bumping llama-server changes HTTP behavior, tokenizer handling, CUDA kernels, and sometimes GGUF expectations. Treat every bump as a mini release: compile, run full validation, then redeploy.

  1. Read upstream release notes; search for breaking changes in server, ggml, CUDA backends
  2. Rebase local patches (e.g., Gemma thinking tags in common/chat.cpp)
  3. Run modal run deploy.py::build_llama_server --force against new tag
  4. Boot server against same GGUF and mmproj; run full validation
  5. Deploy to staging Modal app, force idle period, confirm restore works
  6. Promote: update production, redeploy, let new snap=True cycle capture golden image
  7. Keep previous binary addressable for rollback
Upgrade risk | Mitigation
Patch no longer applies cleanly | Cherry-pick upstream fixes first; reduce the custom diff to a minimum
New server rejects old API fields | Diff OpenAPI or README between tags; run contract tests (sketch below)
Restore works but quality regressed | Separate infra validation from model QA; run the eval harness before promotion
CUDA/driver coupling | Test the new binary with the same driver version; pin builder/runtime image digests
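
A sketch of the contract tests referenced above, run against a staging deployment of the new binary before promotion; the staging URL, env vars, and model id are illustrative, and the second test should be extended with every field your production clients actually send.

test-contract-sketch.py

import os
import requests

BASE_URL = os.environ.get("STAGING_URL", "https://<staging-app>.modal.run")
HEADERS = {"Authorization": f"Bearer {os.environ['GEMMA4_API_KEY']}"}

def test_models_endpoint_lists_the_model():
    r = requests.get(f"{BASE_URL}/v1/models", headers=HEADERS, timeout=30)
    r.raise_for_status()
    ids = [m["id"] for m in r.json()["data"]]
    assert any("gemma" in model_id.lower() for model_id in ids)

def test_chat_completions_accepts_fields_clients_already_send():
    # A 400 here means the new server rejects an old API field; do not promote.
    payload = {
        "model": "gemma-4-26b-a4b",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
        "temperature": 0.7,
    }
    r = requests.post(f"{BASE_URL}/v1/chat/completions",
                      headers=HEADERS, json=payload, timeout=120)
    assert r.status_code == 200, r.text
    assert "message" in r.json()["choices"][0]
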
upgrade.sh
# Step 1: Rebuild with new tag
modal run deploy.py::build_llama_server --force

# Step 2: Redeploy
modal deploy deploy.py

# Step 3: Verify
modal app logs <app-name>
curl -fsS https://<app>.modal.run/health

# Step 4: Smoke test
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4-26b-a4b","messages":[{"role":"user","content":"Hello"}],"max_tokens":16}'
07

Key takeaways

20 lessons learned from deploying Gemma 4 on Modal.

On model deployment

  1. Always verify the inference engine version matches your model. llama.cpp adds architecture support in specific builds.
  2. Build caching is essential. Compiling from source takes ~3 min. Cache in a persistent Modal Volume.
  3. Prefer the native server over Python bindings. llama-server gets features before llama-cpp-python.
  4. Zero-proxy is faster. Modal's @modal.web_server routes directly to llama-server.
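
A minimal sketch of that zero-proxy shape, assuming Modal's @modal.web_server decorator; the paths, flags, GPU type, and idle window are illustrative, and the production app layers API-key auth and GPU memory snapshots on top.

serve-sketch.py

import subprocess
import modal

app = modal.App("gemma4-llama-server-sketch")
PORT = 8080

runtime_image = modal.Image.from_registry(
    "nvidia/cuda:12.4.1-runtime-ubuntu22.04", add_python="3.11"
)
builds = modal.Volume.from_name("llama-cpp-builds")
models = modal.Volume.from_name("gemma4-weights")

@app.function(
    image=runtime_image,
    gpu="L40S",
    volumes={"/builds": builds, "/models": models},
    scaledown_window=300,  # scale to zero after 5 minutes idle
)
@modal.web_server(port=PORT, startup_timeout=600)
def serve():
    # Launch the source-built binary; Modal routes external traffic straight to
    # this port, with no Python proxy in the request path.
    subprocess.Popen([
        "/builds/master/llama-server",
        "--model", "/models/gemma-4-26b-a4b-it-q4_k_m.gguf",
        "--mmproj", "/models/mmproj.gguf",
        "--host", "0.0.0.0",
        "--port", str(PORT),
        "--flash-attn", "on",
        "--no-mmap",  # avoid mmap so checkpoint/restore can capture memory
    ])
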

On Gemma 4 specifically

  1. Thinking is adaptive. The model may not think for easy problems (Google AI thinking doc). Don't assume thinking will always happen.
  2. Use the interleaved template for agentic tasks. It preserves reasoning between tool calls.
  3. enable_thinking is per-request via chat_template_kwargs. Toggle without restarting the server.
  4. MoE = fast decode. Despite 26B params, only 3.8B fire per token. Throughput is closer to a 4B model.

On Modal

  1. GPU memory snapshots are transformative. 60-120s cold starts → 5-15s (Modal snapshot guide).
  2. --no-mmap for CRIU compatibility. Memory-mapped files break checkpoint/restore (llama-server README).
  3. Warmup before snapshot. CUDA kernel JIT happens on first inference. Capture compiled state.
  4. Volumes persist across deploys. Cache build artifacts and weights.
  5. Two-image strategy saves cost. Devel image for building, runtime image for serving.

On debugging

  1. Silent failures are the worst kind. thinking_budget_tokens was silently ignored. Write tests that verify parameters have effects.
  2. Read the full error message. Most deployment errors are version mismatches.
  3. Source patches are a valid strategy. Document clearly so they can be removed when upstream catches up.
  4. Check Modal's image cache. Use --force if changes aren't taking effect.
  5. Test with modal run before modal deploy. Ephemeral apps for quick testing.

On IDE and client compatibility

  1. The Responses API is the new frontier. Cursor and Codex CLI use /v1/responses, not /v1/chat/completions (llama.cpp Responses PR).
  2. Build from master for IDE compatibility. Tagged builds lag on API compatibility. The gap between b8678 and master fixed Cursor Agent mode.
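
A quick way to confirm the /v1/responses surface from the OpenAI SDK's Responses client, the same surface Cursor and Codex CLI rely on; the URL and model id are placeholders, as elsewhere on this page.

responses-check.py

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key=os.environ["GEMMA4_API_KEY"],
)

resp = client.responses.create(
    model="gemma-4-26b-a4b",
    input="In one sentence, what does the Responses API add over chat completions?",
    max_output_tokens=128,
)
print(resp.output_text)  # SDK convenience property that joins the text output items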
08

Model and deployment comparison

Stack | Primary strength | Typical trade-off
Gemma-4 GGUF + llama.cpp + Modal snapshots | Portable, cost-aware interactive serving, scale-to-zero | Less deterministic tool_choice semantics
GLM-5.1 FP8 + SGLang | High-throughput multi-GPU online serving | Higher infrastructure complexity and spend
Llama-3 dense + vLLM | Broad ecosystem, mature infra tooling | Less parameter efficiency vs MoE at similar quality
OpenAI / Anthropic APIs | Zero ops, best tool_choice compliance | Vendor lock-in, no scale-to-zero, higher per-token cost
09

When to choose this stack

Good fit

  • Internal APIs and IDE assistants
  • Bursty workloads where scale-to-zero matters
  • Interactive use cases needing fast decode
  • Projects requiring Apache 2.0 licensing
  • Vision + text combined workflows
  • Cost-sensitive deployments

Not ideal for

  • Strict OpenAI tool_choice determinism requirements
  • Maximum aggregate throughput across multi-GPU fleets
  • Audio input requirements (use E2B/E4B variants)
  • Production systems requiring SLA guarantees

Pair this tab with Runtime tuning for tuning guidance and APIs & clients for client contracts.

10

External references

Sources cited inline on this tab.