Gemma-4-26B-A4B-it-GGUF on Modal
Operate & compare
Runbooks, known issues, monitoring, upgrades, and when this stack is the right fit.
Complete operational guide covering every issue encountered during deployment, root cause analysis, fixes, monitoring, upgrade procedures, and the 20 key lessons learned from production. Cross-check infra behavior with Modal memory snapshots and llama.cpp release notes when upgrading.
Complete issue runbook
Every issue encountered during deployment with root cause and fix.
| # | Symptom | Root cause | Fix |
|---|---|---|---|
| 1 | unknown model architecture: 'gemma4' | GHCR Docker image pinned to build 8202; Gemma 4 support added in b8665 | Build llama-server from source at b8678+, cache in Modal Volume |
| 2 | Thinking tokens not visible (~0 thinking tokens) | llama-cpp-python's create_chat_completion didn't pass enable_thinking to template | Switched to llama-server subprocess which handles this natively |
| 3 | 401 Unauthorized during local testing | GEMMA4_API_KEY env var not set in local shell | Source .env file before running scripts: source .env |
| 4 | --flash-attn flag syntax error | Newer builds require explicit value, not bare flag | Changed from --flash-attn to --flash-attn on |
| 5 | AsyncUsageWarning in local entrypoint | Async function calling sync Modal method | Use .aio() variants in async functions |
| 6 | Build from source showing old build number | Modal image layer caching returned stale binary | Move build to Modal Volume (decouples from image cache), or use --force |
| 7 | Missing hf_xet for fast downloads | huggingface-hub installed without hf_xet extra | Install huggingface-hub[hf_xet] + set HF_XET_HIGH_PERFORMANCE=1 |
| 8 | thinking_budget_tokens silently ignored | Gemma 4 parser (PR #21418) omits thinking_start_tag/thinking_end_tag that budget sampler needs | Patch common/chat.cpp to add <|channel>thought\n and <channel|> before compilation |
| 9 | Cursor IDE Agent mode tools not working | b8678's /v1/responses had incomplete tool-calling round-trips | Update LLAMA_CPP_TAG to master, rebuild, redeploy |
Issue #8 is critical: Without the thinking-tag patch, budget controls are silently ignored for Gemma 4 (Gemma 4 parser PR). The budget block in server-common.cpp is skipped entirely because thinking_end_tag is empty. Always verify budget behavior after builds.
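Because this failure mode is silent, it is worth an automated check after every rebuild. Below is a minimal sketch, assuming the server accepts a thinking_budget_tokens field in the request body (the parameter discussed above); GEMMA4_BASE_URL is an illustrative environment variable, and the prompt and thresholds are placeholders.

```python
# Hedged post-build check: verify that thinking_budget_tokens actually changes output length.
# Assumes the server accepts thinking_budget_tokens in the request body; GEMMA4_BASE_URL is
# an illustrative env var, GEMMA4_API_KEY matches the one used elsewhere in this guide.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["GEMMA4_BASE_URL"],  # e.g. https://<app>.modal.run/v1
    api_key=os.environ["GEMMA4_API_KEY"],
)

def completion_tokens(budget: int) -> int:
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
        max_tokens=2048,
        extra_body={"thinking_budget_tokens": budget},
    )
    return resp.usage.completion_tokens

low, high = completion_tokens(32), completion_tokens(1024)
# If the thinking-tag patch is missing, the budget is ignored and both runs look similar.
print(f"budget=32 -> {low} tokens, budget=1024 -> {high} tokens")
assert low < high, "thinking_budget_tokens appears to be ignored; check the chat.cpp patch"
```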
Architecture evolution issues
Why the deployment went through three phases before the final design.
| Phase | Approach | Why it failed / limitations |
|---|---|---|
| Phase 1 | GHCR prebuilt image (ghcr.io/ggml-org/llama.cpp:server-cuda) | Pinned to build 8202, predates Gemma 4 support (b8665) |
| Phase 2 | llama-cpp-python in-process inference | No multimodal, no native tools, thinking tokens invisible, proxy overhead |
| Phase 3 (Final) | llama-server subprocess, build from source, Modal Volume cache | All features work: vision, tools, thinking, /v1/responses, zero-proxy |
The llama-server binary problem
There is no reliable source of a "latest stable" CUDA-enabled llama-server binary for Linux. The options:
- GHCR Docker image: Often outdated (was pinned to 8202 when we needed 8665+)
- GitHub Releases: No Linux CUDA binaries for most builds
- PyPI (llama-cpp-python): Missing vision, tools, thinking
- Build from source: The only reliable option; cache the binary in a Modal Volume
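A minimal sketch of the build-and-cache approach follows. The app and volume names, base image, tag, and cmake flags are illustrative and differ from the real deploy.py; the shape of the idea is what matters: compile once, write the binary into a Volume, reuse it on every deploy.

```python
# Sketch: compile llama-server once per tag and cache the binary in a Modal Volume.
# App/volume names, base image, tag, and cmake flags are illustrative, not the real deploy.py.
import modal

app = modal.App("llama-server-builder")
cache = modal.Volume.from_name("llama-cpp-cache", create_if_missing=True)

build_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential", "libcurl4-openssl-dev")
)

@app.function(image=build_image, volumes={"/cache": cache}, timeout=3600)
def build_llama_server(tag: str = "master", force: bool = False) -> str:
    import pathlib
    import subprocess

    out = pathlib.Path(f"/cache/llama-server-{tag}")
    if out.exists() and not force:
        return str(out)  # reuse the cached binary instead of recompiling

    subprocess.run(["git", "clone", "--depth", "1", "--branch", tag,
                    "https://github.com/ggml-org/llama.cpp", "/src"], check=True)
    subprocess.run(["cmake", "-B", "/src/build", "-S", "/src", "-DGGML_CUDA=ON"], check=True)
    subprocess.run(["cmake", "--build", "/src/build", "--target", "llama-server", "-j"], check=True)

    out.write_bytes(pathlib.Path("/src/build/bin/llama-server").read_bytes())
    out.chmod(0o755)
    cache.commit()  # persist so later deploys skip the compile entirely
    return str(out)
```

Keeping the binary in a Volume rather than an image layer is what sidesteps the stale-cache behavior described in issue #6 above.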
Tool-calling operations
| Mode | Observed behavior | Root cause | Defensive pattern |
|---|---|---|---|
| tool_choice=auto | Generally reliable | Default server/model path | Preferred mode for production agents. |
| tool_choice=required | Can emit empty tool payload | Grammar activates but output can't be parsed | Validate tool_calls server-side and retry with stricter system message. |
| Specific named function | Falls back to auto silently | Server parses tool_choice as string only; object form ignored | Send only one tool in tools array to force it. |
| tool_choice=none | May leak raw tokens | Parser suppresses extraction but not generation | Reject response if tool_calls present and reissue. |
Keep tool execution idempotent and log every retry chain with request IDs so operator debugging is deterministic. Cross-check client expectations with the OpenAI tool calling contract.
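A hedged sketch of the defensive pattern for tool_choice=required: validate the returned tool_calls on your side and retry once with a stricter system message when the payload is empty or unparsable. The tool schema, messages, and retry policy are illustrative.

```python
# Sketch: server-side validation of tool_calls with one stricter retry.
# Tool schema, system messages, and retry policy are illustrative.
import json

from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def call_with_validation(user_msg: str, request_id: str):
    system = "You are a helpful assistant."
    for attempt in range(2):
        resp = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user_msg}],
            tools=TOOLS,
            tool_choice="required",
        )
        calls = resp.choices[0].message.tool_calls or []
        try:
            # Empty payloads or unparsable arguments count as failures.
            if calls and all(json.loads(c.function.arguments) for c in calls):
                return calls
        except json.JSONDecodeError:
            pass
        print(f"[{request_id}] attempt {attempt}: invalid tool_calls, retrying")
        system += " You MUST respond with exactly one well-formed tool call."
    raise RuntimeError(f"[{request_id}] no valid tool call after retries")
```

Logging the request ID on every retry, as above, keeps the retry chain reconstructible during incident review.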
Error handling
Common errors and how to handle them in client code.
| Status | Cause | Fix |
|---|---|---|
| 401 | Missing or invalid API key | Set Authorization: Bearer <key> header |
| 408 | Request timeout (slow generation) | Increase client timeout; reduce max_tokens |
| 503 | Server starting up (cold start) | Retry after 5-15 seconds |
| Empty content | Model spent all tokens on thinking | Increase max_tokens; disable thinking for simple queries |
```python
import time

from openai import OpenAI, APITimeoutError

client = OpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    timeout=120.0,  # generous timeout for cold starts
)

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=64,
        )
        break
    except APITimeoutError:
        if attempt < 2:
            time.sleep(5)
            continue
        raise
```
Monitoring and diagnostics
- Track warm-path and post-idle restore TTFT separately—these are different user experiences.
- Scrape /metrics for token counters, queue behavior, and error spikes (Prometheus concepts).
- Monitor GPU memory headroom to catch OOM before snapshot capture.
- Alert on repeated restore-health failures after idle windows.
- Log the final argv at info level (redact secrets) so incidents show whether flags were correct.
| Signal | Where to look |
|---|---|
| Request volume and error rate | Server logs + HTTP status histogram; spike in 5xx after idle points to restore or OOM |
| Tokens/sec and queue depth | Scrape /metrics for token counters |
| GPU memory headroom | nvidia-smi in debug container or platform metrics |
| Restore health failures | Modal logs after idle periods |
```bash
# Health check
curl -f https://<app>.modal.run/health

# Prometheus metrics
curl https://<app>.modal.run/metrics

# Modal logs
modal app logs <app-name>

# List available models (verify deployment)
curl https://<app>.modal.run/v1/models \
  -H "Authorization: Bearer $API_KEY"
```
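The curl call above dumps the raw Prometheus exposition; for quick ad-hoc diagnostics the same endpoint can be filtered programmatically. A minimal sketch follows; the metric names in WATCHED are placeholders, so check your build's actual /metrics output before wiring this into a dashboard.

```python
# Sketch: scrape /metrics and print a few counters for ad-hoc diagnostics.
# Metric names are placeholders; inspect your build's /metrics output for the real ones.
import urllib.request

METRICS_URL = "https://<app>.modal.run/metrics"
WATCHED = ("prompt_tokens_total", "tokens_predicted_total", "requests_deferred")  # illustrative

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    text = resp.read().decode()

for line in text.splitlines():
    if line.startswith("#"):
        continue  # skip HELP/TYPE comment lines
    name, _, value = line.partition(" ")
    if any(watched in name for watched in WATCHED):
        print(f"{name} = {value}")
```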
Upgrade and re-snapshot playbook
Why upgrades are a coordinated release: Bumping llama-server changes HTTP behavior, tokenizer handling, CUDA kernels, and sometimes GGUF expectations. Treat every bump as a mini release: compile, run full validation, then redeploy.
- Read upstream release notes; search for breaking changes in server, ggml, and CUDA backends
- Rebase local patches (e.g., Gemma thinking tags in common/chat.cpp)
- Run modal run deploy.py::build_llama_server --force against the new tag
- Boot the server against the same GGUF and mmproj; run full validation
- Deploy to a staging Modal app, force an idle period, confirm restore works
- Promote: update production, redeploy, let a new snap=True cycle capture the golden image
- Keep the previous binary addressable for rollback
| Upgrade risk | Mitigation |
|---|---|
| Patch no longer applies cleanly | Cherry-pick upstream fixes first; reduce custom diff to minimum |
| New server rejects old API fields | Diff OpenAPI or README between tags; run contract tests |
| Restore works but quality regressed | Separate infra validation from model QA—run eval harness before promotion |
| CUDA/driver coupling | Test new binary with same driver version; pin builder/runtime image digests |
```bash
# Step 1: Rebuild with new tag
modal run deploy.py::build_llama_server --force

# Step 2: Redeploy
modal deploy deploy.py

# Step 3: Verify
modal app logs <app-name>
curl -fsS https://<app>.modal.run/health

# Step 4: Smoke test
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4-26b-a4b","messages":[{"role":"user","content":"Hello"}],"max_tokens":16}'
```
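Beyond the single smoke test in step 4, a small contract test against the staging app helps catch the "new server rejects old API fields" risk before promotion. A hedged sketch; the fields exercised are the ones this deployment relies on, and STAGING_BASE_URL is an illustrative environment variable.

```python
# Sketch: minimal contract test to run against the staging app before promotion.
# Verifies that the request fields this deployment depends on are still accepted.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["STAGING_BASE_URL"],  # illustrative staging endpoint
    api_key=os.environ["GEMMA4_API_KEY"],
)

def test_chat_completion_contract():
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "Say OK."}],
        max_tokens=8,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # field clients rely on
    )
    assert resp.choices[0].message.content is not None
    assert resp.usage is not None and resp.usage.completion_tokens > 0

if __name__ == "__main__":
    test_chat_completion_contract()
    print("contract test passed")
```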
Key takeaways
20 lessons learned from deploying Gemma 4 on Modal.
On model deployment
- Always verify the inference engine version matches your model. llama.cpp adds architecture support in specific builds.
- Build caching is essential. Compiling from source takes ~3 min. Cache in a persistent Modal Volume.
- Prefer the native server over Python bindings. llama-server gets features before llama-cpp-python.
- Zero-proxy is faster. Modal's @modal.web_server routes directly to llama-server.
On Gemma 4 specifically
- Thinking is adaptive. The model may not think for easy problems (Google AI thinking doc). Don't assume thinking will always happen.
- Use the interleaved template for agentic tasks. It preserves reasoning between tool calls.
- enable_thinking is per-request via chat_template_kwargs. Toggle without restarting the server (see the sketch after this list).
- MoE = fast decode. Despite 26B params, only 3.8B fire per token. Throughput is closer to a 4B model.
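For the enable_thinking point above, a minimal sketch of per-request toggling, assuming llama-server forwards chat_template_kwargs from the request body to the chat template as this deployment relies on; the prompts are illustrative.

```python
# Sketch: toggle Gemma 4 thinking per request via chat_template_kwargs.
# Assumes llama-server forwards this field to the chat template, as used in this deployment.
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

def ask(prompt: str, thinking: bool) -> str:
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
    return resp.choices[0].message.content

print(ask("What is 2 + 2?", thinking=False))                     # simple query: skip thinking
print(ask("Plan a 3-step debugging strategy.", thinking=True))   # harder query: allow thinking
```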
On Modal
- GPU memory snapshots are transformative. 60-120s cold starts → 5-15s (Modal snapshot guide).
- --no-mmap for CRIU compatibility. Memory-mapped files break checkpoint/restore (llama-server README).
- Warmup before snapshot. CUDA kernel JIT happens on first inference. Capture compiled state (see the sketch after this list).
- Volumes persist across deploys. Cache build artifacts and weights.
- Two-image strategy saves cost. Devel image for building, runtime image for serving.
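A minimal sketch tying the snapshot points together, assuming a Modal class-based server. Class, volume, and GPU names, paths, and the warmup prompt are illustrative and differ from the real deploy.py; enabling GPU memory capture itself needs the additional option described in the Modal snapshot guide, which is omitted here.

```python
# Sketch: start llama-server and warm it up inside the snapshotted entrypoint, so the
# captured state already includes compiled CUDA kernels. All names and paths are illustrative.
import json
import subprocess
import time
import urllib.request

import modal

app = modal.App("gemma4-server-sketch")
weights = modal.Volume.from_name("gemma4-weights")

@app.cls(gpu="L40S", enable_memory_snapshot=True, volumes={"/weights": weights})
class Server:
    @modal.enter(snap=True)  # everything done here is baked into the snapshot
    def start_and_warm(self):
        self.proc = subprocess.Popen([
            "/weights/llama-server", "-m", "/weights/model.gguf",
            "--no-mmap",  # memory-mapped weights break checkpoint/restore
            "--host", "0.0.0.0", "--port", "8080",
        ])
        while True:  # wait for /health before sending the warmup request
            try:
                urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2)
                break
            except OSError:
                time.sleep(1)
        body = json.dumps({
            "model": "gemma-4-26b-a4b",
            "messages": [{"role": "user", "content": "warmup"}],
            "max_tokens": 8,
        }).encode()
        req = urllib.request.Request(
            "http://127.0.0.1:8080/v1/chat/completions",
            data=body, headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=120)  # first inference triggers CUDA kernel JIT

    @modal.web_server(8080)
    def serve(self):
        pass  # requests go straight to llama-server (zero-proxy)
```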
On debugging
- Silent failures are the worst kind. thinking_budget_tokens was silently ignored. Write tests that verify parameters have effects.
- Read the full error message. Most deployment errors are version mismatches.
- Source patches are a valid strategy. Document clearly so they can be removed when upstream catches up.
- Check Modal's image cache. Use --force if changes aren't taking effect.
- Test with modal run before modal deploy. Ephemeral apps for quick testing.
On IDE and client compatibility
- The Responses API is the new frontier. Cursor and Codex CLI use /v1/responses, not /v1/chat/completions (llama.cpp Responses PR); see the sketch after this list.
- Build from master for IDE compatibility. Tagged builds lag on API compatibility. The gap between b8678 and master fixed Cursor Agent mode.
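A minimal sketch of exercising /v1/responses directly, assuming the deployed build includes the Responses API support referenced above; the prompt is illustrative.

```python
# Sketch: hit /v1/responses the way Cursor and Codex CLI do, via the OpenAI Responses client.
# Assumes the deployed llama-server build includes Responses API support.
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

resp = client.responses.create(
    model="gemma-4-26b-a4b",
    input="List three uses of a Modal Volume.",
)
print(resp.output_text)
```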
Model and deployment comparison
| Stack | Primary strength | Typical trade-off |
|---|---|---|
| Gemma-4 GGUF + llama.cpp + Modal snapshots | Portable, cost-aware interactive serving, scale-to-zero | Less deterministic tool_choice semantics |
| GLM-5.1 FP8 + SGLang | High-throughput multi-GPU online serving | Higher infrastructure complexity and spend |
| Llama-3 dense + vLLM | Broad ecosystem, mature infra tooling | Less parameter efficiency vs MoE at similar quality |
| OpenAI / Anthropic APIs | Zero ops, best tool_choice compliance | Vendor lock-in, no scale-to-zero, higher per-token cost |
When to choose this stack
Good fit
- Internal APIs and IDE assistants
- Bursty workloads where scale-to-zero matters
- Interactive use cases needing fast decode
- Projects requiring Apache 2.0 licensing
- Vision + text combined workflows
- Cost-sensitive deployments
Not ideal for
- Strict OpenAI tool_choice determinism requirements
- Maximum aggregate throughput across multi-GPU fleets
- Audio input requirements (use E2B/E4B variants)
- Production systems requiring SLA guarantees
Pair this tab with the Runtime tuning tab for performance tuning and the APIs & clients tab for client contracts.
External references
Sources cited inline on this tab.
- Modal: Serve and scale (modal.com)
- Modal Volumes (modal.com)
- Modal memory snapshots (modal.com)
- Modal: High-performance LLM inference (modal.com)
- Modal Web Server (modal.com)
- llama.cpp GitHub (github.com)
- llama.cpp server README (github.com)
- llama.cpp Gemma 4 Parser (PR #21418) (github.com)
- llama.cpp Responses API (PR #18486) (github.com)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)
- Gemma 4 Thinking Capabilities (ai.google.dev)
- Prometheus: Introduction (prometheus.io)
- OpenAI API: Create chat completion (platform.openai.com)