> **Note:** The canonical experience is the interactive HTML tab: [Operate & compare](https://www.quantml.org/guides/gemma-4-gguf/operations). This file is a text mirror for search engines and AI tools.

# Gemma 4 GGUF — Operate & compare

Complete operational guide covering every issue encountered during deployment, root cause analysis, fixes, monitoring, upgrade procedures, and the 20 key lessons learned from production. Cross-check infra behavior with [Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot) and [llama.cpp](https://github.com/ggml-org/llama.cpp) release notes when upgrading.

## 1. Complete issue runbook {#runbook}

_Every issue encountered during deployment with root cause and fix._

| # | Symptom | Root cause | Fix |
|---:|---|---|---|
| 1 | `unknown model architecture: 'gemma4'` | GHCR Docker image pinned to b8202; Gemma 4 support landed in b8665 | Build llama-server from source at b8678+, cache in Modal Volume |
| 2 | Thinking tokens not visible (~0 thinking tokens) | `llama-cpp-python`'s `create_chat_completion` didn't pass `enable_thinking` to template | Switched to llama-server subprocess which handles this natively |
| 3 | `401 Unauthorized` during local testing | `GEMMA4_API_KEY` env var not set in local shell | Source .env file before running scripts: `source .env` |
| 4 | `--flash-attn` flag syntax error | Newer builds require explicit value, not bare flag | Changed from `--flash-attn` to `--flash-attn on` |
| 5 | AsyncUsageWarning in local entrypoint | Async function calling sync Modal method | Use `.aio()` variants in async functions |
| 6 | Build from source showing old build number | Modal image layer caching returned stale binary | Move build to Modal Volume (decouples from image cache), or use `--force` |
| 7 | Missing `hf_xet` for fast downloads | `huggingface-hub` installed without `hf_xet` extra | Install `huggingface-hub[hf_xet]` + set `HF_XET_HIGH_PERFORMANCE=1` |
| 8 | `thinking_budget_tokens` silently ignored | Gemma 4 parser (PR #21418) omits `thinking_start_tag` / `thinking_end_tag` that budget sampler needs | Patch `common/chat.cpp` to add `<|channel>thought\n` and `<channel|>` before compilation |
| 9 | Cursor IDE Agent mode tools not working | b8678's `/v1/responses` had incomplete tool-calling round-trips | Update `LLAMA_CPP_TAG` to `master`, rebuild, redeploy |

> **Issue #8 is critical:** Without the thinking-tag patch, budget controls are silently ignored for Gemma 4 ([Gemma 4 parser PR](https://github.com/ggml-org/llama.cpp/pull/21418)). The budget block in `server-common.cpp` is skipped entirely because `thinking_end_tag` is empty. Always verify budget behavior after builds.
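
To make that check concrete, here is a minimal verification sketch. It sends the same reasoning-heavy prompt with a tight and a generous `thinking_budget_tokens` (passed through the OpenAI SDK's `extra_body`) and compares completion-token usage; the assumption that a working budget cap shows up as a noticeably lower `completion_tokens` count is illustrative, and the endpoint and key are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

def completion_tokens_with_budget(budget: int) -> int:
    """Send a reasoning-heavy prompt with a given thinking budget, return token usage."""
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "Prove that the sum of two odd numbers is even."}],
        max_tokens=2048,
        # extra_body forwards non-OpenAI fields to llama-server unchanged.
        extra_body={"thinking_budget_tokens": budget},
    )
    return response.usage.completion_tokens

tight = completion_tokens_with_budget(32)
loose = completion_tokens_with_budget(4096)
print(f"budget=32 -> {tight} tokens, budget=4096 -> {loose} tokens")
# If the thinking-tag patch is active, the tight budget should use noticeably fewer tokens.
assert tight < loose, "thinking_budget_tokens appears to be ignored (see issue #8)"
```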

## 2. Architecture evolution issues {#architecture-issues}

_Why the deployment went through three phases before the final design._

| Phase | Approach | Why it failed / limitations |
|---|---|---|
| Phase 1 | GHCR prebuilt image (`ghcr.io/ggml-org/llama.cpp:server-cuda`) | Pinned to b8202, which predates Gemma 4 support (added in b8665) |
| Phase 2 | `llama-cpp-python` in-process inference | No multimodal, no native tools, thinking tokens invisible, proxy overhead |
| Phase 3 (Final) | llama-server subprocess, build from source, Modal Volume cache | All features work: vision, tools, thinking, `/v1/responses`, zero-proxy |

### The llama-server binary problem

There is no reliable source of a "latest stable" CUDA-enabled [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) binary for Linux. The options:
- **GHCR Docker image:** Often outdated (was pinned to b8202 when we needed b8665+)
- **GitHub Releases:** No Linux CUDA binaries for most builds
- **PyPI (`llama-cpp-python`):** Missing vision, tools, thinking
- **Build from source:** The only reliable option—cache in a [Modal Volume](https://modal.com/docs/guide/volumes)
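
As a rough illustration of the build-from-source option, the sketch below compiles llama-server inside a CUDA devel image and stores the binary in a Modal Volume so later deploys skip the compile. The volume name, image tag, and CMake flags are assumptions; the actual `deploy.py::build_llama_server` used elsewhere in this guide pins a specific llama.cpp tag (b8678+ per the runbook) and differs in detail.

```python
import modal

app = modal.App("llama-build-sketch")
volume = modal.Volume.from_name("llama-server-cache", create_if_missing=True)  # assumed volume name

build_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential", "libcurl4-openssl-dev")
)

@app.function(image=build_image, volumes={"/cache": volume}, timeout=60 * 60)
def build_llama_server(tag: str = "master", force: bool = False) -> str:
    import pathlib
    import shutil
    import subprocess

    binary = pathlib.Path("/cache/llama-server")
    if binary.exists() and not force:
        return str(binary)  # reuse the cached build instead of recompiling

    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", tag,
         "https://github.com/ggml-org/llama.cpp", "/tmp/llama.cpp"],
        check=True,
    )
    subprocess.run(
        ["cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp", "-DGGML_CUDA=ON"],
        check=True,
    )
    subprocess.run(
        ["cmake", "--build", "/tmp/llama.cpp/build", "--target", "llama-server", "-j"],
        check=True,
    )
    shutil.copy2("/tmp/llama.cpp/build/bin/llama-server", binary)
    binary.chmod(0o755)
    volume.commit()  # persist the binary so runtime containers can mount it
    return str(binary)
```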

## 3. Tool-calling operations {#tools}

| Mode | Observed behavior | Root cause | Defensive pattern |
|---|---|---|---|
| `tool_choice=auto` | Generally reliable | Default server/model path | Preferred mode for production agents. |
| `tool_choice=required` | Can emit empty tool payload | Grammar activates but output can't be parsed | Validate `tool_calls` server-side and retry with stricter system message. |
| Specific named function | Falls back to auto silently | Server parses `tool_choice` as string only; object form ignored | Send only one tool in `tools` array to force it. |
| `tool_choice=none` | May leak raw tokens | Parser suppresses extraction but not generation | Reject response if `tool_calls` present and reissue. |

Keep tool execution idempotent and log every retry chain with request IDs so operator debugging is deterministic. Cross-check client expectations with the [OpenAI tool calling contract](https://platform.openai.com/docs/api-reference/chat/create).
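
A defensive client-side pattern for the `required` row above might look like the sketch below: validate that the returned `tool_calls` carry parseable JSON arguments and retry with a stricter system message otherwise. The tool definition, retry policy, and helper names are illustrative, not part of the server.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def call_with_tool_validation(messages: list[dict], max_retries: int = 2):
    """Retry when tool_choice=required produces an empty or unparseable tool payload."""
    for attempt in range(max_retries + 1):
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=messages,
            tools=TOOLS,
            tool_choice="required",
        )
        calls = response.choices[0].message.tool_calls or []
        valid = []
        for call in calls:
            try:
                json.loads(call.function.arguments)  # arguments must be valid JSON
                valid.append(call)
            except json.JSONDecodeError:
                pass
        if valid:
            return valid
        # Log the retry chain with the request id so debugging stays deterministic.
        print(f"attempt {attempt}: empty/invalid tool payload, request id={response.id}")
        messages = messages + [{
            "role": "system",
            "content": "You must call exactly one provided tool with valid JSON arguments.",
        }]
    raise RuntimeError("tool_choice=required never produced a valid tool call")
```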

## 4. Error handling {#error-handling}

_Common errors and how to handle them in client code._

| Status | Cause | Fix |
|---|---|---|
| 401 | Missing or invalid API key | Set `Authorization: Bearer <key>` header |
| 408 | Request timeout (slow generation) | Increase client timeout; reduce `max_tokens` |
| 503 | Server starting up (cold start) | Retry after 5-15 seconds |
| Empty content | Model spent all tokens on thinking | Increase `max_tokens`; disable thinking for simple queries (second snippet below) |

```python
import time
from openai import OpenAI, APITimeoutError

client = OpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    timeout=120.0,  # generous timeout for cold starts
)

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=64,
        )
        break
    except APITimeoutError:
        if attempt < 2:
            time.sleep(5)
            continue
        raise
```
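
Reusing the `client` from the snippet above, a hedged sketch for the 'Empty content' row (the second snippet referenced in the table): if the reply comes back empty because the whole budget went to thinking, retry with a larger `max_tokens` and thinking disabled via the per-request `chat_template_kwargs` toggle described in the takeaways below.

```python
def chat_with_empty_content_fallback(prompt: str) -> str:
    """Retry once with a larger budget and thinking disabled if the first reply is empty."""
    first = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    content = first.choices[0].message.content
    if content:
        return content
    # All tokens likely went to thinking: raise max_tokens and turn thinking off for the retry.
    retry = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    return retry.choices[0].message.content or ""
```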

## 5. Monitoring and diagnostics {#monitoring}

- Track warm-path and post-idle restore TTFT separately; these are different user experiences (a measurement sketch follows the signal table below).
- Scrape `/metrics` for token counters, queue behavior, and error spikes ([Prometheus concepts](https://prometheus.io/docs/introduction/overview/)).
- Monitor GPU memory headroom to catch OOM before snapshot capture.
- Alert on repeated restore-health failures after idle windows.
- Log the final argv at info level (redact secrets) so incidents show whether flags were correct.

| Signal | Where to look |
|---|---|
| Request volume and error rate | Server logs + HTTP status histogram; spike in 5xx after idle points to restore or OOM |
| Tokens/sec and queue depth | Scrape `/metrics` for token counters |
| GPU memory headroom | `nvidia-smi` in debug container or platform metrics |
| Restore health failures | Modal logs after idle periods |
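
A minimal sketch for the first bullet above: measure TTFT with a streaming request, once while the deployment is warm and again after an idle window, so the two paths are never averaged together. The endpoint, key, and prompt are placeholders.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

def measure_ttft(label: str) -> float:
    """Return seconds from request start to the first streamed content token."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=8,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            ttft = time.monotonic() - start
            print(f"{label}: TTFT {ttft:.2f}s")
            return ttft
    return float("nan")

measure_ttft("warm")       # while a container is warm
# ...wait past the scaledown window, then measure the restore path:
measure_ttft("post-idle")
```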

```bash
# Health check
curl -f https://<app>.modal.run/health

# Prometheus metrics
curl https://<app>.modal.run/metrics

# Modal logs
modal app logs <app-name>

# List available models (verify deployment)
curl https://<app>.modal.run/v1/models \
  -H "Authorization: Bearer $API_KEY"
```

## 6. Upgrade and re-snapshot playbook {#upgrades}

**Why upgrades are a coordinated release:** Bumping llama-server changes HTTP behavior, tokenizer handling, CUDA kernels, and sometimes GGUF expectations. Treat every bump as a **mini release**: compile, run full validation, then redeploy.

1. Read upstream release notes; search for breaking changes in server, ggml, CUDA backends
2. Rebase local patches (e.g., Gemma thinking tags in `common/chat.cpp`)
3. Run `modal run deploy.py::build_llama_server --force` against new tag
4. Boot server against same GGUF and mmproj; run full validation
5. Deploy to staging Modal app, force idle period, confirm restore works
6. Promote: update production, redeploy, and let the new `snap=True` cycle capture a fresh golden image
7. Keep previous binary addressable for rollback

| Upgrade risk | Mitigation |
|---|---|
| Patch no longer applies cleanly | Cherry-pick upstream fixes first; reduce custom diff to minimum |
| New server rejects old API fields | Diff OpenAPI or README between tags; run contract tests (see the sketch after the commands below) |
| Restore works but quality regressed | Separate infra validation from model QA—run eval harness before promotion |
| CUDA/driver coupling | Test new binary with same driver version; pin builder/runtime image digests |

```bash
# Step 1: Rebuild with new tag
modal run deploy.py::build_llama_server --force

# Step 2: Redeploy
modal deploy deploy.py

# Step 3: Verify
modal app logs <app-name>
curl -fsS https://<app>.modal.run/health

# Step 4: Smoke test
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4-26b-a4b","messages":[{"role":"user","content":"Hello"}],"max_tokens":16}'
```
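
A small post-upgrade contract test for step 4 and the "rejects old API fields" mitigation might look like the sketch below, run against the staging app before promotion. The assertions cover only the fields this guide relies on; they are not an exhaustive compatibility suite, and the endpoint and key are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="https://<staging-app>.modal.run/v1", api_key="YOUR_KEY")

# 1. The expected model id is still served.
model_ids = [m.id for m in client.models.list().data]
assert "gemma-4-26b-a4b" in model_ids, model_ids

# 2. Chat completions still accept the fields this guide depends on.
chat = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Say OK"}],
    max_tokens=16,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
assert chat.choices[0].message.content, "empty completion after upgrade"

# 3. Tool calling still round-trips (tool_choice=auto is the production mode).
tools = [{
    "type": "function",
    "function": {
        "name": "echo",
        "description": "Echo a string back",
        "parameters": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
}]
tooled = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Use the echo tool to say hi."}],
    tools=tools,
    tool_choice="auto",
    max_tokens=128,
)
assert tooled.choices[0].message.tool_calls or tooled.choices[0].message.content
print("contract checks passed")
```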

## 7. Key takeaways {#key-takeaways}

_20 lessons learned from deploying Gemma 4 on Modal._

### On model deployment (1-4)
1. **Always verify the inference engine version matches your model.** [llama.cpp](https://github.com/ggml-org/llama.cpp) adds architecture support in specific builds.
2. **Build caching is essential.** Compiling from source takes ~3 min. Cache in a persistent [Modal Volume](https://modal.com/docs/guide/volumes).
3. **Prefer the native server over Python bindings.** [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) gets features before `llama-cpp-python`.
4. **Zero-proxy is faster.** [Modal's `@modal.web_server`](https://modal.com/docs/guide/webhooks#web-server) routes directly to llama-server.

### On Gemma 4 specifically (5-8)
5. **Thinking is adaptive.** The model may not think for easy problems ([Google AI thinking doc](https://ai.google.dev/gemma/docs/capabilities/thinking)). Don't assume thinking will always happen.
6. **Use the interleaved template for agentic tasks.** It preserves reasoning between tool calls.
7. **`enable_thinking` is per-request via `chat_template_kwargs`.** Toggle without restarting the server (see the sketch after this list).
8. **MoE = fast decode.** Despite 26B params, only 3.8B fire per token. Throughput is closer to a 4B model.
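
A hedged example of takeaway 7 (the sketch referenced above): the toggle rides along in the request body via the OpenAI SDK's `extra_body`, so switching thinking on or off needs no server restart.

```python
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

def ask(prompt: str, thinking: bool) -> str:
    return client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        # Per-request toggle: a template kwarg, not a server flag.
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    ).choices[0].message.content or ""

print(ask("What is 2 + 2?", thinking=False))                  # simple query, skip thinking
print(ask("Plan a three-step refactor of a parser.", thinking=True))
```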

### On Modal (9-13)
9. **GPU memory snapshots are transformative.** 60-120s cold starts -> 5-15s ([Modal snapshot guide](https://modal.com/docs/guide/memory-snapshot)).
10. **`--no-mmap` for CRIU compatibility.** Memory-mapped files break checkpoint/restore ([llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)).
11. **Warmup before snapshot.** CUDA kernel JIT happens on first inference; capture the compiled state (a warmup sketch follows this list).
12. **Volumes persist across deploys.** Cache build artifacts and weights.
13. **Two-image strategy saves cost.** Devel image for building, runtime image for serving.
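
A sketch of the warmup step from takeaway 11: a helper intended to run inside the method decorated with `@modal.enter(snap=True)`, after llama-server reports healthy and before the snapshot is captured. The local URL, timings, and one-token prompt are illustrative; the actual startup code lives on the deployment tab.

```python
import json
import time
import urllib.request

def warm_up(base_url: str = "http://127.0.0.1:8080", deadline_s: float = 300.0) -> None:
    """Poll /health, then issue one tiny completion so CUDA kernels are compiled pre-snapshot."""
    start = time.monotonic()
    while True:
        try:
            urllib.request.urlopen(f"{base_url}/health", timeout=2)
            break
        except OSError:
            if time.monotonic() - start > deadline_s:
                raise RuntimeError("llama-server never became healthy before snapshot capture")
            time.sleep(1)

    body = json.dumps({
        "model": "gemma-4-26b-a4b",
        "messages": [{"role": "user", "content": "warmup"}],
        "max_tokens": 1,
    }).encode()
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},  # add Authorization here if --api-key is set
    )
    urllib.request.urlopen(request, timeout=120)
    # The JIT-compiled kernel state is now part of whatever snapshot is captured next.
```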

### On debugging (14-18)
14. **Silent failures are the worst kind.** `thinking_budget_tokens` was silently ignored. Write tests that verify parameters have effects.
15. **Read the full error message.** Most deployment errors are version mismatches.
16. **Source patches are a valid strategy.** Document clearly so they can be removed when upstream catches up.
17. **Check Modal's image cache.** Use `--force` if changes aren't taking effect.
18. **Test with `modal run` before `modal deploy`.** Ephemeral apps for quick testing.

### On IDE and client compatibility (19-20)
19. **The Responses API is the new frontier.** Cursor and Codex CLI use `/v1/responses`, not `/v1/chat/completions` ([llama.cpp Responses PR](https://github.com/ggml-org/llama.cpp/pull/18486)); a request sketch follows this list.
20. **Build from master for IDE compatibility.** Tagged builds lag on API compatibility. The gap between b8678 and master fixed Cursor Agent mode.
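
To sanity-check takeaway 19 against a fresh build, a minimal request to `/v1/responses` (assuming the server implements the OpenAI Responses shape per the PR linked above, and a recent `openai` SDK):

```python
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

# The Responses API is what Cursor Agent mode and Codex CLI speak.
# A 404 / NotFoundError on the call below means the build predates /v1/responses support.
response = client.responses.create(
    model="gemma-4-26b-a4b",
    input="List three uses of a Modal Volume.",
)
print(response.output_text)
```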

## 8. Model and deployment comparison {#compare}

| Stack | Primary strength | Typical trade-off |
|---|---|---|
| Gemma-4 GGUF + llama.cpp + Modal snapshots | Portable, cost-aware interactive serving, scale-to-zero | Less deterministic `tool_choice` semantics |
| GLM-5.1 FP8 + SGLang | High-throughput multi-GPU online serving | Higher infrastructure complexity and spend |
| Llama-3 dense + vLLM | Broad ecosystem, mature infra tooling | Less parameter efficiency vs MoE at similar quality |
| OpenAI / Anthropic APIs | Zero ops, best `tool_choice` compliance | Vendor lock-in, no scale-to-zero, higher per-token cost |

## 9. When to choose this stack {#fit}

**Good fit**
- Internal APIs and IDE assistants
- Bursty workloads where scale-to-zero matters
- Interactive use cases needing fast decode
- Projects requiring Apache 2.0 licensing
- Vision + text combined workflows
- Cost-sensitive deployments

**Not ideal for**
- Strict OpenAI `tool_choice` determinism requirements
- Maximum aggregate throughput across multi-GPU fleets
- Audio input requirements (use E2B/E4B variants)
- Production systems requiring SLA guarantees

Pair this tab with [Runtime tuning](https://www.quantml.org/guides/gemma-4-gguf/configuration) for tuning and [APIs & clients](https://www.quantml.org/guides/gemma-4-gguf/features) for client contracts.

## 10. External references {#references}

_Sources cited inline on this tab. Same URLs as the inline links above._

1. [Modal: Serve and scale](https://modal.com/docs/guide/apps)
2. [Modal Volumes](https://modal.com/docs/guide/volumes)
3. [Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot)
4. [Modal: high-performance LLM inference](https://modal.com/docs/guide/high-performance-llm-inference)
5. [Modal web server](https://modal.com/docs/guide/webhooks#web-server)
6. [llama.cpp repository](https://github.com/ggml-org/llama.cpp)
7. [llama.cpp server README](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)
8. [llama.cpp Gemma parser PR #21418](https://github.com/ggml-org/llama.cpp/pull/21418)
9. [llama.cpp Responses API PR #18486](https://github.com/ggml-org/llama.cpp/pull/18486)
10. [Hugging Face GGUF bundle](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF)
11. [Gemma thinking docs](https://ai.google.dev/gemma/docs/capabilities/thinking)
12. [Prometheus introduction](https://prometheus.io/docs/introduction/overview/)
13. [OpenAI API: Create chat completion](https://platform.openai.com/docs/api-reference/chat/create)

## Related sections

- [Stack overview](https://www.quantml.org/guides/gemma-4-gguf)
- [Modal deployment](https://www.quantml.org/guides/gemma-4-gguf/deployment)
- [Runtime tuning](https://www.quantml.org/guides/gemma-4-gguf/configuration)
- [APIs & clients](https://www.quantml.org/guides/gemma-4-gguf/features)
