> **Note:** The canonical guide is the HTML version at [https://www.quantml.org/guides/gemma-4-gguf/configuration](https://www.quantml.org/guides/gemma-4-gguf/configuration). This markdown is for machine ingestion.

# Gemma 4 GGUF - Runtime tuning

This page is the operator baseline for the reference `deploy.py` stack: [`llama-server`](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) as a subprocess, L40S baseline GPU, [Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot), and exact runtime flags. Tune from these defaults before trying bespoke experiments.

## 1. Baseline constants from deploy.py

| Constant | Value | Why this baseline exists |
|---|---|---|
| `GPU` | `L40S` | 48 GB headroom supports weights + mmproj + KV + system overhead. |
| `N_PARALLEL` | `4` | Good interactive concurrency without saturating KV memory too early. |
| `PER_SLOT_CTX` | `65536` | Per-user context target before reducing slot count. |
| `TOTAL_CTX` | `N_PARALLEL * PER_SLOT_CTX` | `llama-server` divides this across parallel slots. |
| `KV_CACHE_TYPE` | `q8_0` | Reduces KV pressure compared with f16 cache. |
| `BATCH_SIZE` | `2048` | Balanced throughput without runaway memory pressure. |
| `SERVED_NAME` | `gemma-4-26b-a4b` | Stable alias for clients and IDE integrations. |
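
For orientation, the constants read roughly like this in `deploy.py`; values come from the table above, and the exact variable layout is illustrative.

```python
# Baseline constants summarized in the table (layout illustrative).
GPU = "L40S"
N_PARALLEL = 4
PER_SLOT_CTX = 65_536
TOTAL_CTX = N_PARALLEL * PER_SLOT_CTX  # 262144, passed to --ctx-size
KV_CACHE_TYPE = "q8_0"
BATCH_SIZE = 2048
SERVED_NAME = "gemma-4-26b-a4b"
```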

## 2. llama-server flag reference

Flag semantics follow [upstream llama-server documentation](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md). The `--no-mmap` default here pairs with [Modal snapshot restore](https://modal.com/docs/guide/memory-snapshot).

| Flag | Value | Reason |
|---|---|---|
| `--model` | `/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` | Exact model artifact path used in `deploy.py`. |
| `--mmproj` | `/models/mmproj-F32.gguf` | Enables multimodal image flow. |
| `--chat-template-file` | `/app/llama-server-bin/interleaved.jinja` | Keeps interleaved thinking + tool flow stable. |
| `--no-mmap` | `enabled` | Required for reliable snapshot restore behavior. |
| `--n-gpu-layers` | `999` | Offload all possible layers to GPU. |
| `--parallel` | `4` | Balanced throughput for a single replica. |
| `--ctx-size` | `262144` | Computed as TOTAL_CTX (`4 * 65536`). |
| `--batch-size` | `2048` | Maintains decode throughput while controlling memory growth. |
| `--cache-type-k/v` | `q8_0` | Reduce KV memory pressure. |
| `--flash-attn` | `on` | Newer builds require explicit value, not bare flag. |
| `--jinja` | `enabled` | Required for Gemma chat templates and tools formatting. |
| `--reasoning` | `on` | Expose reasoning traces in API responses. |
| `--reasoning-budget-message` | `"... reasoning budget reached, answering now."` | Graceful cutoff text when thinking budget is exhausted. |
| `--metrics` | `enabled` | Prometheus-style telemetry on `/metrics`. |
| `--alias` | `gemma-4-26b-a4b` | Stable external model id. |

```python
# deploy.py (cmd extract; --reasoning-budget-message is also set there, value omitted here)
[
    "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    "--mmproj", "/models/mmproj-F32.gguf",
    "--chat-template-file", "/app/llama-server-bin/interleaved.jinja",
    "--n-gpu-layers", "999",
    "--parallel", "4",
    "--ctx-size", "262144",
    "--batch-size", "2048",
    "--flash-attn", "on",
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
    "--no-mmap",
    "--metrics",
    "--jinja",
    "--reasoning", "on",
    "--alias", "gemma-4-26b-a4b",
]
```

## 3. OpenAI API parameters

Names and shapes follow the official [OpenAI Chat Completions](https://platform.openai.com/docs/api-reference/chat/create) contract; this stack documents where the server matches it and where extensions apply.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `required` | Model name: `"gemma-4-26b-a4b"` |
| `messages` | `list` | `required` | Conversation messages array |
| `stream` | `bool` | `false` | Enable streaming responses |
| `temperature` | `float` | `1.0` | Sampling temperature. Gemma 4 recommended: `1.0` |
| `top_p` | `float` | `0.95` | Nucleus sampling. Gemma 4 recommended: `0.95` |
| `max_tokens` | `int` | `model default` | Maximum tokens to generate |
| `stop` | `str \| list` | `none` | Stop sequences |
| `seed` | `int` | `none` | Reproducibility seed |
| `frequency_penalty` | `float` | `0` | Penalize frequent tokens |
| `presence_penalty` | `float` | `0` | Penalize already-used tokens |
| `tools` | `list` | `none` | Tool/function definitions |
| `tool_choice` | `str \| object` | `"auto"` | Tool selection strategy (see limitations in APIs and clients tab) |
| `response_format` | `object` | `none` | `{"type":"json_object"}` for JSON mode |
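
A minimal request using only the standard parameters above. The endpoint URL and API key are placeholders for your deployment, and the sampling values follow the Gemma 4 recommendations; later examples reuse this `client`.

```python
from openai import OpenAI

# Hypothetical endpoint and key; point these at your deployed llama-server.
client = OpenAI(base_url="https://your-deployment.example/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Summarize the KV cache settings"}],
    temperature=1.0,   # Gemma 4 recommended
    top_p=0.95,        # Gemma 4 recommended
    max_tokens=512,
)
print(response.choices[0].message.content)
```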

## 4. Extension parameters

`llama-server` extensions beyond the OpenAI spec.

> These parameters are not part of the [OpenAI Chat Completions](https://platform.openai.com/docs/api-reference/chat/create) spec. Pass them via `extra_body` in the Python SDK or directly in the JSON body for curl/HTTP clients.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `chat_template_kwargs.enable_thinking` | `bool` | `true` | Per-request thinking toggle |
| `thinking_budget_tokens` | `int` | `-1` | Max thinking tokens per request (`-1` = unlimited, `0` = disable, `N` = limit) |
| `thinking.type` | `str` | `--` | Anthropic-compatible: set to `"enabled"` |
| `thinking.budget_tokens` | `int` | `10000` | Anthropic-compatible thinking budget |
| `top_k` | `int` | `40` | Top-K sampling. Gemma 4 recommended: `64` |
| `min_p` | `float` | `0` | Minimum probability threshold |
| `repeat_penalty` | `float` | `1.0` | Repetition penalty |

```python
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve this step by step"}],
    max_tokens=2048,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,  # limit thinking to 256 tokens
        "top_k": 64,                     # Gemma 4 recommended
        "min_p": 0.05,
        "repeat_penalty": 1.1,
    },
)
```
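
The Anthropic-compatible form from the table travels the same way. A sketch assuming the same `client` as above:

```python
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Plan the migration in three steps"}],
    extra_body={
        # Anthropic-compatible thinking object (see extension table above)
        "thinking": {"type": "enabled", "budget_tokens": 10000},
    },
)
```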

## 5. Sampling presets

Recommended parameter combinations from the Gemma 4 model card.

| Preset | temperature | top_p | top_k | min_p | Use case |
|---|---:|---:|---:|---:|---|
| default | `1.0` | `0.95` | `64` | `0` | Gemma 4 recommended baseline |
| creative | `1.2` | `0.95` | `64` | `0` | Writing and brainstorming |
| balanced | `0.7` | `0.9` | `40` | `0` | General use |
| precise | `0.3` | `0.8` | `20` | `0.05` | Factual and deterministic |

The recommended defaults are `temperature=1.0, top_p=0.95, top_k=64`. They intentionally differ from typical OpenAI-style defaults (`temperature=1.0`, `top_p=1.0`, no `top_k`): the tighter nucleus cutoff and explicit `top_k` keep output close to the model's training distribution per the [published GGUF catalog](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF).
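
A sketch of applying a preset from the table, assuming the same `client` as above; standard fields go as top-level arguments, while `top_k` and `min_p` travel in `extra_body`.

```python
# Sampling presets from the table above.
PRESETS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "min_p": 0.0},
    "precise": {"temperature": 0.3, "top_p": 0.80, "top_k": 20, "min_p": 0.05},
}

preset = PRESETS["precise"]
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "State the per-slot context at --parallel 4"}],
    temperature=preset["temperature"],
    top_p=preset["top_p"],
    extra_body={"top_k": preset["top_k"], "min_p": preset["min_p"]},
)
```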

## 6. Thinking configuration

Per-request control of chain-of-thought reasoning with budget limits.

| enable_thinking | Template behavior | Result |
|---|---|---|
| `True` | Injects `<\|think\|>` in system turn | Model *may* produce `reasoning_content` (adaptive) |
| `False` | Prepends empty thinking block to model turn | Thinking **reliably suppressed** |
| `Not set` | Server default (thinking ON via interleaved template) | Same as `True` |

> **Adaptive reasoning:** Setting `enable_thinking: True` does not *guarantee* thinking output. Gemma 4 decides adaptively whether reasoning is needed - it may skip thinking for trivial questions (see [Google's thinking capabilities doc](https://ai.google.dev/gemma/docs/capabilities/thinking)). Setting `enable_thinking: False` is deterministic and always suppresses thinking.
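
When thinking must be off for a request, the deterministic path is `enable_thinking: False`. A sketch assuming the same `client` as above:

```python
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={
        # Deterministically suppresses thinking (see table above);
        # reasoning_content should be absent from the response.
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
```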

| thinking_budget_tokens | Behavior |
|---|---|
| `Not set / -1` | Unlimited thinking (model decides when to stop) |
| `0` | Immediately end thinking (similar to `enable_thinking: False`) |
| `N > 0` | Think for at most N tokens, then forced to answer |

Verified budget test results (`prompt: "What is 23 * 47?"`):

| Budget | Reasoning chars | Tokens | Behavior |
|---|---:|---:|---|
| `0` | `44` | `26` | Only budget message, no actual thinking |
| `32` | `124` | `59` | Brief thinking + budget cutoff |
| `128` | `359` | `155` | Moderate thinking + budget cutoff |
| `unlimited` | `1004` | `522` | Full unconstrained thinking |

> **Build patch required:** The Gemma 4 dedicated parser in [llama.cpp (PR #21418)](https://github.com/ggml-org/llama.cpp/pull/21418) omits `thinking_start_tag` / `thinking_end_tag` that the reasoning budget sampler needs ([reasoning budget wiring](https://github.com/ggml-org/llama.cpp/pull/20297)). The `build_llama_server()` function patches `common/chat.cpp` to add `<|channel>thought\n` and `<channel|>` as thinking tags before compilation. Without this patch, `thinking_budget_tokens` is silently ignored.

## 7. Response format

The non-streaming response follows the Chat Completions shape, with `reasoning_content` added alongside `content`:

```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "gemma-4-26b-a4b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think about this...",
      "tool_calls": null
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```

Streaming responses emit `reasoning_content` and `content` as separate deltas:

```json
{"choices":[{"delta":{"reasoning_content":"Let me"},"index":0}]}
{"choices":[{"delta":{"reasoning_content":" think..."},"index":0}]}
{"choices":[{"delta":{"content":"The answer"},"index":0}]}
{"choices":[{"delta":{"content":" is 42."},"index":0}]}
{"choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}
```

| Field | Standard | Description |
|---|---|---|
| `message.content` | Yes | Response text |
| `message.tool_calls` | Yes | Tool/function calls |
| `message.reasoning_content` | Extension | Thinking/reasoning text (server behavior documented in the [llama.cpp server README](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md); field shape follows common OpenAI-compatible extensions) |
| `delta.reasoning_content` | Extension | Streaming thinking tokens |
| `usage` | Yes | Token counts |

Reading `reasoning_content` in Python:

```python
# Non-streaming
raw = response.choices[0].message.model_dump()
thinking = raw.get("reasoning_content") or ""

# Streaming
raw_delta = chunk.choices[0].delta.model_dump()
thinking = raw_delta.get("reasoning_content") or ""
```

> Each streaming chunk contains either `reasoning_content` OR `content`, never both. Thinking chunks always precede response chunks.
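
Putting the two together, a streaming sketch that collects thinking and answer separately, assuming the same `client` as above:

```python
stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Explain why q8_0 KV cache saves memory"}],
    stream=True,
)

thinking_parts, answer_parts = [], []
for chunk in stream:
    delta = chunk.choices[0].delta.model_dump()
    # Each chunk carries either reasoning_content or content, never both.
    if delta.get("reasoning_content"):
        thinking_parts.append(delta["reasoning_content"])
    elif delta.get("content"):
        answer_parts.append(delta["content"])

print("".join(answer_parts))
```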

## 8. VRAM and concurrency trade-offs

- At fixed `ctx-size`, increasing `--parallel` multiplies active KV cache pressure; see [`llama-server`](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) for how total context is divided across slots.
- If you hit memory pressure, keep the `q8_0` cache and reduce per-slot context first; reduce slot count only after that.
- When throughput is stable but tail latency climbs, tune batch size and slot count together; changing one in isolation is usually misleading.
- Reference footprint is about 25 GB active usage, leaving practical safety margin on L40S 48 GB for workload spikes.

| KV cache format | VRAM per 16K ctx | Quality impact |
|---|---:|---|
| `f16 (default)` | `~6 GB` | Baseline |
| `q8_0 (recommended)` | `~3 GB` | Negligible degradation |
| `q4_0` | `~1.5 GB` | Minor degradation |

Important: `--ctx-size` is the total context and is divided across `--parallel` slots:

```text
--ctx-size 65536  --parallel 4  ->  16,384 tokens per slot
```

If you want 16K per slot with 4 slots, you must set `--ctx-size` to `4 * 16384 = 65536`.
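
A trivial helper mirroring that arithmetic (the function is illustrative, not part of `deploy.py`):

```python
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    # llama-server divides the total --ctx-size evenly across --parallel slots
    return ctx_size // parallel

print(per_slot_ctx(65_536, 4))    # 16384 tokens per slot
print(per_slot_ctx(262_144, 4))   # 65536 tokens per slot (deploy.py baseline)
```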

## 9. Modal runtime knobs

Snapshot and web-serving primitives are described in [Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot) and [Modal web servers](https://modal.com/docs/guide/webhooks#web-server); volumes for cached binaries live in [Modal Volumes](https://modal.com/docs/guide/volumes).

| Parameter | Value | Reason |
|---|---|---|
| `gpu` | `L40S` | Reference deployment baseline in `deploy.py`. |
| `enable_memory_snapshot` | `true` | Captures warmed state for faster idle restores. |
| `enable_gpu_snapshot` | `true` | Enables GPU-aware snapshot flow on Modal. |
| `@modal.enter(snap=True)` | `startup()` | Snapshot boundary after health + warmup completes. |
| `@modal.enter(snap=False)` | `restore()` | Post-restore health check before serving traffic. |
| `max_containers` | `3` | Burst handling with spend guardrail. |
| `scaledown_window` | `300s` | Reduces idle spend while still smoothing short traffic gaps. |
| `@modal.concurrent` | `max_inputs=10` | Keeps container-level queueing controlled before server saturation. |
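
A minimal sketch of how these knobs map onto Modal decorators. The app and class names are illustrative, and the GPU-snapshot option is shown in its `experimental_options` form, which may vary by Modal SDK version.

```python
import modal

app = modal.App("gemma-4-gguf")  # hypothetical app name

@app.cls(
    gpu="L40S",
    enable_memory_snapshot=True,   # capture warmed state for faster idle restores
    experimental_options={"enable_gpu_snapshot": True},  # GPU-aware snapshots (option form may vary)
    max_containers=3,
    scaledown_window=300,          # seconds
)
@modal.concurrent(max_inputs=10)   # container-level queueing cap
class GemmaServer:
    @modal.enter(snap=True)
    def startup(self):
        # launch llama-server, wait for health + warmup; Modal snapshots after this returns
        ...

    @modal.enter(snap=False)
    def restore(self):
        # post-restore health check before serving traffic
        ...
```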

## 10. Security and environment

- Inject `API_KEY` using [Modal secrets](https://modal.com/docs/guide/secrets), never hardcode in code or Docker layer.
- Store `HF_TOKEN` in `huggingface-secret` for artifact pulls.
- Use `huggingface-hub[hf_xet]` + `HF_XET_HIGH_PERFORMANCE=1` for large-file download performance.
- Set `HF_HUB_OFFLINE=1` for serving paths that should never redownload artifacts at runtime; a wiring sketch follows this list.
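
A minimal sketch of that wiring with the Modal image and secret APIs; the image contents and the secret name holding `API_KEY` are assumptions.

```python
import modal

# Download tooling and env flags from the bullets above; image contents are illustrative.
image = (
    modal.Image.debian_slim()
    .pip_install("huggingface-hub[hf_xet]")
    .env({"HF_XET_HIGH_PERFORMANCE": "1"})   # faster large-file pulls
)

secrets = [
    modal.Secret.from_name("huggingface-secret"),  # provides HF_TOKEN for artifact pulls
    modal.Secret.from_name("gemma-api-key"),       # hypothetical secret exposing API_KEY
]

# For serving-only paths, pin offline mode so the container never redownloads artifacts.
serve_env = {"HF_HUB_OFFLINE": "1"}
```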

## 11. Throughput and tuning loop

Treat these as directional checkpoints; always validate on your own prompt mix and concurrency profile.

| Profile | Token/s | TTFT |
|---|---|---|
| Warm single-user (L40S baseline) | Measure on your prompts | Measure per release |
| 4 parallel slots | Track aggregate tokens/s | Watch p95/p99 TTFT |
| Post-snapshot restore | N/A | `~1s` ready state |

1. Establish warm-path baseline on fixed prompt set.
2. Tune KV cache + context, then retest.
3. Tune parallel slots next; compare queueing and p95 latency.
4. Inspect `/metrics` for token counters and error spikes before promoting (see the sketch below).
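
A quick sketch for step 4. The base URL is a placeholder for your deployment, and exact metric names depend on the server build.

```python
import requests

BASE_URL = "https://your-deployment.example"  # hypothetical endpoint

# Fetch the Prometheus-style text exposition enabled by --metrics and
# surface token counters and anything that looks like an error counter.
body = requests.get(f"{BASE_URL}/metrics", timeout=10).text
for line in body.splitlines():
    if not line.startswith("#") and ("tokens" in line or "error" in line.lower()):
        print(line)
```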

## 12. Cost and release practices

- Scale-to-zero plus [memory snapshots](https://modal.com/docs/guide/memory-snapshot) is the default for bursty internal traffic.
- Set always-on replicas only when first-token SLO is stricter than idle wake tolerance.
- Keep staging and production apps separate for binary upgrades and snapshot refreshes.
- Deep deployment internals are documented in [Modal deployment](https://www.quantml.org/guides/gemma-4-gguf/deployment).

## 13. External references

Same URLs as the inline citations above:

- [OpenAI API: Create chat completion](https://platform.openai.com/docs/api-reference/chat/create)
- [llama.cpp server README](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)
- [llama.cpp GitHub](https://github.com/ggml-org/llama.cpp)
- [Modal Volumes](https://modal.com/docs/guide/volumes)
- [Modal memory snapshots](https://modal.com/docs/guide/memory-snapshot)
- [Modal: Secrets](https://modal.com/docs/guide/secrets)
- [Modal Web Server](https://modal.com/docs/guide/webhooks#web-server)
- [Gemma 4 Thinking Capabilities](https://ai.google.dev/gemma/docs/capabilities/thinking)
- [Gemma 4 Prompt Formatting Guide](https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4)
- [HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF)
- [llama.cpp Gemma 4 Parser (PR #21418)](https://github.com/ggml-org/llama.cpp/pull/21418)
- [llama.cpp Reasoning Budget (PR #20297)](https://github.com/ggml-org/llama.cpp/pull/20297)

## Related sections

- [Stack overview](https://www.quantml.org/guides/gemma-4-gguf)
- [Modal deployment](https://www.quantml.org/guides/gemma-4-gguf/deployment)
- [APIs and clients](https://www.quantml.org/guides/gemma-4-gguf/features)
- [Operate and compare](https://www.quantml.org/guides/gemma-4-gguf/operations)
