Gemma-4-26B-A4B-it-GGUF on Modal
Runtime tuning
llama.cpp flags, OpenAI and extension API parameters, sampling, thinking budgets, and Modal scaling knobs.
This page is the operator baseline for the reference deploy.py stack: llama-server as a subprocess, L40S baseline GPU, Modal memory snapshots, and exact runtime flags. Tune from these defaults before trying bespoke experiments.
Baseline constants from deploy.py
| Constant | Value | Why this baseline exists |
|---|---|---|
| GPU | L40S | 48 GB headroom supports weights + mmproj + KV + system overhead. |
| N_PARALLEL | 4 | Good interactive concurrency without saturating KV memory too early. |
| PER_SLOT_CTX | 65536 | Per-user context target before reducing slot count. |
| TOTAL_CTX | N_PARALLEL * PER_SLOT_CTX = 262144 | llama-server divides this across parallel slots. |
| KV_CACHE_TYPE | q8_0 | Reduces KV pressure compared with f16 cache. |
| BATCH_SIZE | 2048 | Balanced throughput without runaway memory pressure. |
| SERVED_NAME | gemma-4-26b-a4b | Stable alias for clients and IDE integrations. |
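Expressed as Python, the constants compose like this (a sketch mirroring the table above, not the verbatim deploy.py):

```python
# Baseline constants from the table above (sketch; deploy.py may differ in layout).
GPU = "L40S"                            # 48 GB VRAM baseline
N_PARALLEL = 4                          # parallel server slots
PER_SLOT_CTX = 65_536                   # context tokens per slot
TOTAL_CTX = N_PARALLEL * PER_SLOT_CTX   # 262_144; llama-server divides this across slots
KV_CACHE_TYPE = "q8_0"                  # quantized KV cache
BATCH_SIZE = 2048
SERVED_NAME = "gemma-4-26b-a4b"         # stable alias clients see
```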
llama-server flag reference
Flag semantics follow upstream llama-server documentation. The --no-mmap default here pairs with Modal snapshot restore.
| Flag | Value | Reason |
|---|---|---|
| --model | /models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf | Exact model artifact path used in deploy.py. |
| --mmproj | /models/mmproj-F32.gguf | Enables multimodal image flow. |
| --chat-template-file | /app/llama-server-bin/interleaved.jinja | Keeps interleaved thinking + tool flow stable. |
| --no-mmap | enabled | Required for reliable snapshot restore behavior. |
| --n-gpu-layers | 999 | Offload all possible layers to GPU. |
| --parallel | 4 | Balanced throughput for a single replica. |
| --ctx-size | 262144 | Computed as TOTAL_CTX (4 × 65536). |
| --batch-size | 2048 | Maintains decode throughput while controlling memory growth. |
| --cache-type-k/v | q8_0 | Reduces KV memory pressure. |
| --flash-attn | on | Newer builds require explicit value, not bare flag. |
| --jinja | enabled | Required for Gemma chat templates and tools formatting. |
| --reasoning | on | Expose reasoning traces in API responses. |
| --reasoning-budget-message | "... reasoning budget reached, answering now." | Graceful cutoff text when thinking budget is exhausted. |
| --metrics | enabled | Prometheus-style telemetry on /metrics. |
| --alias | gemma-4-26b-a4b | Stable external model id. |
1[2 "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",3 "--mmproj", "/models/mmproj-F32.gguf",4 "--chat-template-file", "/app/llama-server-bin/interleaved.jinja",5 "--parallel", "4",6 "--ctx-size", "262144",7 "--flash-attn", "on",8 "--cache-type-k", "q8_0",9 "--cache-type-v", "q8_0",10 "--no-mmap",11 "--metrics",12 "--jinja",13 "--reasoning", "on"14]
OpenAI API parameters
Standard parameters fully supported by the endpoint.
Names and shapes follow the official OpenAI Chat Completions contract; this stack documents where the server matches it and where extensions apply.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | Model name: "gemma-4-26b-a4b" |
| messages | list | required | Conversation messages array |
| stream | bool | false | Enable streaming responses |
| temperature | float | 1.0 | Sampling temperature. Gemma 4 recommended: 1.0 |
| top_p | float | 0.95 | Nucleus sampling. Gemma 4 recommended: 0.95 |
| max_tokens | int | model default | Maximum tokens to generate |
| stop | str|list | none | Stop sequences |
| seed | int | none | Reproducibility seed |
| frequency_penalty | float | 0 | Penalize frequent tokens |
| presence_penalty | float | 0 | Penalize already-used tokens |
| tools | list | none | Tool/function definitions |
| tool_choice | str|object | "auto" | Tool selection strategy (see limitations in Features tab) |
| response_format | object | none | {"type":"json_object"} for JSON mode |
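A minimal request sketch with the OpenAI Python SDK; the base_url and api_key values are placeholders for your Modal deployment, not real endpoints:

```python
from openai import OpenAI

# Placeholders: point base_url at your Modal deployment's /v1 root
# and supply the API key injected via Modal secrets.
client = OpenAI(
    base_url="https://<your-modal-app>.modal.run/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",  # SERVED_NAME alias from the baseline table
    messages=[{"role": "user", "content": "Explain KV cache quantization briefly."}],
    temperature=1.0,  # Gemma 4 recommended
    top_p=0.95,       # Gemma 4 recommended
    max_tokens=512,
)
print(response.choices[0].message.content)
```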
Extension parameters
llama-server extensions beyond the OpenAI spec. Pass via extra_body.
These parameters are not part of the OpenAI Chat Completions spec. Pass them via extra_body in the Python SDK or directly in the JSON body for curl/HTTP clients.
| Parameter | Type | Default | Description |
|---|---|---|---|
| chat_template_kwargs.enable_thinking | bool | true | Per-request thinking toggle |
| thinking_budget_tokens | int | -1 | Max thinking tokens per request (-1 = unlimited, 0 = disable, N = limit) |
| thinking.type | str | -- | Anthropic-compatible: set to "enabled" |
| thinking.budget_tokens | int | 10000 | Anthropic-compatible thinking budget |
| top_k | int | 40 | Top-K sampling. Gemma 4 recommended: 64 |
| min_p | float | 0 | Minimum probability threshold |
| repeat_penalty | float | 1.0 | Repetition penalty |
```python
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve this step by step"}],
    max_tokens=2048,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,  # limit thinking to 256 tokens
        "top_k": 64,  # Gemma 4 recommended
        "min_p": 0.05,
        "repeat_penalty": 1.1,
    },
)
```
Sampling presets
Recommended parameter combinations from the Gemma 4 model card.
| Preset | temperature | top_p | top_k | min_p | Use case |
|---|---|---|---|---|---|
| default | 1.0 | 0.95 | 64 | 0 | Gemma 4 recommended baseline |
| creative | 1.2 | 0.95 | 64 | 0 | Writing & brainstorming |
| balanced | 0.7 | 0.9 | 40 | 0 | General use |
| precise | 0.3 | 0.8 | 20 | 0.05 | Factual & deterministic |
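A sketch of the presets as reusable request kwargs (names mirror the table; note that top_k and min_p are extensions and must travel via extra_body):

```python
# Preset values copied from the table above; the dict shape is illustrative.
SAMPLING_PRESETS = {
    "default":  {"temperature": 1.0, "top_p": 0.95, "extra_body": {"top_k": 64, "min_p": 0}},
    "creative": {"temperature": 1.2, "top_p": 0.95, "extra_body": {"top_k": 64, "min_p": 0}},
    "balanced": {"temperature": 0.7, "top_p": 0.9,  "extra_body": {"top_k": 40, "min_p": 0}},
    "precise":  {"temperature": 0.3, "top_p": 0.8,  "extra_body": {"top_k": 20, "min_p": 0.05}},
}

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "List the flags that affect KV memory."}],
    **SAMPLING_PRESETS["precise"],
)
```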
From the Gemma 4 model card
The recommended default settings are temperature=1.0, top_p=0.95, top_k=64. These intentionally differ from typical OpenAI defaults (top_p=1.0 and no top_k). Using these values produces output that matches the model's training distribution per the published GGUF catalog.
Thinking configuration
Per-request control of chain-of-thought reasoning with budget limits.
| enable_thinking | Template behavior | Result |
|---|---|---|
| True | Injects <|think|> in system turn | Model may produce reasoning_content (adaptive) |
| False | Prepends empty thinking block to model turn | Thinking reliably suppressed |
| Not set | Server default (thinking ON via interleaved template) | Same as True |
Adaptive reasoning: Setting enable_thinking: True does not guarantee thinking output. Gemma 4 decides adaptively whether reasoning is needed—it may skip thinking for trivial questions (see Google's thinking capabilities doc). Setting enable_thinking: False is deterministic and always suppresses thinking.
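A sketch of the deterministic off switch, using the same client as above:

```python
# Deterministically suppress thinking for a latency-sensitive call.
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
# With thinking suppressed, reasoning_content should be absent or empty.
```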
| thinking_budget_tokens | Behavior |
|---|---|
| Not set / -1 | Unlimited thinking (model decides when to stop) |
| 0 | Immediately end thinking (similar to enable_thinking: False) |
| N > 0 | Think for at most N tokens, then forced to answer |
Verified budget test results (prompt: "What is 23 * 47?")
| Budget | Reasoning chars | Tokens | Behavior |
|---|---|---|---|
| 0 | 44 | 26 | Only budget message, no actual thinking |
| 32 | 124 | 59 | Brief thinking + budget cutoff |
| 128 | 359 | 155 | Moderate thinking + budget cutoff |
| unlimited | 1004 | 522 | Full unconstrained thinking |
Build patch required
The dedicated Gemma 4 parser in llama.cpp (PR #21418) omits the thinking_start_tag / thinking_end_tag fields that the reasoning budget sampler needs (see the reasoning budget wiring in PR #20297). The build_llama_server() function patches common/chat.cpp to add <|channel>thought\n and <channel|> as thinking tags before compilation. Without this patch, thinking_budget_tokens is silently ignored.
Response format
Understanding the response structure including reasoning_content.
```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "gemma-4-26b-a4b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think about this...",
      "tool_calls": null
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```
1{"choices":[{"delta":{"reasoning_content":"Let me"},"index":0}]}2{"choices":[{"delta":{"reasoning_content":" think..."},"index":0}]}3{"choices":[{"delta":{"content":"The answer"},"index":0}]}4{"choices":[{"delta":{"content":" is 42."},"index":0}]}5{"choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}
| Field | Standard | Description |
|---|---|---|
| message.content | Yes | Response text |
| message.tool_calls | Yes | Tool/function calls |
| message.reasoning_content | Extension | Thinking/reasoning text (server behavior documented in the llama.cpp server README; field shape follows common OpenAI-compatible extensions) |
| delta.reasoning_content | Extension | Streaming thinking tokens |
| usage | Yes | Token counts |
Reading reasoning_content in Python
The OpenAI SDK doesn't have a typed reasoning_content field. Use model_dump():
```python
# Non-streaming
raw = response.choices[0].message.model_dump()
thinking = raw.get("reasoning_content") or ""

# Streaming
raw_delta = chunk.choices[0].delta.model_dump()
thinking = raw_delta.get("reasoning_content") or ""
```

Each streaming chunk contains either reasoning_content or content, never both. Thinking chunks always precede response chunks.
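A minimal streaming consumer sketch built on that contract (same client as above; the buffers and the empty-choices guard are illustrative):

```python
thinking_parts: list[str] = []
answer_parts: list[str] = []

stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is 23 * 47?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # e.g. a trailing usage-only chunk
        continue
    delta = chunk.choices[0].delta.model_dump()
    # Each chunk carries reasoning_content OR content, never both.
    if delta.get("reasoning_content"):
        thinking_parts.append(delta["reasoning_content"])
    elif delta.get("content"):
        answer_parts.append(delta["content"])

print("thinking:", "".join(thinking_parts))
print("answer:", "".join(answer_parts))
```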
VRAM and concurrency trade-offs
- At fixed `--ctx-size`, increasing `--parallel` multiplies active KV cache pressure; see the llama-server README for how total context is divided across slots.
- If you hit memory pressure, first keep the `q8_0` cache and reduce per-slot context before reducing slot count.
- When throughput is stable but tail latency climbs, tune batch size and slot count together; changing one in isolation usually misleads.
- Reference footprint is ~25 GB of active usage, leaving a practical safety margin on the L40S's 48 GB for workload spikes.
| KV cache format | VRAM per 16K ctx | Quality impact |
|---|---|---|
| f16 (default) | ~6 GB | Baseline |
| q8_0 (recommended) | ~3 GB | Negligible degradation |
| q4_0 | ~1.5 GB | Minor degradation |
Important: ctx-size / parallel interaction
llama-server divides --ctx-size by --parallel:
--ctx-size 65536 --parallel 4 → 16,384 tokens per slot
If you want 16K per slot with 4 slots, you must set --ctx-size to 4 × 16384 = 65536.
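A quick sanity check of that division in Python (a sketch; the helper name is illustrative):

```python
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    """llama-server splits --ctx-size evenly across --parallel slots."""
    return ctx_size // parallel

# The example from the callout: 65536 total context over 4 slots.
assert per_slot_ctx(65_536, 4) == 16_384

# To size the total from a per-slot target, multiply the other way.
target_per_slot, slots = 16_384, 4
print("--ctx-size", target_per_slot * slots)  # 65536
```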
Modal runtime knobs
Snapshot and web-serving primitives are described in Modal memory snapshots and Modal web servers; volumes for cached binaries live in Modal Volumes.
| Parameter | Value | Reason |
|---|---|---|
| gpu | L40S | Reference deployment baseline in deploy.py. |
| enable_memory_snapshot | true | Captures warmed state for faster idle restores. |
| enable_gpu_snapshot | true | Enables GPU-aware snapshot flow on Modal. |
| @modal.enter(snap=True) | startup() | Snapshot boundary after health + warmup completes. |
| @modal.enter(snap=False) | restore() | Post-restore health check before serving traffic. |
| max_containers | 3 | Burst handling with spend guardrail. |
| scaledown_window | 300s | Reduces idle spend while still smoothing short traffic gaps. |
| @modal.concurrent | max_inputs=10 | Keeps container-level queueing controlled before server saturation. |
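A skeletal sketch of how these knobs attach in Modal's class-based API. Structure only: bodies are elided, the app and class names are made up, and GPU snapshotting has shipped behind experimental options in some Modal releases, so verify the exact spelling against current Modal docs:

```python
import modal

app = modal.App("gemma-4-26b-a4b")  # hypothetical app name

@app.cls(
    gpu="L40S",
    enable_memory_snapshot=True,
    # GPU snapshot flag: beta feature, spelling has varied across Modal releases.
    experimental_options={"enable_gpu_snapshot": True},
    max_containers=3,
    scaledown_window=300,
)
@modal.concurrent(max_inputs=10)
class LlamaServer:
    @modal.enter(snap=True)
    def startup(self):
        # Launch llama-server, poll /health, run warmup; snapshot is taken after this.
        ...

    @modal.enter(snap=False)
    def restore(self):
        # Runs after snapshot restore: re-verify health before serving traffic.
        ...

    @modal.web_server(8080)
    def serve(self):
        # llama-server is already listening (started in startup); Modal proxies to it.
        ...
```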
Security and environment
- Inject `API_KEY` via Modal secrets; never hardcode it in code or a Docker layer.
- Store `HF_TOKEN` in the `huggingface-secret` secret for artifact pulls.
- Use `huggingface-hub[hf_xet]` with `HF_XET_HIGH_PERFORMANCE=1` for large-file download performance.
- Set `HF_HUB_OFFLINE=1` on serving paths that should never redownload artifacts at runtime.
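A sketch of attaching those secrets to a Modal function (reusing the app object from the skeleton above; the `gemma-api-key` secret name is hypothetical):

```python
import modal

secrets = [
    modal.Secret.from_name("huggingface-secret"),  # provides HF_TOKEN
    modal.Secret.from_name("gemma-api-key"),       # hypothetical name; provides API_KEY
]

@app.function(secrets=secrets)
def pull_artifacts():
    import os
    # Secrets surface as environment variables only inside the container,
    # so neither token ever lands in the image or the repo.
    assert os.environ["HF_TOKEN"]
```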
Throughput and tuning loop
Treat these as directional checkpoints; always validate on your own prompt mix and concurrency profile.
| Profile | Token/s | TTFT |
|---|---|---|
| Warm single-user (L40S baseline) | Measure on your prompts | Measure per release |
| 4 parallel slots | Track aggregate tokens/s | Watch p95/p99 TTFT |
| Post-snapshot restore | N/A | ~1s ready state |
- Establish a warm-path baseline on a fixed prompt set.
- Tune KV cache + context, then retest.
- Tune parallel slots next; compare queueing and p95 latency.
- Inspect `/metrics` for token counters and error spikes before promoting (see the polling sketch below).
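A minimal polling sketch for that step (assumes local access to the server's default port 8080; the `llamacpp:` metric prefix is an assumption to adjust per build):

```python
import urllib.request

def dump_llama_metrics(base_url: str = "http://127.0.0.1:8080") -> None:
    """Print llama-server's Prometheus-style counters (enabled by --metrics)."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            # Skip comment/help lines; keep llamacpp-prefixed samples.
            if line and not line.startswith("#") and line.startswith("llamacpp:"):
                print(line)
```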
Cost and release practices
- Scale-to-zero plus memory snapshots is the default for bursty internal traffic.
- Set always-on replicas only when first-token SLO is stricter than idle wake tolerance.
- Keep staging and production apps separate for binary upgrades and snapshot refreshes.
- Deep deployment internals are documented in Modal deployment.
External references
Sources cited inline on this tab.
- OpenAI API: Create chat completion (platform.openai.com)
- llama.cpp server README (github.com)
- llama.cpp GitHub (github.com)
- Modal Volumes (modal.com)
- Modal memory snapshots (modal.com)
- Modal: Secrets (modal.com)
- Modal Web Server (modal.com)
- Gemma 4 Thinking Capabilities (ai.google.dev)
- Gemma 4 Prompt Formatting Guide (ai.google.dev)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)
- llama.cpp Gemma 4 Parser, PR #21418 (github.com)
- llama.cpp Reasoning Budget, PR #20297 (github.com)