Gemma-4-26B-A4B-it-GGUF on Modal

Runtime tuning

llama.cpp flags, OpenAI and extension API parameters, sampling, thinking budgets, and Modal scaling knobs.

This page is the operator baseline for the reference deploy.py stack: llama-server as a subprocess, L40S baseline GPU, Modal memory snapshots, and exact runtime flags. Tune from these defaults before trying bespoke experiments.

01

Baseline constants from deploy.py

| Constant | Value | Why this baseline exists |
| --- | --- | --- |
| GPU | L40S | 48 GB headroom supports weights + mmproj + KV + system overhead. |
| N_PARALLEL | 4 | Good interactive concurrency without saturating KV memory too early. |
| PER_SLOT_CTX | 65536 | Per-user context target before reducing slot count. |
| TOTAL_CTX | N_PARALLEL * PER_SLOT_CTX | llama-server divides this across parallel slots. |
| KV_CACHE_TYPE | q8_0 | Reduces KV pressure compared with an f16 cache. |
| BATCH_SIZE | 2048 | Balanced throughput without runaway memory pressure. |
| SERVED_NAME | gemma-4-26b-a4b | Stable alias for clients and IDE integrations. |
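
As a quick reference, the same constants in Python, mirroring the deploy.py values in the table above:

```python
# Baseline constants mirrored from deploy.py (values from the table above).
GPU = "L40S"
N_PARALLEL = 4
PER_SLOT_CTX = 65536
TOTAL_CTX = N_PARALLEL * PER_SLOT_CTX  # 262144, split across parallel slots by llama-server
KV_CACHE_TYPE = "q8_0"
BATCH_SIZE = 2048
SERVED_NAME = "gemma-4-26b-a4b"
```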

02

llama-server flag reference

Flag semantics follow upstream llama-server documentation. The --no-mmap default here pairs with Modal snapshot restore.

| Flag | Value | Reason |
| --- | --- | --- |
| --model | /models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf | Exact model artifact path used in deploy.py. |
| --mmproj | /models/mmproj-F32.gguf | Enables the multimodal image flow. |
| --chat-template-file | /app/llama-server-bin/interleaved.jinja | Keeps interleaved thinking + tool flow stable. |
| --no-mmap | enabled | Required for reliable snapshot restore behavior. |
| --n-gpu-layers | 999 | Offloads all possible layers to GPU. |
| --parallel | 4 | Balanced throughput for a single replica. |
| --ctx-size | 262144 | Computed as TOTAL_CTX (4 × 65536). |
| --batch-size | 2048 | Maintains decode throughput while controlling memory growth. |
| --cache-type-k/v | q8_0 | Reduces KV memory pressure. |
| --flash-attn | on | Newer builds require an explicit value, not a bare flag. |
| --jinja | enabled | Required for Gemma chat templates and tool formatting. |
| --reasoning | on | Exposes reasoning traces in API responses. |
| --reasoning-budget-message | "... reasoning budget reached, answering now." | Graceful cutoff text when the thinking budget is exhausted. |
| --metrics | enabled | Prometheus-style telemetry on /metrics. |
| --alias | gemma-4-26b-a4b | Stable external model id. |

deploy.py (cmd extract)
```python
[
    "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    "--mmproj", "/models/mmproj-F32.gguf",
    "--chat-template-file", "/app/llama-server-bin/interleaved.jinja",
    "--parallel", "4",
    "--ctx-size", "262144",
    "--flash-attn", "on",
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
    "--no-mmap",
    "--metrics",
    "--jinja",
    "--reasoning", "on",
]
```
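
The reference stack runs this command line as a llama-server subprocess. A minimal launch-and-wait sketch, assuming the binary name is on PATH and the server listens on llama-server's default port 8080 (flags abbreviated; the full list is the extract above):

```python
import subprocess
import time
import urllib.request

# Abbreviated flag list; see the deploy.py extract above for the full set.
FLAGS = [
    "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    "--parallel", "4",
    "--ctx-size", "262144",
    "--no-mmap",
    "--metrics",
]

# Binary name and port are assumptions for this sketch.
server = subprocess.Popen(["llama-server", *FLAGS])

# Block until /health responds before routing traffic to this replica.
for _ in range(120):
    try:
        urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=1)
        break
    except OSError:
        time.sleep(1)
```
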
03

OpenAI API parameters

Standard parameters fully supported by the endpoint.

Names and shapes follow the official OpenAI Chat Completions contract; this stack documents where the server matches it and where extensions apply.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | required | Model name: "gemma-4-26b-a4b" |
| messages | list | required | Conversation messages array |
| stream | bool | false | Enable streaming responses |
| temperature | float | 1.0 | Sampling temperature. Gemma 4 recommended: 1.0 |
| top_p | float | 0.95 | Nucleus sampling. Gemma 4 recommended: 0.95 |
| max_tokens | int | model default | Maximum tokens to generate |
| stop | str \| list | none | Stop sequences |
| seed | int | none | Reproducibility seed |
| frequency_penalty | float | 0 | Penalize frequent tokens |
| presence_penalty | float | 0 | Penalize already-used tokens |
| tools | list | none | Tool/function definitions |
| tool_choice | str \| object | "auto" | Tool selection strategy (see limitations in the Features tab) |
| response_format | object | none | {"type": "json_object"} for JSON mode |
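
A minimal request using only standard parameters, as a sketch; the base URL is a placeholder for your Modal endpoint, and the key comes from the API_KEY secret:

```python
from openai import OpenAI

# Placeholder endpoint; substitute your Modal deployment URL and API key.
client = OpenAI(
    base_url="https://your-workspace--gemma.modal.run/v1",
    api_key="sk-placeholder",
)

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",  # SERVED_NAME alias from deploy.py
    messages=[{"role": "user", "content": "Explain KV cache quantization briefly."}],
    temperature=1.0,   # Gemma 4 recommended
    top_p=0.95,        # Gemma 4 recommended
    max_tokens=1024,
    seed=42,           # reproducible sampling
)
print(response.choices[0].message.content)
```
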
04

Extension parameters

llama-server extensions beyond the OpenAI spec. Pass via extra_body.

These parameters are not part of the OpenAI Chat Completions spec. Pass them via extra_body in the Python SDK or directly in the JSON body for curl/HTTP clients.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chat_template_kwargs.enable_thinking | bool | true | Per-request thinking toggle |
| thinking_budget_tokens | int | -1 | Max thinking tokens per request (-1 = unlimited, 0 = disable, N = limit) |
| thinking.type | str | -- | Anthropic-compatible: set to "enabled" |
| thinking.budget_tokens | int | 10000 | Anthropic-compatible thinking budget |
| top_k | int | 40 | Top-K sampling. Gemma 4 recommended: 64 |
| min_p | float | 0 | Minimum probability threshold |
| repeat_penalty | float | 1.0 | Repetition penalty |

extension-params.py
```python
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve this step by step"}],
    max_tokens=2048,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,  # limit thinking to 256 tokens
        "top_k": 64,                    # Gemma 4 recommended
        "min_p": 0.05,
        "repeat_penalty": 1.1,
    },
)
```

05

Sampling presets

Recommended parameter combinations from the Gemma 4 model card.

| Preset | temperature | top_p | top_k | min_p | Use case |
| --- | --- | --- | --- | --- | --- |
| default | 1.0 | 0.95 | 64 | 0 | Gemma 4 recommended baseline |
| creative | 1.2 | 0.95 | 64 | 0 | Writing & brainstorming |
| balanced | 0.7 | 0.9 | 40 | 0 | General use |
| precise | 0.3 | 0.8 | 20 | 0.05 | Factual & deterministic |

From the Gemma 4 model card

The recommended default settings are temperature=1.0, top_p=0.95, top_k=64. These intentionally differ from typical OpenAI-style defaults (top_p=1.0, no top_k). Using these values produces output that matches the model's training distribution per the published GGUF catalog.
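
One way to package these presets on the client side (a sketch; the helper name and the split between standard kwargs and extra_body are illustrative):

```python
# Presets from the table above. temperature/top_p are standard OpenAI fields;
# top_k and min_p are llama-server extensions and must travel in extra_body.
PRESETS = {
    "default":  {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "min_p": 0.0},
    "creative": {"temperature": 1.2, "top_p": 0.95, "top_k": 64, "min_p": 0.0},
    "balanced": {"temperature": 0.7, "top_p": 0.9,  "top_k": 40, "min_p": 0.0},
    "precise":  {"temperature": 0.3, "top_p": 0.8,  "top_k": 20, "min_p": 0.05},
}

def sampling_kwargs(name: str) -> dict:
    """Split a preset into standard kwargs plus extra_body for the OpenAI SDK."""
    p = PRESETS[name]
    return {
        "temperature": p["temperature"],
        "top_p": p["top_p"],
        "extra_body": {"top_k": p["top_k"], "min_p": p["min_p"]},
    }

# Usage:
# client.chat.completions.create(model="gemma-4-26b-a4b",
#                                messages=messages, **sampling_kwargs("precise"))
```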

06

Thinking configuration

Per-request control of chain-of-thought reasoning with budget limits.

| enable_thinking | Template behavior | Result |
| --- | --- | --- |
| True | Injects <\|think\|> in system turn | Model may produce reasoning_content (adaptive) |
| False | Prepends empty thinking block to model turn | Thinking reliably suppressed |
| Not set | Server default (thinking ON via interleaved template) | Same as True |

Adaptive reasoning: Setting enable_thinking: True does not guarantee thinking output. Gemma 4 decides adaptively whether reasoning is needed; it may skip thinking for trivial questions (see Google's thinking capabilities doc). Setting enable_thinking: False is deterministic and always suppresses thinking, as in the sketch below.
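
A minimal request that forces the deterministic no-thinking path (reusing a client configured as in the earlier examples):

```python
# Suppress thinking deterministically for a latency-sensitive request.
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
# With thinking disabled, reasoning_content is absent from the message.
```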

| thinking_budget_tokens | Behavior |
| --- | --- |
| Not set / -1 | Unlimited thinking (model decides when to stop) |
| 0 | Immediately end thinking (similar to enable_thinking: False) |
| N > 0 | Think for at most N tokens, then forced to answer |
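
The Anthropic-compatible form from the extension-parameter table works the same way via extra_body (a sketch; the budget value is illustrative):

```python
# Anthropic-compatible alternative to thinking_budget_tokens.
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Plan the rollout in three steps."}],
    extra_body={"thinking": {"type": "enabled", "budget_tokens": 512}},
)
```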

Verified budget test results (prompt: "What is 23 * 47?")

| Budget | Reasoning chars | Tokens | Behavior |
| --- | --- | --- | --- |
| 0 | 44 | 26 | Only budget message, no actual thinking |
| 32 | 124 | 59 | Brief thinking + budget cutoff |
| 128 | 359 | 155 | Moderate thinking + budget cutoff |
| unlimited | 1004 | 522 | Full unconstrained thinking |
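
A sketch of how such a sweep can be reproduced (prompt and budget values taken from the table; -1 requests unlimited thinking):

```python
# Reproduce the budget sweep: same prompt, varying thinking_budget_tokens.
for budget in (0, 32, 128, -1):  # -1 = unlimited
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[{"role": "user", "content": "What is 23 * 47?"}],
        extra_body={"thinking_budget_tokens": budget},
    )
    reasoning = resp.choices[0].message.model_dump().get("reasoning_content") or ""
    print(f"budget={budget}: {len(reasoning)} reasoning chars, "
          f"{resp.usage.completion_tokens} completion tokens")
```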

Build patch required

The Gemma 4 dedicated parser in llama.cpp (PR #21418) omits thinking_start_tag / thinking_end_tag that the reasoning budget sampler needs (reasoning budget wiring). The build_llama_server() function patches common/chat.cpp to add <|channel>thought\n and <channel|> as thinking tags before compilation. Without this patch, thinking_budget_tokens is silently ignored.

07

Response format

Understanding the response structure including reasoning_content.

non-streaming-response.json
```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "gemma-4-26b-a4b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think about this...",
      "tool_calls": null
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```

streaming-chunks.json
1{"choices":[{"delta":{"reasoning_content":"Let me"},"index":0}]}
2{"choices":[{"delta":{"reasoning_content":" think..."},"index":0}]}
3{"choices":[{"delta":{"content":"The answer"},"index":0}]}
4{"choices":[{"delta":{"content":" is 42."},"index":0}]}
5{"choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

| Field | Standard | Description |
| --- | --- | --- |
| message.content | Yes | Response text |
| message.tool_calls | Yes | Tool/function calls |
| message.reasoning_content | Extension | Thinking/reasoning text (server behavior documented in the llama.cpp server README; field shape follows common OpenAI-compatible extensions) |
| delta.reasoning_content | Extension | Streaming thinking tokens |
| usage | Yes | Token counts |

Reading reasoning_content in Python

The OpenAI SDK doesn't have a typed reasoning_content field. Use model_dump():

```python
# Non-streaming
raw = response.choices[0].message.model_dump()
thinking = raw.get("reasoning_content") or ""

# Streaming
raw_delta = chunk.choices[0].delta.model_dump()
thinking = raw_delta.get("reasoning_content") or ""
```

Each streaming chunk contains either reasoning_content OR content, never both. Thinking chunks always precede response chunks.
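
Putting both together, a streaming loop that routes thinking and answer tokens separately (a sketch reusing the client from the earlier examples):

```python
# Route thinking and answer tokens separately while streaming.
stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is 23 * 47?"}],
    stream=True,
)
thinking, answer = [], []
for chunk in stream:
    if not chunk.choices:
        continue  # e.g. a usage-only chunk
    delta = chunk.choices[0].delta.model_dump()
    if delta.get("reasoning_content"):
        thinking.append(delta["reasoning_content"])  # thinking arrives first
    elif delta.get("content"):
        answer.append(delta["content"])
print("thinking:", "".join(thinking))
print("answer:  ", "".join(answer))
```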

08

VRAM and concurrency trade-offs

  • At fixed ctx-size, increasing --parallel multiplies active KV cache pressure; see the llama-server documentation for how total context is divided across slots.
  • If you hit memory pressure, keep the q8_0 cache and reduce per-slot context first, before reducing slot count.
  • When throughput is stable but tail latency climbs, tune batch size and slot count together; changing one in isolation usually misleads.
  • The reference footprint is ~25 GB active usage, leaving a practical safety margin on the L40S's 48 GB for workload spikes.

| KV cache format | VRAM per 16K ctx | Quality impact |
| --- | --- | --- |
| f16 (default) | ~6 GB | Baseline |
| q8_0 (recommended) | ~3 GB | Negligible degradation |
| q4_0 | ~1.5 GB | Minor degradation |

Important: ctx-size / parallel interaction

llama-server divides --ctx-size by --parallel:

--ctx-size 65536  --parallel 4  →  16,384 tokens per slot

If you want 16K per slot with 4 slots, you must set --ctx-size to 4 × 16384 = 65536.
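
The same arithmetic as a trivial sketch, useful as an assertion in a deploy script:

```python
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    """llama-server divides --ctx-size evenly across --parallel slots."""
    return ctx_size // parallel

assert per_slot_ctx(65536, 4) == 16384    # the example above
assert per_slot_ctx(262144, 4) == 65536   # the deploy.py baseline
```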

09

Security and environment

  • Inject API_KEY using Modal secrets; never hardcode it in code or a Docker layer (see the sketch below).
  • Store HF_TOKEN in huggingface-secret for artifact pulls.
  • Use huggingface-hub[hf_xet] + HF_XET_HIGH_PERFORMANCE=1 for large-file download performance.
  • Set HF_HUB_OFFLINE=1 for serving paths that should never redownload artifacts at runtime.
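
A minimal sketch of wiring these secrets and env vars in Modal; the secret name `huggingface-secret` is from this stack, while the API-key secret name and the image definition are illustrative:

```python
import modal

app = modal.App("gemma-4-26b-a4b")

# Illustrative image; the real deploy.py also builds llama-server and
# downloads model artifacts.
image = (
    modal.Image.debian_slim()
    .pip_install("huggingface-hub[hf_xet]")
    .env({
        "HF_XET_HIGH_PERFORMANCE": "1",  # faster large-file downloads
        "HF_HUB_OFFLINE": "1",           # serving path never redownloads
    })
)

@app.function(
    image=image,
    gpu="L40S",
    secrets=[
        modal.Secret.from_name("huggingface-secret"),  # provides HF_TOKEN
        modal.Secret.from_name("gemma-api-key"),       # provides API_KEY (name illustrative)
    ],
)
def serve():
    ...  # launch llama-server as shown in the flag reference above
```
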
10

Throughput and tuning loop

Treat these as directional checkpoints; always validate on your own prompt mix and concurrency profile.

| Profile | Tokens/s | TTFT |
| --- | --- | --- |
| Warm single-user (L40S baseline) | Measure on your prompts | Measure per release |
| 4 parallel slots | Track aggregate tokens/s | Watch p95/p99 TTFT |
| Post-snapshot restore | N/A | ~1 s to ready state |

  1. Establish warm-path baseline on fixed prompt set.
  2. Tune KV cache + context, then retest.
  3. Tune parallel slots next; compare queueing and p95 latency.
  4. Inspect /metrics for token counters and error spikes before promoting (see the snippet below).
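
A quick way to pull the Prometheus-style counters exposed by --metrics (a sketch; the base URL is a placeholder for your Modal deployment):

```python
# Fetch Prometheus-style counters from /metrics (enabled by --metrics).
import urllib.request

url = "https://your-workspace--gemma.modal.run/metrics"
with urllib.request.urlopen(url, timeout=5) as r:
    for line in r.read().decode().splitlines():
        if line and not line.startswith("#"):  # skip HELP/TYPE comment lines
            print(line)
```
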
11

Cost and release practices

  • Scale-to-zero plus memory snapshots is the default for bursty internal traffic.
  • Set always-on replicas only when first-token SLO is stricter than idle wake tolerance.
  • Keep staging and production apps separate for binary upgrades and snapshot refreshes.
  • Deep deployment internals are documented on the Modal deployment page.
12

External references

Sources cited inline on this tab.