Gemma-4-26B-A4B-it-GGUF on Modal
Runtime tuning
llama.cpp flags, OpenAI and extension API parameters, sampling, thinking budgets, and Modal scaling knobs.
This page is the operator baseline for the reference deploy.py stack: llama-server as a subprocess, L40S baseline GPU, Modal memory snapshots, and exact runtime flags. Tune from these defaults before trying bespoke experiments.
Baseline constants from deploy.py
| Constant | Value | Why this baseline exists |
|---|---|---|
| GPU | L40S | 48 GB headroom supports weights + mmproj + KV + system overhead. |
| N_PARALLEL | 4 | Good interactive concurrency without saturating KV memory too early. |
| PER_SLOT_CTX | 65536 | Per-user context target before reducing slot count. |
| TOTAL_CTX | N_PARALLEL * PER_SLOT_CTX = 262144 | llama-server divides this across parallel slots. |
| KV_CACHE_TYPE | q8_0 | Reduces KV pressure compared with f16 cache. |
| BATCH_SIZE | 2048 | Balanced throughput without runaway memory pressure. |
| SERVED_NAME | gemma-4-26b-a4b | Stable alias for clients and IDE integrations. |
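Expressed as Python, the constants compose like this (a sketch mirroring the table above, not the verbatim deploy.py):

```python
# Baseline constants from the table above (sketch; deploy.py may differ in layout).
GPU = "L40S"                            # 48 GB VRAM baseline
N_PARALLEL = 4                          # parallel server slots
PER_SLOT_CTX = 65_536                   # context tokens per slot
TOTAL_CTX = N_PARALLEL * PER_SLOT_CTX   # 262_144; llama-server divides this across slots
KV_CACHE_TYPE = "q8_0"                  # quantized KV cache
BATCH_SIZE = 2048
SERVED_NAME = "gemma-4-26b-a4b"         # stable alias clients see
```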
llama-server flag reference
Flag semantics follow upstream llama-server documentation. The --no-mmap default here pairs with Modal snapshot restore.
| Flag | Value | Reason |
|---|---|---|
| --model | /models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf | Exact model artifact path used in deploy.py. |
| --mmproj | /models/mmproj-F32.gguf | Enables multimodal image flow. |
| --chat-template-file | /app/llama-server-bin/interleaved.jinja | Keeps interleaved thinking + tool flow stable. |
| --no-mmap | enabled | Required for reliable snapshot restore behavior. |
| --n-gpu-layers | 999 | Offload all possible layers to GPU. |
| --parallel | 4 | Balanced throughput for a single replica. |
| --ctx-size | 262144 | Computed as TOTAL_CTX (4 × 65536). |
| --batch-size | 2048 | Maintains decode throughput while controlling memory growth. |
| --cache-type-k/v | q8_0 | Reduces KV memory pressure. |
| --flash-attn | on | Newer builds require explicit value, not bare flag. |
| --jinja | enabled | Required for Gemma chat templates and tools formatting. |
| --reasoning | on | Expose reasoning traces in API responses. |
| --reasoning-budget-message | "... reasoning budget reached, answering now." | Graceful cutoff text when thinking budget is exhausted. |
| --metrics | enabled | Prometheus-style telemetry on /metrics. |
| --alias | gemma-4-26b-a4b | Stable external model id. |
1[2 "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",3 "--mmproj", "/models/mmproj-F32.gguf",4 "--chat-template-file", "/app/llama-server-bin/interleaved.jinja",5 "--parallel", "4",6 "--ctx-size", "262144",7 "--flash-attn", "on",8 "--cache-type-k", "q8_0",9 "--cache-type-v", "q8_0",10 "--no-mmap",11 "--metrics",12 "--jinja",13 "--reasoning", "on"14]
OpenAI API parameters
Standard parameters fully supported by the endpoint.
Names and shapes follow the official OpenAI Chat Completions contract; this stack documents where the server matches it and where extensions apply.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | Model name: "gemma-4-26b-a4b" |
| messages | list | required | Conversation messages array |
| stream | bool | false | Enable streaming responses |
| temperature | float | 1.0 | Sampling temperature. Gemma 4 recommended: 1.0 |
| top_p | float | 0.95 | Nucleus sampling. Gemma 4 recommended: 0.95 |
| max_tokens | int | model default | Maximum tokens to generate |
| stop | str|list | none | Stop sequences |
| seed | int | none | Reproducibility seed |
| frequency_penalty | float | 0 | Penalize frequent tokens |
| presence_penalty | float | 0 | Penalize already-used tokens |
| tools | list | none | Tool/function definitions |
| tool_choice | str|object | "auto" | Tool selection strategy (see limitations in Features tab) |
| response_format | object | none | {"type":"json_object"} for JSON mode |
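A minimal request sketch with the OpenAI Python SDK; the base_url and api_key values are placeholders for your Modal deployment, not real endpoints:

```python
from openai import OpenAI

# Placeholders: point base_url at your Modal deployment's /v1 root
# and supply the API key injected via Modal secrets.
client = OpenAI(
    base_url="https://<your-modal-app>.modal.run/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",  # SERVED_NAME alias from the baseline table
    messages=[{"role": "user", "content": "Explain KV cache quantization briefly."}],
    temperature=1.0,  # Gemma 4 recommended
    top_p=0.95,       # Gemma 4 recommended
    max_tokens=512,
)
print(response.choices[0].message.content)
```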
Extension parameters
llama-server extensions beyond the OpenAI spec. Pass via extra_body.
These parameters are not part of the OpenAI Chat Completions spec. Pass them via extra_body in the Python SDK or directly in the JSON body for curl/HTTP clients.
| Parameter | Type | Default | Description |
|---|---|---|---|
| chat_template_kwargs.enable_thinking | bool | true | Per-request thinking toggle |
| thinking_budget_tokens | int | -1 | Max thinking tokens per request (-1 = unlimited, 0 = disable, N = limit) |
| thinking.type | str | -- | Anthropic-compatible: set to "enabled" |
| thinking.budget_tokens | int | 10000 | Anthropic-compatible thinking budget |
| top_k | int | 40 | Top-K sampling. Gemma 4 recommended: 64 |
| min_p | float | 0 | Minimum probability threshold |
| repeat_penalty | float | 1.0 | Repetition penalty |
```python
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve this step by step"}],
    max_tokens=2048,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,  # limit thinking to 256 tokens
        "top_k": 64,  # Gemma 4 recommended
        "min_p": 0.05,
        "repeat_penalty": 1.1,
    },
)
```
Sampling presets
Recommended parameter combinations from the Gemma 4 model card.
| Preset | temperature | top_p | top_k | min_p | Use case |
|---|---|---|---|---|---|
| default | 1.0 | 0.95 | 64 | 0 | Gemma 4 recommended baseline |
| creative | 1.2 | 0.95 | 64 | 0 | Writing & brainstorming |
| balanced | 0.7 | 0.9 | 40 | 0 | General use |
| precise | 0.3 | 0.8 | 20 | 0.05 | Factual & deterministic |
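A sketch of the presets as reusable request kwargs (names mirror the table; note that top_k and min_p are extensions and must travel via extra_body):

```python
# Preset values copied from the table above; the dict shape is illustrative.
SAMPLING_PRESETS = {
    "default":  {"temperature": 1.0, "top_p": 0.95, "extra_body": {"top_k": 64, "min_p": 0}},
    "creative": {"temperature": 1.2, "top_p": 0.95, "extra_body": {"top_k": 64, "min_p": 0}},
    "balanced": {"temperature": 0.7, "top_p": 0.9,  "extra_body": {"top_k": 40, "min_p": 0}},
    "precise":  {"temperature": 0.3, "top_p": 0.8,  "extra_body": {"top_k": 20, "min_p": 0.05}},
}

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "List the flags that affect KV memory."}],
    **SAMPLING_PRESETS["precise"],
)
```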
From the Gemma 4 model card
The recommended default settings are temperature=1.0, top_p=0.95, top_k=64. These intentionally differ from typical OpenAI defaults (top_p=1.0 and no top_k). Using these values produces output that matches the model's training distribution per the published GGUF catalog.
Thinking configuration
Per-request control of chain-of-thought reasoning with budget limits.
| enable_thinking | Template behavior | Result |
|---|---|---|
| True | Injects <|think|> in system turn | Model may produce reasoning_content (adaptive) |
| False | Prepends empty thinking block to model turn | Thinking reliably suppressed |
| Not set | Server default (thinking ON via interleaved template) | Same as True |
Adaptive reasoning: Setting enable_thinking: True does not guarantee thinking output. Gemma 4 decides adaptively whether reasoning is needed—it may skip thinking for trivial questions (see Google's thinking capabilities doc). Setting enable_thinking: False is deterministic and always suppresses thinking.
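A sketch of the deterministic off switch, using the same client as above:

```python
# Deterministically suppress thinking for a latency-sensitive call.
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
# With thinking suppressed, reasoning_content should be absent or empty.
```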
| thinking_budget_tokens | Behavior |
|---|---|
| Not set / -1 | Unlimited thinking (model decides when to stop) |
| 0 | Immediately end thinking (similar to enable_thinking: False) |
| N > 0 | Think for at most N tokens, then forced to answer |
Verified budget test results (prompt: "What is 23 * 47?")
| Budget | Reasoning chars | Tokens | Behavior |
|---|---|---|---|
| 0 | 44 | 26 | Only budget message, no actual thinking |
| 32 | 124 | 59 | Brief thinking + budget cutoff |
| 128 | 359 | 155 | Moderate thinking + budget cutoff |
| unlimited | 1004 | 522 | Full unconstrained thinking |
Build patch required
The dedicated Gemma 4 parser in llama.cpp (PR #21418) omits the thinking_start_tag / thinking_end_tag fields that the reasoning budget sampler needs (see the reasoning budget wiring in PR #20297). The build_llama_server() function patches common/chat.cpp to add <|channel>thought\n and <channel|> as thinking tags before compilation. Without this patch, thinking_budget_tokens is silently ignored.
Response format
Understanding the response structure including reasoning_content.
```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "gemma-4-26b-a4b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think about this...",
      "tool_calls": null
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```
1{"choices":[{"delta":{"reasoning_content":"Let me"},"index":0}]}2{"choices":[{"delta":{"reasoning_content":" think..."},"index":0}]}3{"choices":[{"delta":{"content":"The answer"},"index":0}]}4{"choices":[{"delta":{"content":" is 42."},"index":0}]}5{"choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}
| Field | Standard | Description |
|---|---|---|
| message.content | Yes | Response text |
| message.tool_calls | Yes | Tool/function calls |
| message.reasoning_content | Extension | Thinking/reasoning text (server behavior documented in the llama.cpp server README; field shape follows common OpenAI-compatible extensions) |
| delta.reasoning_content | Extension | Streaming thinking tokens |
| usage | Yes | Token counts |
Reading reasoning_content in Python
The OpenAI SDK doesn't have a typed reasoning_content field. Use model_dump():
```python
# Non-streaming
raw = response.choices[0].message.model_dump()
thinking = raw.get("reasoning_content") or ""

# Streaming
raw_delta = chunk.choices[0].delta.model_dump()
thinking = raw_delta.get("reasoning_content") or ""
```

Each streaming chunk contains either reasoning_content or content, never both. Thinking chunks always precede response chunks.
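A minimal streaming consumer sketch built on that contract (same client as above; the buffers and the empty-choices guard are illustrative):

```python
thinking_parts: list[str] = []
answer_parts: list[str] = []

stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is 23 * 47?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # e.g. a trailing usage-only chunk
        continue
    delta = chunk.choices[0].delta.model_dump()
    # Each chunk carries reasoning_content OR content, never both.
    if delta.get("reasoning_content"):
        thinking_parts.append(delta["reasoning_content"])
    elif delta.get("content"):
        answer_parts.append(delta["content"])

print("thinking:", "".join(thinking_parts))
print("answer:", "".join(answer_parts))
```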
VRAM and concurrency trade-offs
- At fixed `--ctx-size`, increasing `--parallel` multiplies active KV cache pressure; see the llama-server README for how total context is divided across slots.
- If you hit memory pressure, first keep the `q8_0` cache and reduce per-slot context before reducing slot count.
- When throughput is stable but tail latency climbs, tune batch size and slot count together; changing one in isolation usually misleads.
- Reference footprint is ~25 GB of active usage, leaving a practical safety margin on the L40S's 48 GB for workload spikes.
| KV cache format | VRAM per 16K ctx | Quality impact |
|---|---|---|
| f16 (default) | ~6 GB | Baseline |
| q8_0 (recommended) | ~3 GB | Negligible degradation |
| q4_0 | ~1.5 GB | Minor degradation |
Important: ctx-size / parallel interaction
llama-server divides --ctx-size by --parallel:
--ctx-size 65536 --parallel 4 → 16,384 tokens per slot
If you want 16K per slot with 4 slots, you must set --ctx-size to 4 × 16384 = 65536.
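A quick sanity check of that division in Python (a sketch; the helper name is illustrative):

```python
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    """llama-server splits --ctx-size evenly across --parallel slots."""
    return ctx_size // parallel

# The example from the callout: 65536 total context over 4 slots.
assert per_slot_ctx(65_536, 4) == 16_384

# To size the total from a per-slot target, multiply the other way.
target_per_slot, slots = 16_384, 4
print("--ctx-size", target_per_slot * slots)  # 65536
```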
Modal runtime knobs
Snapshot and web-serving primitives are described in Modal memory snapshots and Modal web servers; volumes for cached binaries live in Modal Volumes.
| Parameter | Value | Reason |
|---|---|---|
| gpu | L40S | Reference deployment baseline in deploy.py. |
| enable_memory_snapshot | true | Captures warmed state for faster idle restores. |
| enable_gpu_snapshot | true | Enables GPU-aware snapshot flow on Modal. |
| @modal.enter(snap=True) | startup() | Snapshot boundary after health + warmup completes. |
| @modal.enter(snap=False) | restore() | Post-restore health check before serving traffic. |
| max_containers | 3 | Burst handling with spend guardrail. |
| scaledown_window | 300s | Reduces idle spend while still smoothing short traffic gaps. |
| @modal.concurrent | max_inputs=10 | Keeps container-level queueing controlled before server saturation. |
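A skeletal sketch of how these knobs attach in Modal's class-based API. Structure only: bodies are elided, the app and class names are made up, and GPU snapshotting has shipped behind experimental options in some Modal releases, so verify the exact spelling against current Modal docs:

```python
import modal

app = modal.App("gemma-4-26b-a4b")  # hypothetical app name

@app.cls(
    gpu="L40S",
    enable_memory_snapshot=True,
    # GPU snapshot flag: beta feature, spelling has varied across Modal releases.
    experimental_options={"enable_gpu_snapshot": True},
    max_containers=3,
    scaledown_window=300,
)
@modal.concurrent(max_inputs=10)
class LlamaServer:
    @modal.enter(snap=True)
    def startup(self):
        # Launch llama-server, poll /health, run warmup; snapshot is taken after this.
        ...

    @modal.enter(snap=False)
    def restore(self):
        # Runs after snapshot restore: re-verify health before serving traffic.
        ...

    @modal.web_server(8080)
    def serve(self):
        # llama-server is already listening (started in startup); Modal proxies to it.
        ...
```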
Security and environment
- Inject `API_KEY` via Modal secrets; never hardcode it in code or a Docker layer.
- Store `HF_TOKEN` in the `huggingface-secret` secret for artifact pulls.
- Use `huggingface-hub[hf_xet]` with `HF_XET_HIGH_PERFORMANCE=1` for large-file download performance.
- Set `HF_HUB_OFFLINE=1` on serving paths that should never redownload artifacts at runtime.
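A sketch of attaching those secrets to a Modal function (reusing the app object from the skeleton above; the `gemma-api-key` secret name is hypothetical):

```python
import modal

secrets = [
    modal.Secret.from_name("huggingface-secret"),  # provides HF_TOKEN
    modal.Secret.from_name("gemma-api-key"),       # hypothetical name; provides API_KEY
]

@app.function(secrets=secrets)
def pull_artifacts():
    import os
    # Secrets surface as environment variables only inside the container,
    # so neither token ever lands in the image or the repo.
    assert os.environ["HF_TOKEN"]
```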
Throughput and tuning loop
Treat these as directional checkpoints; always validate on your own prompt mix and concurrency profile.
| Profile | Token/s | TTFT |
|---|---|---|
| Warm single-user (L40S baseline) | Measure on your prompts | Measure per release |
| 4 parallel slots | Track aggregate tokens/s | Watch p95/p99 TTFT |
| Post-snapshot restore | N/A | ~1s ready state |
- Establish a warm-path baseline on a fixed prompt set.
- Tune KV cache + context, then retest.
- Tune parallel slots next; compare queueing and p95 latency.
- Inspect `/metrics` for token counters and error spikes before promoting (see the polling sketch below).
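A minimal polling sketch for that step (assumes local access to the server's default port 8080; the `llamacpp:` metric prefix is an assumption to adjust per build):

```python
import urllib.request

def dump_llama_metrics(base_url: str = "http://127.0.0.1:8080") -> None:
    """Print llama-server's Prometheus-style counters (enabled by --metrics)."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            # Skip comment/help lines; keep llamacpp-prefixed samples.
            if line and not line.startswith("#") and line.startswith("llamacpp:"):
                print(line)
```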
Cost and release practices
- Scale-to-zero plus memory snapshots is the default for bursty internal traffic.
- Set always-on replicas only when first-token SLO is stricter than idle wake tolerance.
- Keep staging and production apps separate for binary upgrades and snapshot refreshes.
- Deep deployment internals are documented in Modal deployment.
External references
Sources cited inline on this tab.
- OpenAI API: Create chat completion (platform.openai.com)
- llama.cpp server README (github.com)
- llama.cpp GitHub (github.com)
- Modal Volumes (modal.com)
- Modal memory snapshots (modal.com)
- Modal: Secrets (modal.com)
- Modal Web Server (modal.com)
- Gemma 4 Thinking Capabilities (ai.google.dev)
- Gemma 4 Prompt Formatting Guide (ai.google.dev)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)
- llama.cpp Gemma 4 Parser, PR #21418 (github.com)
- llama.cpp Reasoning Budget, PR #20297 (github.com)