Gemma-4-26B-A4B-it-GGUF on Modal

Stack overview

Model (MoE, GGUF), VRAM, benchmarks, and how the serving stack is put together.

Production deployment of gemma-4-26B-A4B-it-GGUF on Modal with llama.cpp, GPU memory snapshots, and scale-to-zero lifecycle control. This overview explains why the stack works, what hardware budget it needs, and which design decisions matter before deep tuning.

01

At a glance

Core specs, runtime choices, and deployment envelope for this guide.

Item | Value | Why it matters
Model | gemma-4-26B-A4B-it-GGUF | MoE architecture tuned for strong reasoning with smaller active compute footprint.
Repo | unsloth/gemma-4-26B-A4B-it-GGUF | Published GGUF artifact used directly by llama.cpp serving flow.
Engine | llama-server (llama.cpp) | OpenAI-compatible HTTP surface and simple container process model.
Quantization | UD-Q4_K_XL | Aggressive memory reduction with practical quality retention.
Cold-start strategy | Modal memory snapshots | Restore warmed process instead of full weight-load startup every time.
Critical launch flag | --no-mmap | Mandatory for stable checkpoint/restore behavior in this deployment design.

Model lineage

Gemma 4 instruction-tuned MoE (2026 release); licensing and catalog details on the Hugging Face model card.

Serving objective

Interactive response profile with scale-to-zero economics and deterministic operational behavior.

Default hardware envelope

L40S (48GB) is the baseline in deploy.py and keeps practical headroom for context, slots, and vision projection.

02

Full capabilities

Everything this deployment can do out of the box.

Core inference

  • Text chat — streaming and non-streaming, OpenAI-compatible API
  • 256K token context — long-form documents, repos, conversations
  • 140+ language support — multilingual out of the box
  • Flash attention — avoids materializing the full O(n²) attention matrix, keeping large-context inference memory-efficient
  • JSON structured output via response_format (Responses-style support in llama.cpp); a minimal client sketch follows this list
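
The core inference surface is plain OpenAI wire format. A minimal client sketch, assuming a placeholder endpoint URL, bearer token, and model name (llama-server serves whatever single model it was launched with):

```python
# Placeholder endpoint and credentials; llama-server serves the single model it was
# launched with, so the model name here is advisory.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-modal-app.modal.run/v1",  # placeholder
    api_key="your-bearer-token",                     # placeholder
)

# Streaming text chat
stream = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Summarize the GGUF format in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

# JSON structured output via response_format
resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Describe Python as JSON with keys name and year."}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```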

Vision / multimodal

  • Image input via image_url content blocks (URL or base64) — see the request sketch after this list
  • Multiple images in a single message
  • Image captioning, OCR, chart/diagram reasoning
  • Vision + thinking combined analysis
  • Audio not supported (E2B/E4B variants only)
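
Image input rides the same chat endpoint as content blocks. A request sketch, again with placeholder endpoint, token, and file name; both remote URLs and base64 data URIs are accepted:

```python
# Image input via image_url content blocks; endpoint, token, and file name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            # A second image in the same message is also accepted:
            # {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```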

Reasoning and thinking

  • Adaptive thinking — model decides when reasoning is needed (Google AI thinking guide).
  • Per-request toggle via enable_thinking (see the sketch after this list)
  • Separate reasoning_content field in responses
  • Thinking budget control with thinking_budget_tokens
  • Interleaved thinking — preserves reasoning between tool calls
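
Thinking controls are non-standard fields, so with the OpenAI SDK they travel in extra_body; how reasoning_content is surfaced can vary by SDK and llama.cpp version, so read it defensively. A sketch under those assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")

resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9? Reason it through."}],
    extra_body={
        "enable_thinking": True,          # per-request reasoning toggle
        "thinking_budget_tokens": 1024,   # cap on reasoning tokens
    },
)

msg = resp.choices[0].message
print("answer:", msg.content)
# reasoning_content is a non-standard field, so read it defensively.
print("reasoning:", getattr(msg, "reasoning_content", None))
```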

Tool / function calling

  • Native Gemma 4 parser with OpenAI-format tool_calls (upstream parser PR); a request sketch follows this list
  • Single and parallel tool calls
  • Multi-step tool chains
  • tool_choice="auto" works reliably
  • tool_choice="required" has known gaps

IDE and client integration

  • Cursor IDE Agent mode — full tool calling via /v1/responses
  • Cursor Ask mode — Q&A via /v1/chat/completions
  • Codex CLI compatible — OpenAI Responses API wire format
  • LangChain, LiteLLM, any OpenAI SDK client
  • TypeScript, Python, Go, curl — all supported

Deployment operations

  • GPU memory snapshots — 5-15s cold starts instead of 60-120s
  • Scale-to-zero economics with fast restore
  • Prometheus metrics on /metrics
  • Health endpoint for load balancers
  • Bearer token authentication (probe sketch after this list)
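
Operational probes in the same spirit, using only the standard library; whether /metrics and /v1/models sit behind the bearer token depends on how the server is launched, so treat the auth split here as an assumption:

```python
import urllib.request

BASE = "https://your-modal-app.modal.run"   # placeholder
TOKEN = "your-bearer-token"                 # placeholder

def get(path: str, auth: bool = False) -> str:
    req = urllib.request.Request(BASE + path)
    if auth:
        req.add_header("Authorization", f"Bearer {TOKEN}")
    with urllib.request.urlopen(req, timeout=30) as r:
        return r.read().decode()

print(get("/health"))                 # readiness probe used by load balancers
print(get("/metrics")[:400])          # Prometheus exposition text (truncated here)
print(get("/v1/models", auth=True))   # authenticated API surface
```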

Core architecture rests on three pillars: GGUF portability, MoE efficiency, and quantization economics.

03

MoE architecture deep dive

Understanding how Mixture-of-Experts achieves frontier quality with 4B-model decode speed.

Mixture-of-Experts (MoE) is the key architectural innovation that makes Gemma 4 economically practical. Instead of activating all parameters for every token (like dense models), MoE routes each token through a subset of specialized "expert" networks.

Gemma 4 26B-A4B specifics: The model contains 128 total experts, but only 8 experts + 1 shared expert are activated per token. This means only 3.8B parameters fire per token out of 25.2B total.

Practical implication: You get the reasoning quality of a 26B model with the decode speed of a ~4B model. Memory bandwidth during generation is proportional to active parameters, not total parameters—this is why MoE models achieve much higher tokens/second than their total parameter count would suggest.

Property | Value | Implication
Total parameters | 25.2B | Model capacity / knowledge stored
Active parameters per token | 3.8B | Determines decode throughput and memory bandwidth
Total experts | 128 | Pool of specialized subnetworks
Active experts per token | 8 + 1 shared | Router selects best 8, shared always active
Efficiency ratio | ~6.6x | Quality of 25B, speed of 4B
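
To make the routing concrete, here is an illustrative top-k MoE layer in PyTorch. It is a toy sketch of the mechanism described above (router scores, top-8 selection, one always-on shared expert), not the Gemma 4 implementation, and the hidden size is deliberately tiny:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_EXPERTS, TOP_K, D = 128, 8, 64       # 128 experts, 8 routed per token; toy hidden size

experts = nn.ModuleList(nn.Linear(D, D) for _ in range(N_EXPERTS))
shared_expert = nn.Linear(D, D)        # the "+1 shared" expert, always active
router = nn.Linear(D, N_EXPERTS)

@torch.inference_mode()
def moe_layer(x: torch.Tensor) -> torch.Tensor:      # x: [tokens, D]
    logits = router(x)                                # [tokens, N_EXPERTS]
    weights, idx = torch.topk(logits, TOP_K, dim=-1)  # pick the best 8 experts per token
    weights = F.softmax(weights, dim=-1)              # normalize over the selected experts
    out = shared_expert(x)
    for t in range(x.shape[0]):                       # naive per-token loop, for clarity only
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])            # only TOP_K of N_EXPERTS ever run
    return out

print(moe_layer(torch.randn(4, D)).shape)             # torch.Size([4, 64])
```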

Why this matters for serving

Token generation speed is limited by memory bandwidth, not compute. With only 3.8B active parameters, Gemma 4 achieves 50-60+ tokens/second on L40S—comparable to much smaller models while maintaining frontier-level quality.

VRAM consideration

All 25.2B parameters must still be loaded into VRAM (the weights don't know which tokens will arrive). The efficiency gains show up as lower per-token memory bandwidth and faster decode, not as smaller weight storage.

04

Model and precision context

GGUF metadata portability, Unsloth Dynamic quantization, and complete VRAM accounting.

GGUF format embeds architecture + tokenizer metadata directly with weights, which reduces deployment drift and keeps runtime assumptions explicit. The format is llama.cpp native, requiring no conversion steps.

Unsloth Dynamic 2.0 quantization selectively applies higher precision to attention and critical layers while aggressively quantizing less sensitive experts. The UD-Q4_K_XL variant (see the published GGUF catalog) achieves better quality than standard Q4_K_M at similar size—17.1 GB fits comfortably on L40S with room for large KV caches.

Vision encoder: The mmproj-F32.gguf (~2 GB) handles image encoding. It's loaded separately via --mmproj and adds to the total VRAM footprint.

VRAM component | Size | Notes
Model weights (UD-Q4_K_XL) | ~17.1 GB | All 25.2B parameters, quantized
Multimodal projector (F32) | ~2.0 GB | Vision encoder for image input
KV cache (q8_0, 65K total ctx) | ~3.0 GB | 4 slots × 16K per slot baseline
CUDA buffers / scratch | ~3.0 GB | Kernel working memory
Total active | ~25.1 GB | Steady-state usage
Headroom on L40S (48 GB) | ~23 GB | Buffer for burst/tuning
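
The KV cache row is the one that moves when you tune ctx-size and --parallel. A back-of-envelope estimator sketch; the layer/head/dim constants below are placeholders chosen to roughly reproduce the ~3.0 GB figure above, not published Gemma 4 dimensions, so read the real values from the GGUF metadata before relying on the output:

```python
# PLACEHOLDER architecture constants, tuned only to land near the ~3.0 GB row above;
# read the real layer/head/dim values from the GGUF metadata before trusting the output.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 42, 4, 128
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.06, "q4_0": 0.56}   # approx., incl. block scales

def kv_cache_gb(total_ctx_tokens: int, cache_type: str = "q8_0") -> float:
    # K and V, per layer, per KV head, per head dimension, per token
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[cache_type]
    return total_ctx_tokens * per_token / 1e9

print(f"{kv_cache_gb(65_536, 'q8_0'):.1f} GB")    # baseline: 4 slots x 16K at q8_0
print(f"{kv_cache_gb(131_072, 'q4_0'):.1f} GB")   # the aggressive lever from section 09
```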

Pillar | Practical effect in production | Operational implication
GGUF packaging | Single artifact with model + tokenizer metadata | Lower config drift across local, staging, and Modal runtime.
MoE sparsity | High capacity with lower active compute per token | Better cost-quality position for interactive workloads.
UD-Q4_K_XL quantization | Large memory reduction while retaining utility | Makes single-GPU L40S-class serving feasible.

05

Benchmark results

Massive improvements over Gemma 3 across reasoning, math, and coding.

Gemma 4 26B-A4B represents a generational leap over Gemma 3 27B. The MoE architecture combined with improved training yields dramatic benchmark gains—particularly in math, coding, and scientific reasoning where the model shows 3-4x improvements in some categories. Figures in the table below follow the public reporting in the Gemma 4 launch post.

Benchmark | Gemma 4 26B-A4B | Gemma 3 27B | Improvement
MMLU Pro (multi-task language understanding) | 82.6% | 67.6% | +15.0%
AIME 2026, no tools (competition mathematics) | 88.3% | 20.8% | +67.5%
LiveCodeBench v6 (code generation) | 77.1% | 29.1% | +48.0%
Codeforces ELO (competitive programming) | 1718 | 110 | +1608
GPQA Diamond (graduate-level science) | 82.3% | 42.4% | +39.9%

Math and reasoning

AIME improvement from 20.8% to 88.3% represents a 4.2x gain—moving from below-average to competition-level performance.

Code generation

LiveCodeBench nearly triples. Codeforces ELO jumps from negligible (110) to Expert-level (1718).

Scientific reasoning

GPQA Diamond doubles, indicating strong graduate-level STEM comprehension.

06

Infrastructure topology

Modal volumes, ephemeral containers, and a warmed snapshot loop.

Modal Cloud
  Volume: model-cache
    • gemma-4-26B-A4B-it-*.gguf
    • mmproj-F32.gguf
  Volume: llama-server-binary
    • pinned build from LLAMA_CPP_TAG
  Container: llama-server process
    • launch with --no-mmap
    • health check + warmup
    • capture GPU snapshot
    • restore after scale-down

Client -> /v1/chat/completions (OpenAI-compatible)

This topology intentionally separates long-lived artifacts (weights + pinned binary) from ephemeral compute using Modal Volumes. That split is what keeps redeploys and horizontal scaling from repeatedly paying full artifact fetch costs.

GPU memory snapshotting happens after health + warmup, so restores resume a ready process, not a partially initialized process.
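
A condensed sketch of how this topology maps onto Modal primitives. Names, paths, and the launch flag set are illustrative, the runtime image is simplified (a real one needs the CUDA runtime libraries), and GPU snapshot enablement may require an additional experimental option depending on your Modal client version; the reference deploy.py remains the source of truth:

```python
import subprocess
import time
import urllib.request

import modal

app = modal.App("gemma4-llamacpp")

runtime_image = modal.Image.debian_slim().apt_install("libcurl4", "libgomp1")
weights = modal.Volume.from_name("model-cache", create_if_missing=True)
binary = modal.Volume.from_name("llama-server-binary", create_if_missing=True)

LAUNCH = [
    "/opt/llama/llama-server",                                # pinned build from the volume
    "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",  # placeholder filename
    "--mmproj", "/models/mmproj-F32.gguf",
    "--no-mmap",                                              # required for snapshot/restore (section 08)
    "--flash-attn", "on",
    "--n-gpu-layers", "999",
    "--ctx-size", "65536", "--parallel", "4",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "--batch-size", "2048",
    "--host", "0.0.0.0", "--port", "8080",
]

@app.cls(
    gpu="L40S",
    image=runtime_image,
    volumes={"/models": weights, "/opt/llama": binary},
    enable_memory_snapshot=True,   # GPU-side snapshots may need an extra experimental option
)
class LlamaServer:
    @modal.enter(snap=True)
    def start(self):
        self.proc = subprocess.Popen(LAUNCH)
        while True:                # block until warmed so the snapshot captures a ready server
            try:
                urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2)
                break
            except Exception:
                time.sleep(1)

    @modal.web_server(port=8080)
    def serve(self):
        pass                       # llama-server already listens; Modal proxies the port
```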

07

Serving stack

Where latency/cost behavior comes from in this architecture.

  • llama-server subprocess model: clear health lifecycle and controlled restarts (llama.cpp server).
  • Pinned binary build: reproducible behavior tied to a known llama.cpp commit hash.
  • Volume-backed assets: no repeated model download during serve-time startup (Modal Volumes).
  • OpenAI-compatible API surface: straightforward client/SDK integration.

Two-image strategy (builder vs runtime)

Builder image carries full toolchain for reproducible compile; runtime image ships only serving dependencies. This reduces attack surface, cold boot overhead, and snapshot size.
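
One way to realize the builder half of that split, sketched under assumptions: the builder image carries the CUDA toolchain, compiles llama-server at a pinned tag, and persists the binary to the llama-server-binary volume that the slim runtime image mounts. The tag, image base, and paths are placeholders:

```python
import modal

app = modal.App("gemma4-llamacpp-build")
binary = modal.Volume.from_name("llama-server-binary", create_if_missing=True)

LLAMA_CPP_TAG = "b4500"   # placeholder; pin to a known-good tag or commit

builder_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential", "libcurl4-openssl-dev")
)

@app.function(image=builder_image, volumes={"/out": binary}, timeout=3600)
def build_llama_server():
    import subprocess
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", LLAMA_CPP_TAG,
         "https://github.com/ggml-org/llama.cpp", "/src"], check=True)
    subprocess.run(["cmake", "-B", "/src/build", "-S", "/src", "-DGGML_CUDA=ON"], check=True)
    subprocess.run(["cmake", "--build", "/src/build", "--target", "llama-server", "-j"], check=True)
    subprocess.run(["cp", "/src/build/bin/llama-server", "/out/llama-server"], check=True)
    binary.commit()    # persist the pinned build for runtime containers
```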

08

Scale-to-zero and snapshots

Cold-path economics and why --no-mmap is non-negotiable in this setup.

Snapshot reliability depends on launching llama-server with --no-mmap. Memory-mapped startup may break restore semantics and collapse the cold-start strategy; see Modal's GPU memory snapshot guide for platform semantics.

Traditional cold path

Weight load + warmup each restart; often multi-minute startup.

Snapshot restore path

Restore pre-warmed process state, substantially reducing first-response wait after idle.

Phase | Typical duration class | Why it matters
Provision + mount | seconds | Common to both cold and restore paths.
Weight load + warmup | minutes | Dominant cost in traditional startup flow.
Snapshot restore | sub-minute class | Primary latency/cost win of this architecture.

09

Performance targets

Throughput optimizations and expected results on L40S.

Optimization | Flag | Impact
Flash attention | --flash-attn on | Avoids materializing the O(n²) attention matrix for long contexts
Full GPU offload | --n-gpu-layers 999 | 10-50x vs CPU-only inference
Batch processing | --batch-size 2048 | Maximizes prefill throughput
Quantized KV cache | --cache-type-k/v q8_0 | 50% VRAM savings vs f16
Parallel slots | --parallel 4 | 4 concurrent requests per container
Zero-proxy routing | @modal.web_server | No FastAPI/httpx overhead
MoE efficiency | Architecture | 3.8B active / 25.2B total → decode speed of a ~4B model

Metric | Expected range | Context
Decode throughput | ~50-60+ tok/s | Per slot on L40S (864 GB/s memory bandwidth); MoE architecture enables this despite 26B total params.
Prefill (prompt processing) | ~2000+ tok/s | With batch-size 2048; scales with prompt length.
Warm TTFT | sub-second | Time to first token on a warm container; varies with prompt size.
Cold start (with snapshots) | 5-15 seconds | Snapshot restore path; 60-120 seconds without snapshots.
Concurrent capacity | 4 slots × 3 containers = 12 | Per default configuration; tunable via --parallel and max_containers.

Tuning levers

  • More concurrency, less context per slot: --parallel 8 --ctx-size 131072 (8 slots × 16K each)
  • More context per slot, less concurrency: --parallel 2 --ctx-size 131072 (2 slots × 64K each)
  • Aggressive KV cache quantization: --cache-type-k q4_0 --cache-type-v q4_0 (75% VRAM savings vs f16)

These values are directional. Real numbers shift with prompt length, tool-calling patterns, vision usage, and concurrency configuration. Always benchmark on your workload before setting SLOs.
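
A quick workload-shaped probe for warm TTFT and decode rate over the streaming API. It measures end-to-end timings from the client side (network and chunking skew the numbers slightly), so use it as a sanity check against the table above rather than an SLO measurement; endpoint, token, and model name are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")

t0 = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Write a 300-word overview of the GGUF format."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
t_end = time.perf_counter()

if first_token_at is not None and chunks > 1:
    print(f"TTFT        ~{first_token_at - t0:.2f} s")
    print(f"decode rate ~{chunks / (t_end - first_token_at):.1f} chunks/s (≈ tok/s)")
```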

10

Strategic positioning

Quick framing versus larger online-serving stacks; full detail lives in Operate & compare.

Model stack | Strength | Typical trade-off
Gemma-4 GGUF + llama.cpp | Portable, cost-aware, low idle overhead with snapshots | Less raw max throughput than very large multi-GPU online stacks
GLM-5.1 FP8 + SGLang | High-concurrency large-scale serving | Significantly higher infra complexity and spend envelope
Dense large-model serving | Mature optimization ecosystem | Less parameter efficiency than MoE for similar capacity classes

Practical framing: GLM-scale stacks optimize for maximum throughput at very high infra complexity; this Gemma stack optimizes for deployability, responsiveness after idle, and lower operational surface area.

11

Design choices

The deliberate defaults this guide encodes.

  • Engine: llama.cpp for GGUF-native operation and straightforward process control.
  • Cold starts: scale-to-zero with snapshot restore instead of always-on baseline spend.
  • Launch safety: --no-mmap is treated as required, not optional.
  • Integration surface: OpenAI-compatible endpoint for SDK/client compatibility.
  • Reasoning mode support: align with Google's thinking docs and server-side budget wiring; keep compatibility with enable_thinking and budget controls, but treat response shape differences as a client concern.
  • Tool-calling safeguards: assume partial spec mismatch and enforce server-side validation/retry patterns (a validation sketch follows this list).
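
A sketch of that last safeguard: validate tool-call arguments against the declared JSON schema and re-prompt once on mismatch. The jsonschema dependency and the retry policy are assumptions for illustration, not part of the reference deployment:

```python
import json
from jsonschema import validate, ValidationError   # pip install jsonschema
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")
SCHEMA = {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
TOOLS = [{"type": "function", "function": {"name": "get_weather", "parameters": SCHEMA}}]

def call_with_validation(messages, retries=1):
    for _ in range(retries + 1):
        resp = client.chat.completions.create(
            model="gemma-4-26B-A4B-it", messages=messages, tools=TOOLS, tool_choice="auto")
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg                      # plain answer, nothing to validate
        try:
            for call in msg.tool_calls:
                validate(json.loads(call.function.arguments), SCHEMA)
            return msg                      # arguments conform to the declared schema
        except (ValidationError, json.JSONDecodeError) as err:
            messages = messages + [         # feed the error back and retry
                msg,
                {"role": "user", "content": f"Tool arguments were invalid ({err}); try again."},
            ]
    raise RuntimeError("tool-call arguments failed validation after retries")
```
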
12

References