Gemma-4-26B-A4B-it-GGUF on Modal
Stack overview
Model (MoE, GGUF), VRAM, benchmarks, and how the serving stack is put together.
Production deployment of gemma-4-26B-A4B-it-GGUF on Modal with llama.cpp, GPU memory snapshots, and scale-to-zero lifecycle control. This overview explains why the stack works, what hardware budget it needs, and which design decisions matter before deep tuning.
At a glance
Core specs, runtime choices, and deployment envelope for this guide.
| Item | Value | Why it matters |
|---|---|---|
| Model | gemma-4-26B-A4B-it-GGUF | MoE architecture tuned for strong reasoning with smaller active compute footprint. |
| Repo | unsloth/gemma-4-26B-A4B-it-GGUF | Published GGUF artifact used directly by llama.cpp serving flow. |
| Engine | llama-server (llama.cpp) | OpenAI-compatible HTTP surface and simple container process model. |
| Quantization | UD-Q4_K_XL | Aggressive memory reduction with practical quality retention. |
| Cold-start strategy | Modal memory snapshots | Restore warmed process instead of full weight-load startup every time. |
| Critical launch flag | --no-mmap | Mandatory for stable checkpoint/restore behavior in this deployment design. |
Model lineage
Gemma 4 instruction-tuned MoE (2026 release); licensing and catalog details on the Hugging Face model card.
Serving objective
Interactive response profile with scale-to-zero economics and deterministic operational behavior.
Default hardware envelope
L40S (48GB) is the baseline in deploy.py and keeps practical headroom for context, slots, and vision projection.
Full capabilities
Everything this deployment can do out of the box.
Core inference
- Text chat — streaming and non-streaming, OpenAI-compatible API
- 256K token context — long-form documents, repos, conversations
- 140+ language support — multilingual out of the box
- Flash attention — O(n) attention memory for efficient large-context inference
- JSON structured output via `response_format` (Responses-style support in llama.cpp)
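A minimal sketch of the chat surface described above, using the stock OpenAI Python SDK. The base URL, bearer token, and model name are placeholders for your own Modal deployment, not values defined by this guide.

```python
from openai import OpenAI

# Placeholders: point these at your own deployment.
client = OpenAI(
    base_url="https://your-workspace--gemma-server.modal.run/v1",
    api_key="YOUR_BEARER_TOKEN",
)

# Streaming text chat (llama-server serves a single model; the name is informational).
stream = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Summarize GGUF in two sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

# JSON structured output via response_format.
resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Return a JSON object with fields city and country for Paris."}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```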
Vision / multimodal
- Image input via `image_url` content blocks (URL or base64)
- Multiple images in a single message
- Image captioning, OCR, chart/diagram reasoning
- Vision + thinking combined analysis
- Audio not supported (E2B/E4B variants only)
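A sketch of the `image_url` content-block shape listed above. It assumes the same placeholder client as the earlier sketch, and the image URL is purely illustrative.

```python
from openai import OpenAI

client = OpenAI(base_url="https://your-workspace--gemma-server.modal.run/v1",  # placeholder
                api_key="YOUR_BEARER_TOKEN")

resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart and read off the axis labels."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            # Base64 input uses a data URI instead:
            # {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```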
Reasoning and thinking
- Adaptive thinking — model decides when reasoning is needed (Google AI thinking guide).
- Per-request toggle via `enable_thinking`
- Separate `reasoning_content` field in responses
- Thinking budget control with `thinking_budget_tokens`
- Interleaved thinking — preserves reasoning between tool calls
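A sketch of the per-request thinking controls named above. The field names (`enable_thinking`, `thinking_budget_tokens`, `reasoning_content`) come from this guide; whether they ride as top-level fields or inside `extra_body` depends on the llama.cpp build, so treat the wiring here as an assumption.

```python
from openai import OpenAI

client = OpenAI(base_url="https://your-workspace--gemma-server.modal.run/v1",  # placeholder
                api_key="YOUR_BEARER_TOKEN")

resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Prove that the sum of two even integers is even."}],
    # Passed via extra_body so the SDK forwards them untouched (assumed wiring).
    extra_body={"enable_thinking": True, "thinking_budget_tokens": 1024},
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # separate reasoning channel, if returned
print(msg.content)                              # final answer
```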
Tool / function calling
- Native Gemma 4 parser with OpenAI-format `tool_calls` (upstream parser PR)
- Single and parallel tool calls
- Multi-step tool chains
- `tool_choice="auto"` works reliably; `tool_choice="required"` has known gaps
IDE and client integration
- Cursor IDE Agent mode — full tool calling via `/v1/responses`
- Cursor Ask mode — Q&A via `/v1/chat/completions`
- Codex CLI compatible — OpenAI Responses API wire format
- LangChain, LiteLLM, any OpenAI SDK client
- TypeScript, Python, Go, curl — all supported
Deployment operations
- GPU memory snapshots — 5-15s cold starts instead of 60-120s
- Scale-to-zero economics with fast restore
- Prometheus metrics on `/metrics`
- Health endpoint for load balancers
- Bearer token authentication
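A small probe sketch for the operational endpoints listed above. The URL and token are placeholders; llama-server serves `/health`, and `/metrics` when metrics are enabled at launch.

```python
import requests

BASE = "https://your-workspace--gemma-server.modal.run"  # placeholder
HEADERS = {"Authorization": "Bearer YOUR_BEARER_TOKEN"}

# Readiness check, e.g. for a load balancer: 200 once the model is loaded.
print(requests.get(f"{BASE}/health", headers=HEADERS, timeout=10).status_code)

# Prometheus exposition text (available when the server is launched with metrics enabled).
print(requests.get(f"{BASE}/metrics", headers=HEADERS, timeout=10).text[:300])
```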
Core architecture rests on three pillars: GGUF portability, MoE efficiency, and quantization economics.
MoE architecture deep dive
Understanding how Mixture-of-Experts achieves frontier quality with 4B-model decode speed.
Mixture-of-Experts (MoE) is the key architectural innovation that makes Gemma 4 economically practical. Instead of activating all parameters for every token (like dense models), MoE routes each token through a subset of specialized "expert" networks.
Gemma 4 26B-A4B specifics: The model contains 128 total experts, but only 8 experts + 1 shared expert are activated per token. This means only 3.8B parameters fire per token out of 25.2B total.
Practical implication: You get the reasoning quality of a 26B model with the decode speed of a ~4B model. Memory bandwidth during generation is proportional to active parameters, not total parameters—this is why MoE models achieve much higher tokens/second than their total parameter count would suggest.
| Property | Value | Implication |
|---|---|---|
| Total parameters | 25.2B | Model capacity / knowledge stored |
| Active parameters per token | 3.8B | Determines decode throughput and memory bandwidth |
| Total experts | 128 | Pool of specialized subnetworks |
| Active experts per token | 8 + 1 shared | Router selects best 8, shared always active |
| Efficiency ratio | ~6.6x | Quality of 25B, speed of 4B |
Why this matters for serving
Token generation speed is limited by memory bandwidth, not compute. With only 3.8B active parameters, Gemma 4 achieves 50-60+ tokens/second on L40S—comparable to much smaller models while maintaining frontier-level quality.
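A back-of-envelope sketch of the bandwidth argument. The parameter counts and L40S bandwidth come from this guide; the bits-per-weight figure for a Q4_K-family quantization is an approximation.

```python
# Per-token weight traffic bounds decode speed; MoE reads only the active experts.
active_params = 3.8e9          # active parameters per token (from this guide)
total_params = 25.2e9          # total parameters (from this guide)
bits_per_weight = 4.5          # rough average for a Q4_K-family quant (assumption)
bandwidth_bytes = 864e9        # L40S memory bandwidth (from this guide)

moe_bytes = active_params * bits_per_weight / 8     # ~2.1 GB read per decoded token
dense_bytes = total_params * bits_per_weight / 8    # ~14.2 GB if every weight fired

print(f"per-token weight traffic, MoE vs dense: {moe_bytes/1e9:.1f} GB vs {dense_bytes/1e9:.1f} GB")
print(f"traffic ratio: {dense_bytes/moe_bytes:.1f}x")              # ~6.6x, matching the table above
print(f"bandwidth-only ceiling (MoE): {bandwidth_bytes/moe_bytes:.0f} tok/s")
# Measured throughput (~50-60 tok/s here) sits well below this raw ceiling because
# KV-cache reads, activations, kernel overhead, and sampling also take time; the
# point is that the ceiling scales with active, not total, parameters.
```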
VRAM consideration
All 25.2B parameters must still be loaded into VRAM (weights don't know which tokens will arrive). The memory savings come from the reduced KV cache growth and faster inference, not from smaller weight storage.
Model and precision context
GGUF metadata portability, Unsloth Dynamic quantization, and complete VRAM accounting.
GGUF format embeds architecture + tokenizer metadata directly with weights, which reduces deployment drift and keeps runtime assumptions explicit. The format is llama.cpp native, requiring no conversion steps.
Unsloth Dynamic 2.0 quantization selectively applies higher precision to attention and critical layers while aggressively quantizing less sensitive experts. The UD-Q4_K_XL variant (see the published GGUF catalog) achieves better quality than standard Q4_K_M at similar size—17.1 GB fits comfortably on L40S with room for large KV caches.
Vision encoder: The mmproj-F32.gguf (~2 GB) handles image encoding. It's loaded separately via --mmproj and adds to the total VRAM footprint.
| VRAM Component | Size | Notes |
|---|---|---|
| Model weights (UD-Q4_K_XL) | ~17.1 GB | All 25.2B parameters, quantized |
| Multimodal projector (F32) | ~2.0 GB | Vision encoder for image input |
| KV cache (q8_0, 65K total ctx) | ~3.0 GB | 4 slots × 16K per slot baseline |
| CUDA buffers / scratch | ~3.0 GB | Kernel working memory |
| Total active | ~25.1 GB | Steady-state usage |
| Headroom on L40S (48 GB) | ~23 GB | Buffer for burst/tuning |
| Pillar | Practical effect in production | Operational implication |
|---|---|---|
| GGUF packaging | Single artifact with model + tokenizer metadata | Lower config drift across local, staging, and Modal runtime. |
| MoE sparsity | High capacity with lower active compute per token | Better cost-quality position for interactive workloads. |
| UD-Q4_K_XL quantization | Large memory reduction while retaining utility | Makes single-GPU L40S-class serving feasible. |
Benchmark results
Massive improvements over Gemma 3 across reasoning, math, and coding.
Gemma 4 26B-A4B represents a generational leap over Gemma 3 27B. The MoE architecture combined with improved training yields dramatic benchmark gains—particularly in math, coding, and scientific reasoning where the model shows 3-4x improvements in some categories. Figures in the table below follow the public reporting in the Gemma 4 launch post.
| Benchmark | Gemma 4 26B-A4B | Gemma 3 27B | Improvement |
|---|---|---|---|
| MMLU Pro (multi-task language understanding) | 82.6% | 67.6% | +15.0 pts |
| AIME 2026, no tools (competition mathematics) | 88.3% | 20.8% | +67.5 pts |
| LiveCodeBench v6 (code generation) | 77.1% | 29.1% | +48.0 pts |
| Codeforces ELO (competitive programming) | 1718 | 110 | +1608 |
| GPQA Diamond (graduate-level science) | 82.3% | 42.4% | +39.9 pts |
Math and reasoning
AIME improvement from 20.8% to 88.3% represents a 4.2x gain—moving from below-average to competition-level performance.
Code generation
LiveCodeBench nearly triples. Codeforces ELO jumps from negligible (110) to Expert-level (1718).
Scientific reasoning
GPQA Diamond doubles, indicating strong graduate-level STEM comprehension.
Infrastructure topology
Modal volumes, ephemeral containers, and a warmed snapshot loop.
- Volume model-cache: gemma-4-26B-A4B-it-*.gguf, mmproj-F32.gguf
- Volume llama-server-binary: pinned build from LLAMA_CPP_TAG
- Container (llama-server process): launch with --no-mmap, health check + warmup, capture GPU snapshot, restore after scale-down
- Endpoint: /v1/chat/completions (OpenAI-compatible)

This topology intentionally separates long-lived artifacts (weights + pinned binary) from ephemeral compute using Modal Volumes. That split is what keeps redeploys and horizontal scaling from repeatedly paying full artifact fetch costs.
GPU memory snapshotting happens after health + warmup, so restores resume a ready process, not a partially initialized process.
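A compressed sketch of that loop as a Modal app. The app, volume, and path names are illustrative, and the GPU-snapshot option follows Modal's published snapshot guide at the time of writing; verify parameter names against the linked Modal docs before relying on them.

```python
import subprocess
import modal

app = modal.App("gemma4-llama-server")  # illustrative name
model_vol = modal.Volume.from_name("model-cache", create_if_missing=True)
bin_vol = modal.Volume.from_name("llama-server-binary", create_if_missing=True)

@app.cls(
    gpu="L40S",
    volumes={"/models": model_vol, "/bin/llama": bin_vol},
    enable_memory_snapshot=True,                          # snapshot the warmed process
    experimental_options={"enable_gpu_snapshot": True},   # per Modal's GPU snapshot guide; may change
)
class LlamaServer:
    @modal.enter(snap=True)
    def start(self):
        # --no-mmap keeps weights in process memory so checkpoint/restore stays stable.
        self.proc = subprocess.Popen([
            "/bin/llama/llama-server",
            "-m", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
            "--mmproj", "/models/mmproj-F32.gguf",
            "--no-mmap",
            "--n-gpu-layers", "999",
            "--host", "0.0.0.0", "--port", "8080",
        ])
        # ...poll the health endpoint and send a warmup request here, so the
        # snapshot captures a ready server rather than a partially initialized one.

    @modal.web_server(port=8080)
    def serve(self):
        # Requests are proxied straight to the llama-server subprocess; no extra app code.
        pass
```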
Serving stack
Where latency/cost behavior comes from in this architecture.
- llama-server subprocess model: clear health lifecycle and controlled restarts (llama.cpp server).
- Pinned binary build: reproducible behavior tied to a known llama.cpp commit hash.
- Volume-backed assets: no repeated model download during serve-time startup (Modal Volumes).
- OpenAI-compatible API surface: straightforward client/SDK integration.
Two-image strategy (builder vs runtime)
Builder image carries full toolchain for reproducible compile; runtime image ships only serving dependencies. This reduces attack surface, cold boot overhead, and snapshot size.
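A sketch of that split using Modal image definitions. The CUDA base tags and the LLAMA_CPP_TAG value are placeholders; the build commands follow llama.cpp's standard CMake flow.

```python
import modal

LLAMA_CPP_TAG = "b4500"  # placeholder: pin to the tag/commit you validated

# Builder image: full CUDA toolchain, used once to compile llama-server.
builder_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential", "libcurl4-openssl-dev")
    .run_commands(
        f"git clone --depth 1 --branch {LLAMA_CPP_TAG} https://github.com/ggml-org/llama.cpp /llama.cpp",
        # L40S is compute capability 8.9; pinning the arch lets the build run on CPU-only builders.
        "cmake -S /llama.cpp -B /llama.cpp/build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89",
        "cmake --build /llama.cpp/build --target llama-server -j",
    )
)

# Runtime image: slim base plus only the shared libraries llama-server links against.
# The compiled binary itself is copied into the llama-server-binary Volume, not baked in.
runtime_image = modal.Image.from_registry(
    "nvidia/cuda:12.4.1-runtime-ubuntu22.04", add_python="3.11"
).apt_install("libcurl4", "libgomp1")
```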
Scale-to-zero and snapshots
Cold-path economics and why --no-mmap is non-negotiable in this setup.
Snapshot reliability depends on launching llama-server with --no-mmap. Memory-mapped startup may break restore semantics and collapse the cold-start strategy; see Modal's GPU memory snapshot guide for platform semantics.
Traditional cold path
Weight load + warmup each restart; often multi-minute startup.
Snapshot restore path
Restore pre-warmed process state, substantially reducing first-response wait after idle.
| Phase | Typical duration class | Why it matters |
|---|---|---|
| Provision + mount | seconds | Common to both cold and restore paths. |
| Weight load + warmup | minutes | Dominant cost in traditional startup flow. |
| Snapshot restore | sub-minute class | Primary latency/cost win of this architecture. |
Performance targets
Throughput optimizations and expected results on L40S.
| Optimization | Flag | Impact |
|---|---|---|
| Flash attention | --flash-attn on | O(n) attention memory instead of O(n²) |
| Full GPU offload | --n-gpu-layers 999 | 10-50x vs CPU-only inference |
| Batch processing | --batch-size 2048 | Maximizes prefill throughput |
| Quantized KV cache | --cache-type-k/v q8_0 | 50% VRAM savings vs f16 |
| Parallel slots | --parallel 4 | 4 concurrent requests per container |
| Zero-proxy routing | @modal.web_server | No FastAPI/httpx overhead |
| MoE efficiency | Architecture | 3.8B active / 25.2B total → decode speed of a ~4B model |
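The flags above, assembled into one launch argument list in the same style as the deployment sketch earlier. Paths are placeholders and flag spellings can drift between llama.cpp releases, so treat this as a reference point rather than a canonical command.

```python
LLAMA_SERVER_ARGS = [
    "/bin/llama/llama-server",
    "-m", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
    "--mmproj", "/models/mmproj-F32.gguf",
    "--no-mmap",                 # required for snapshot/restore in this design
    "--flash-attn", "on",        # O(n) attention memory
    "--n-gpu-layers", "999",     # full GPU offload
    "--batch-size", "2048",      # prefill throughput
    "--cache-type-k", "q8_0",    # quantized KV cache (keys)
    "--cache-type-v", "q8_0",    # quantized KV cache (values)
    "--ctx-size", "65536",       # 4 slots x 16K baseline
    "--parallel", "4",           # concurrent request slots
    "--metrics",                 # Prometheus /metrics endpoint
    "--host", "0.0.0.0",
    "--port", "8080",
]
```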
| Metric | Expected range | Context |
|---|---|---|
| Decode throughput | ~50-60+ tok/s | Per slot on L40S (864 GB/s memory bandwidth). MoE architecture enables this despite 26B total params. |
| Prefill (prompt processing) | ~2000+ tok/s | With batch-size 2048. Scales with prompt length. |
| Warm TTFT | sub-second | Time to first token on warm container. Varies with prompt size. |
| Cold start (with snapshots) | 5-15 seconds | Snapshot restore path. Without snapshots: 60-120 seconds. |
| Concurrent capacity | 4 slots × 3 containers = 12 | Per default configuration. Tunable via --parallel and max_containers. |
Tuning levers
- More concurrency, less context: `--parallel 8 --ctx-size 131072` (8 slots × 16K each)
- More context, less concurrency: `--parallel 2 --ctx-size 131072` (2 slots × 64K each)
- Aggressive KV cache: `--cache-type-k q4_0 --cache-type-v q4_0` (75% VRAM savings vs f16)
These values are directional. Real numbers shift with prompt length, tool-calling patterns, vision usage, and concurrency configuration. Always benchmark on your workload before setting SLOs.
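A quick sizing sketch for the KV-cache lever, scaled from the baseline this guide documents (~3.0 GB at 65,536 tokens of total context with q8_0). The relative costs of f16, q8_0, and q4_0 follow the savings figures quoted above.

```python
# Derive a per-token KV cost from the documented baseline, then scale it.
BASELINE_GB, BASELINE_CTX = 3.0, 65_536          # q8_0, 4 slots x 16K (from this guide)
gb_per_token_q8 = BASELINE_GB / BASELINE_CTX     # ~46 KB per token at q8_0

def kv_cache_gb(total_ctx: int, cache_type: str = "q8_0") -> float:
    """Rough KV-cache VRAM for a given total context and cache quantization."""
    scale = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}[cache_type]  # relative to q8_0
    return total_ctx * gb_per_token_q8 * scale

print(f"{kv_cache_gb(131_072, 'q8_0'):.1f} GB")  # --parallel 8 --ctx-size 131072
print(f"{kv_cache_gb(131_072, 'q4_0'):.1f} GB")  # aggressive KV quantization
print(f"{kv_cache_gb(65_536, 'f16'):.1f} GB")    # unquantized baseline for comparison
```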
Strategic positioning
Quick framing versus larger online-serving stacks; full detail lives in Operate & compare.
| Model stack | Strength | Typical trade-off |
|---|---|---|
| Gemma-4 GGUF + llama.cpp | Portable, cost-aware, low-idle overhead with snapshots | Less raw max throughput than very large multi-GPU online stacks |
| GLM-5.1 FP8 + SGLang | High-concurrency large-scale serving | Significantly higher infra complexity and spend envelope |
| Dense large-model serving | Mature optimization ecosystem | Less parameter efficiency than MoE for similar capacity classes |
Practical framing: GLM-scale stacks optimize for maximum throughput at very high infra complexity; this Gemma stack optimizes for deployability, responsiveness after idle, and lower operational surface area.
Design choices
The deliberate defaults this guide encodes.
- Engine: llama.cpp for GGUF-native operation and straightforward process control.
- Cold starts: scale-to-zero with snapshot restore instead of always-on baseline spend.
- Launch safety: `--no-mmap` is treated as required, not optional.
- Integration surface: OpenAI-compatible endpoint for SDK/client compatibility.
- Reasoning mode support: align with Google's thinking docs and server-side budget wiring; keep compatibility with `enable_thinking` and budget controls, but treat response shape differences as a client concern.
- Tool-calling safeguards: assume partial spec mismatch and enforce server-side validation/retry patterns (see the sketch after this list).
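A minimal sketch of that validation/retry safeguard, reusing the hypothetical get_weather tool from the earlier sketch: parse and validate tool-call arguments before executing anything, and let the caller append an error turn and retry when validation fails.

```python
import json

REQUIRED_KEYS = {"get_weather": {"city"}}  # hypothetical tool schema from the earlier sketch

def valid_tool_calls(message):
    """Return (name, args) pairs only if every tool call parses and has its required keys."""
    calls = []
    for call in message.tool_calls or []:
        try:
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            return None  # signal the caller to append an error turn and retry the request
        if not REQUIRED_KEYS.get(call.function.name, set()) <= args.keys():
            return None
        calls.append((call.function.name, args))
    return calls
```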
References
Same sources as the inline links above; open in a new tab for full context.
- Google DeepMind Blog: Gemma 4 (deepmind.google)
- Gemma 4 Prompt Formatting Guide (ai.google.dev)
- Gemma 4 Thinking Capabilities (ai.google.dev)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)
- llama.cpp GitHub (github.com)
- llama.cpp Gemma 4 Parser, PR #21418 (github.com)
- llama.cpp Reasoning Budget, PR #20297 (github.com)
- llama.cpp Responses API, PR #18486 (github.com)
- Modal GPU Memory Snapshots (modal.com)
- Modal Volumes (modal.com)
- Modal Web Server (modal.com)