Gemma-4-26B-A4B-it-GGUF on Modal

Stack overview

Model (MoE, GGUF), VRAM, benchmarks, and how the serving stack is put together.

Production deployment of gemma-4-26B-A4B-it-GGUF on Modal with llama.cpp, GPU memory snapshots, and scale-to-zero lifecycle control. This overview explains why the stack works, what hardware budget it needs, and which design decisions matter before deep tuning.

01

At a glance

Core specs, runtime choices, and deployment envelope for this guide.

Item | Value | Why it matters
Model | gemma-4-26B-A4B-it-GGUF | MoE architecture tuned for strong reasoning with smaller active compute footprint.
Repo | unsloth/gemma-4-26B-A4B-it-GGUF | Published GGUF artifact used directly by llama.cpp serving flow.
Engine | llama-server (llama.cpp) | OpenAI-compatible HTTP surface and simple container process model.
Quantization | UD-Q4_K_XL | Aggressive memory reduction with practical quality retention.
Cold-start strategy | Modal memory snapshots | Restore warmed process instead of full weight-load startup every time.
Critical launch flag | --no-mmap | Mandatory for stable checkpoint/restore behavior in this deployment design.

Model lineage

Gemma 4 instruction-tuned MoE (2026 release); licensing and catalog details on the Hugging Face model card.

Serving objective

Interactive response profile with scale-to-zero economics and deterministic operational behavior.

Default hardware envelope

L40S (48GB) is the baseline in deploy.py and keeps practical headroom for context, slots, and vision projection.

02

Full capabilities

Everything this deployment can do out of the box.

Core inference

  • Text chat — streaming and non-streaming, OpenAI-compatible API
  • 256K token context — long-form documents, repos, conversations
  • 140+ language support — multilingual out of the box
  • Flash attention — avoids materializing the full O(n²) attention matrix, keeping large-context inference memory-efficient
  • JSON structured output via response_format (Responses-style support in llama.cpp); a minimal client sketch follows this list
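
The core inference surface is plain OpenAI wire format. A minimal client sketch, assuming a placeholder endpoint URL, bearer token, and model name (llama-server serves whatever single model it was launched with):

```python
# Placeholder endpoint and credentials; llama-server serves the single model it was
# launched with, so the model name here is advisory.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-modal-app.modal.run/v1",  # placeholder
    api_key="your-bearer-token",                     # placeholder
)

# Streaming text chat
stream = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Summarize the GGUF format in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

# JSON structured output via response_format
resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Describe Python as JSON with keys name and year."}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```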

Vision / multimodal

  • Image input via image_url content blocks (URL or base64) — see the request sketch after this list
  • Multiple images in a single message
  • Image captioning, OCR, chart/diagram reasoning
  • Vision + thinking combined analysis
  • Audio not supported (E2B/E4B variants only)
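
Image input rides the same chat endpoint as content blocks. A request sketch, again with placeholder endpoint, token, and file name; both remote URLs and base64 data URIs are accepted:

```python
# Image input via image_url content blocks; endpoint, token, and file name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            # A second image in the same message is also accepted:
            # {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```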

Reasoning and thinking

  • Adaptive thinking — model decides when reasoning is needed (Google AI thinking guide).
  • Per-request toggle via enable_thinking (see the sketch after this list)
  • Separate reasoning_content field in responses
  • Thinking budget control with thinking_budget_tokens
  • Interleaved thinking — preserves reasoning between tool calls
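
Thinking controls are non-standard fields, so with the OpenAI SDK they travel in extra_body; how reasoning_content is surfaced can vary by SDK and llama.cpp version, so read it defensively. A sketch under those assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")

resp = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9? Reason it through."}],
    extra_body={
        "enable_thinking": True,          # per-request reasoning toggle
        "thinking_budget_tokens": 1024,   # cap on reasoning tokens
    },
)

msg = resp.choices[0].message
print("answer:", msg.content)
# reasoning_content is a non-standard field, so read it defensively.
print("reasoning:", getattr(msg, "reasoning_content", None))
```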

Tool / function calling

  • Native Gemma 4 parser with OpenAI-format tool_calls (upstream parser PR); a request sketch follows this list
  • Single and parallel tool calls
  • Multi-step tool chains
  • tool_choice="auto" works reliably
  • tool_choice="required" has known gaps

IDE and client integration

  • Cursor IDE Agent mode — full tool calling via /v1/responses
  • Cursor Ask mode — Q&A via /v1/chat/completions
  • Codex CLI compatible — OpenAI Responses API wire format
  • LangChain, LiteLLM, any OpenAI SDK client
  • TypeScript, Python, Go, curl — all supported

Deployment operations

  • GPU memory snapshots — 5-15s cold starts instead of 60-120s
  • Scale-to-zero economics with fast restore
  • Prometheus metrics on /metrics
  • Health endpoint for load balancers
  • Bearer token authentication (probe sketch after this list)
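
Operational probes in the same spirit, using only the standard library; whether /metrics and /v1/models sit behind the bearer token depends on how the server is launched, so treat the auth split here as an assumption:

```python
import urllib.request

BASE = "https://your-modal-app.modal.run"   # placeholder
TOKEN = "your-bearer-token"                 # placeholder

def get(path: str, auth: bool = False) -> str:
    req = urllib.request.Request(BASE + path)
    if auth:
        req.add_header("Authorization", f"Bearer {TOKEN}")
    with urllib.request.urlopen(req, timeout=30) as r:
        return r.read().decode()

print(get("/health"))                 # readiness probe used by load balancers
print(get("/metrics")[:400])          # Prometheus exposition text (truncated here)
print(get("/v1/models", auth=True))   # authenticated API surface
```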

Core architecture rests on three pillars: GGUF portability, MoE efficiency, and quantization economics.

03

MoE architecture deep dive

Understanding how Mixture-of-Experts achieves frontier quality with 4B-model decode speed.

Mixture-of-Experts (MoE) is the key architectural innovation that makes Gemma 4 economically practical. Instead of activating all parameters for every token (like dense models), MoE routes each token through a subset of specialized "expert" networks.

Gemma 4 26B-A4B specifics: The model contains 128 total experts, but only 8 experts + 1 shared expert are activated per token. This means only 3.8B parameters fire per token out of 25.2B total.

Practical implication: You get the reasoning quality of a 26B model with the decode speed of a ~4B model. Memory bandwidth during generation is proportional to active parameters, not total parameters—this is why MoE models achieve much higher tokens/second than their total parameter count would suggest.

Property | Value | Implication
Total parameters | 25.2B | Model capacity / knowledge stored
Active parameters per token | 3.8B | Determines decode throughput and memory bandwidth
Total experts | 128 | Pool of specialized subnetworks
Active experts per token | 8 + 1 shared | Router selects best 8, shared always active
Efficiency ratio | ~6.6x | Quality of 25B, speed of 4B
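
To make the routing concrete, here is an illustrative top-k MoE layer in PyTorch. It is a toy sketch of the mechanism described above (router scores, top-8 selection, one always-on shared expert), not the Gemma 4 implementation, and the hidden size is deliberately tiny:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_EXPERTS, TOP_K, D = 128, 8, 64       # 128 experts, 8 routed per token; toy hidden size

experts = nn.ModuleList(nn.Linear(D, D) for _ in range(N_EXPERTS))
shared_expert = nn.Linear(D, D)        # the "+1 shared" expert, always active
router = nn.Linear(D, N_EXPERTS)

@torch.inference_mode()
def moe_layer(x: torch.Tensor) -> torch.Tensor:      # x: [tokens, D]
    logits = router(x)                                # [tokens, N_EXPERTS]
    weights, idx = torch.topk(logits, TOP_K, dim=-1)  # pick the best 8 experts per token
    weights = F.softmax(weights, dim=-1)              # normalize over the selected experts
    out = shared_expert(x)
    for t in range(x.shape[0]):                       # naive per-token loop, for clarity only
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])            # only TOP_K of N_EXPERTS ever run
    return out

print(moe_layer(torch.randn(4, D)).shape)             # torch.Size([4, 64])
```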

Why this matters for serving

Token generation speed is limited by memory bandwidth, not compute. With only 3.8B active parameters, Gemma 4 achieves 50-60+ tokens/second on L40S—comparable to much smaller models while maintaining frontier-level quality.

VRAM consideration

All 25.2B parameters must still be loaded into VRAM (the weights don't know which tokens will arrive). The efficiency gains show up as lower per-token memory bandwidth and faster decode, not as smaller weight storage.

04

Model and precision context

GGUF metadata portability, Unsloth Dynamic quantization, and complete VRAM accounting.

GGUF format embeds architecture + tokenizer metadata directly with weights, which reduces deployment drift and keeps runtime assumptions explicit. The format is llama.cpp native, requiring no conversion steps.

Unsloth Dynamic 2.0 quantization selectively applies higher precision to attention and critical layers while aggressively quantizing less sensitive experts. The UD-Q4_K_XL variant (see the published GGUF catalog) achieves better quality than standard Q4_K_M at similar size—17.1 GB fits comfortably on L40S with room for large KV caches.

Vision encoder: The mmproj-F32.gguf (~2 GB) handles image encoding. It's loaded separately via --mmproj and adds to the total VRAM footprint.

VRAM component | Size | Notes
Model weights (UD-Q4_K_XL) | ~17.1 GB | All 25.2B parameters, quantized
Multimodal projector (F32) | ~2.0 GB | Vision encoder for image input
KV cache (q8_0, 65K total ctx) | ~3.0 GB | 4 slots × 16K per slot baseline
CUDA buffers / scratch | ~3.0 GB | Kernel working memory
Total active | ~25.1 GB | Steady-state usage
Headroom on L40S (48 GB) | ~23 GB | Buffer for burst/tuning
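
The KV cache row is the one that moves when you tune ctx-size and --parallel. A back-of-envelope estimator sketch; the layer/head/dim constants below are placeholders chosen to roughly reproduce the ~3.0 GB figure above, not published Gemma 4 dimensions, so read the real values from the GGUF metadata before relying on the output:

```python
# PLACEHOLDER architecture constants, tuned only to land near the ~3.0 GB row above;
# read the real layer/head/dim values from the GGUF metadata before trusting the output.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 42, 4, 128
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.06, "q4_0": 0.56}   # approx., incl. block scales

def kv_cache_gb(total_ctx_tokens: int, cache_type: str = "q8_0") -> float:
    # K and V, per layer, per KV head, per head dimension, per token
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[cache_type]
    return total_ctx_tokens * per_token / 1e9

print(f"{kv_cache_gb(65_536, 'q8_0'):.1f} GB")    # baseline: 4 slots x 16K at q8_0
print(f"{kv_cache_gb(131_072, 'q4_0'):.1f} GB")   # the aggressive lever from section 09
```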

Pillar | Practical effect in production | Operational implication
GGUF packaging | Single artifact with model + tokenizer metadata | Lower config drift across local, staging, and Modal runtime.
MoE sparsity | High capacity with lower active compute per token | Better cost-quality position for interactive workloads.
UD-Q4_K_XL quantization | Large memory reduction while retaining utility | Makes single-GPU L40S-class serving feasible.

05

Benchmark results

Massive improvements over Gemma 3 across reasoning, math, and coding.

Gemma 4 26B-A4B represents a generational leap over Gemma 3 27B. The MoE architecture combined with improved training yields dramatic benchmark gains—particularly in math, coding, and scientific reasoning where the model shows 3-4x improvements in some categories. Figures in the table below follow the public reporting in the Gemma 4 launch post.

Benchmark | Gemma 4 26B-A4B | Gemma 3 27B | Improvement
MMLU Pro (multi-task language understanding) | 82.6% | 67.6% | +15.0%
AIME 2026, no tools (competition mathematics) | 88.3% | 20.8% | +67.5%
LiveCodeBench v6 (code generation) | 77.1% | 29.1% | +48.0%
Codeforces ELO (competitive programming) | 1718 | 110 | +1608
GPQA Diamond (graduate-level science) | 82.3% | 42.4% | +39.9%

Math and reasoning

AIME improvement from 20.8% to 88.3% represents a 4.2x gain—moving from below-average to competition-level performance.

Code generation

LiveCodeBench nearly triples. Codeforces ELO jumps from negligible (110) to Expert-level (1718).

Scientific reasoning

GPQA Diamond doubles, indicating strong graduate-level STEM comprehension.

06

Infrastructure topology

Modal volumes, ephemeral containers, and a warmed snapshot loop.

Modal Cloud
  Volume: model-cache
    • gemma-4-26B-A4B-it-*.gguf
    • mmproj-F32.gguf
  Volume: llama-server-binary
    • pinned build from LLAMA_CPP_TAG
  Container: llama-server process
    • launch with --no-mmap
    • health check + warmup
    • capture GPU snapshot
    • restore after scale-down

Client -> /v1/chat/completions (OpenAI-compatible)

This topology intentionally separates long-lived artifacts (weights + pinned binary) from ephemeral compute using Modal Volumes. That split is what keeps redeploys and horizontal scaling from repeatedly paying full artifact fetch costs.

GPU memory snapshotting happens after health + warmup, so restores resume a ready process, not a partially initialized process.
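
A condensed sketch of how this topology maps onto Modal primitives. Names, paths, and the launch flag set are illustrative, the runtime image is simplified (a real one needs the CUDA runtime libraries), and GPU snapshot enablement may require an additional experimental option depending on your Modal client version; the reference deploy.py remains the source of truth:

```python
import subprocess
import time
import urllib.request

import modal

app = modal.App("gemma4-llamacpp")

runtime_image = modal.Image.debian_slim().apt_install("libcurl4", "libgomp1")
weights = modal.Volume.from_name("model-cache", create_if_missing=True)
binary = modal.Volume.from_name("llama-server-binary", create_if_missing=True)

LAUNCH = [
    "/opt/llama/llama-server",                                # pinned build from the volume
    "--model", "/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",  # placeholder filename
    "--mmproj", "/models/mmproj-F32.gguf",
    "--no-mmap",                                              # required for snapshot/restore (section 08)
    "--flash-attn", "on",
    "--n-gpu-layers", "999",
    "--ctx-size", "65536", "--parallel", "4",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "--batch-size", "2048",
    "--host", "0.0.0.0", "--port", "8080",
]

@app.cls(
    gpu="L40S",
    image=runtime_image,
    volumes={"/models": weights, "/opt/llama": binary},
    enable_memory_snapshot=True,   # GPU-side snapshots may need an extra experimental option
)
class LlamaServer:
    @modal.enter(snap=True)
    def start(self):
        self.proc = subprocess.Popen(LAUNCH)
        while True:                # block until warmed so the snapshot captures a ready server
            try:
                urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2)
                break
            except Exception:
                time.sleep(1)

    @modal.web_server(port=8080)
    def serve(self):
        pass                       # llama-server already listens; Modal proxies the port
```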

07

Serving stack

Where latency/cost behavior comes from in this architecture.

  • llama-server subprocess model: clear health lifecycle and controlled restarts (llama.cpp server).
  • Pinned binary build: reproducible behavior tied to a known llama.cpp commit hash.
  • Volume-backed assets: no repeated model download during serve-time startup (Modal Volumes).
  • OpenAI-compatible API surface: straightforward client/SDK integration.

Two-image strategy (builder vs runtime)

Builder image carries full toolchain for reproducible compile; runtime image ships only serving dependencies. This reduces attack surface, cold boot overhead, and snapshot size.
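
One way to realize the builder half of that split, sketched under assumptions: the builder image carries the CUDA toolchain, compiles llama-server at a pinned tag, and persists the binary to the llama-server-binary volume that the slim runtime image mounts. The tag, image base, and paths are placeholders:

```python
import modal

app = modal.App("gemma4-llamacpp-build")
binary = modal.Volume.from_name("llama-server-binary", create_if_missing=True)

LLAMA_CPP_TAG = "b4500"   # placeholder; pin to a known-good tag or commit

builder_image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "cmake", "build-essential", "libcurl4-openssl-dev")
)

@app.function(image=builder_image, volumes={"/out": binary}, timeout=3600)
def build_llama_server():
    import subprocess
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", LLAMA_CPP_TAG,
         "https://github.com/ggml-org/llama.cpp", "/src"], check=True)
    subprocess.run(["cmake", "-B", "/src/build", "-S", "/src", "-DGGML_CUDA=ON"], check=True)
    subprocess.run(["cmake", "--build", "/src/build", "--target", "llama-server", "-j"], check=True)
    subprocess.run(["cp", "/src/build/bin/llama-server", "/out/llama-server"], check=True)
    binary.commit()    # persist the pinned build for runtime containers
```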

08

Scale-to-zero and snapshots

Cold-path economics and why --no-mmap is non-negotiable in this setup.

Snapshot reliability depends on launching llama-server with --no-mmap. Memory-mapped startup may break restore semantics and collapse the cold-start strategy; see Modal's GPU memory snapshot guide for platform semantics.

Traditional cold path

Weight load + warmup each restart; often multi-minute startup.

Snapshot restore path

Restore pre-warmed process state, substantially reducing first-response wait after idle.

Phase | Typical duration class | Why it matters
Provision + mount | seconds | Common to both cold and restore paths.
Weight load + warmup | minutes | Dominant cost in traditional startup flow.
Snapshot restore | sub-minute class | Primary latency/cost win of this architecture.

09

Performance targets

Throughput optimizations and expected results on L40S.

Optimization | Flag | Impact
Flash attention | --flash-attn on | Avoids materializing the O(n²) attention matrix for long contexts
Full GPU offload | --n-gpu-layers 999 | 10-50x vs CPU-only inference
Batch processing | --batch-size 2048 | Maximizes prefill throughput
Quantized KV cache | --cache-type-k/v q8_0 | 50% VRAM savings vs f16
Parallel slots | --parallel 4 | 4 concurrent requests per container
Zero-proxy routing | @modal.web_server | No FastAPI/httpx overhead
MoE efficiency | Architecture | 3.8B active / 25.2B total → decode speed of a ~4B model

Metric | Expected range | Context
Decode throughput | ~50-60+ tok/s | Per slot on L40S (864 GB/s memory bandwidth); MoE architecture enables this despite 26B total params.
Prefill (prompt processing) | ~2000+ tok/s | With batch-size 2048; scales with prompt length.
Warm TTFT | sub-second | Time to first token on a warm container; varies with prompt size.
Cold start (with snapshots) | 5-15 seconds | Snapshot restore path; 60-120 seconds without snapshots.
Concurrent capacity | 4 slots × 3 containers = 12 | Per default configuration; tunable via --parallel and max_containers.

Tuning levers

  • More concurrency, less context per slot: --parallel 8 --ctx-size 131072 (8 slots × 16K each)
  • More context per slot, less concurrency: --parallel 2 --ctx-size 131072 (2 slots × 64K each)
  • Aggressive KV cache quantization: --cache-type-k q4_0 --cache-type-v q4_0 (75% VRAM savings vs f16)

These values are directional. Real numbers shift with prompt length, tool-calling patterns, vision usage, and concurrency configuration. Always benchmark on your workload before setting SLOs.
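
A quick workload-shaped probe for warm TTFT and decode rate over the streaming API. It measures end-to-end timings from the client side (network and chunking skew the numbers slightly), so use it as a sanity check against the table above rather than an SLO measurement; endpoint, token, and model name are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")

t0 = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Write a 300-word overview of the GGUF format."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
t_end = time.perf_counter()

if first_token_at is not None and chunks > 1:
    print(f"TTFT        ~{first_token_at - t0:.2f} s")
    print(f"decode rate ~{chunks / (t_end - first_token_at):.1f} chunks/s (≈ tok/s)")
```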

10

Strategic positioning

Quick framing versus larger online-serving stacks; full detail lives in Operate & compare.

Model stack | Strength | Typical trade-off
Gemma-4 GGUF + llama.cpp | Portable, cost-aware, low idle overhead with snapshots | Less raw max throughput than very large multi-GPU online stacks
GLM-5.1 FP8 + SGLang | High-concurrency large-scale serving | Significantly higher infra complexity and spend envelope
Dense large-model serving | Mature optimization ecosystem | Less parameter efficiency than MoE for similar capacity classes

Practical framing: GLM-scale stacks optimize for maximum throughput at very high infra complexity; this Gemma stack optimizes for deployability, responsiveness after idle, and lower operational surface area.

11

Design choices

The deliberate defaults this guide encodes.

  • Engine: llama.cpp for GGUF-native operation and straightforward process control.
  • Cold starts: scale-to-zero with snapshot restore instead of always-on baseline spend.
  • Launch safety: --no-mmap is treated as required, not optional.
  • Integration surface: OpenAI-compatible endpoint for SDK/client compatibility.
  • Reasoning mode support: align with Google's thinking docs and server-side budget wiring; keep compatibility with enable_thinking and budget controls, but treat response shape differences as a client concern.
  • Tool-calling safeguards: assume partial spec mismatch and enforce server-side validation/retry patterns (a validation sketch follows this list).
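
A sketch of that last safeguard: validate tool-call arguments against the declared JSON schema and re-prompt once on mismatch. The jsonschema dependency and the retry policy are assumptions for illustration, not part of the reference deployment:

```python
import json
from jsonschema import validate, ValidationError   # pip install jsonschema
from openai import OpenAI

client = OpenAI(base_url="https://your-modal-app.modal.run/v1", api_key="your-bearer-token")
SCHEMA = {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
TOOLS = [{"type": "function", "function": {"name": "get_weather", "parameters": SCHEMA}}]

def call_with_validation(messages, retries=1):
    for _ in range(retries + 1):
        resp = client.chat.completions.create(
            model="gemma-4-26B-A4B-it", messages=messages, tools=TOOLS, tool_choice="auto")
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg                      # plain answer, nothing to validate
        try:
            for call in msg.tool_calls:
                validate(json.loads(call.function.arguments), SCHEMA)
            return msg                      # arguments conform to the declared schema
        except (ValidationError, json.JSONDecodeError) as err:
            messages = messages + [         # feed the error back and retry
                msg,
                {"role": "user", "content": f"Tool arguments were invalid ({err}); try again."},
            ]
    raise RuntimeError("tool-call arguments failed validation after retries")
```
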
12

References