Gemma-4-26B-A4B-it-GGUF on Modal
APIs & clients
Vision, thinking, tool calling, Responses API, Cursor/Codex, frameworks, and limits.
Complete guide to Gemma 4's production feature surface: multimodal vision, adaptive thinking with budget control, native tool calling, Cursor IDE integration, and framework examples for popular clients. Unless noted otherwise, client wire formats follow the OpenAI Chat Completions API.
Vision / multimodal
Image input via the multimodal projector (mmproj-F32.gguf).
How it works: Vision is handled by a separate multimodal projector, mmproj-F32.gguf (~2 GB) from the published GGUF bundle, loaded via --mmproj. llama-server's internal libmtmd library handles image encoding.
Input formats: Standard OpenAI image_url content blocks work directly (OpenAI vision guide). Both remote URLs and base64 data URLs are supported.
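On the server side, the projector simply rides along with the model at launch. Below is a minimal sketch of the implied llama-server invocation; the model path, quant filename, and port are assumptions, while the --mmproj and --jinja flags and the slot math come from this page.

```python
# Illustrative launch sketch (not the actual Modal app). Filenames, paths,
# and port are assumptions; the flags mirror what this page describes.
import subprocess

subprocess.Popen([
    "llama-server",
    "-m", "/models/gemma-4-26b-a4b-it-Q4_K_M.gguf",  # hypothetical quant filename
    "--mmproj", "/models/mmproj-F32.gguf",           # ~2 GB vision projector from the GGUF bundle
    "--jinja",                                       # native Gemma 4 chat template / tool-call parser
    "--parallel", "4",                               # 4 slots per container
    "-c", str(4 * 16384),                            # total context, split across slots (~16K each)
    "--host", "0.0.0.0",
    "--port", "8000",
])
```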
What works
- ● Image captioning and description
- ● OCR / text extraction from images
- ● Object detection and spatial understanding
- ● Chart, diagram, and document analysis
- ● Multiple images in a single message
- ● Vision + thinking combined analysis
What doesn't work
- ● Audio input (E2B/E4B variants only)
- ● Video input
- ● Very large images may exceed context limits
```python
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

# Single image via URL
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
```python
import base64

# Base64 encoding (recommended for reliability)
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=1024,
)
```
```python
# Multiple images in one message
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_2}"}},
        {"type": "text", "text": "Compare these two images."},
    ],
}]
```
```python
# Vision + thinking combined
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "What breed is this dog? Analyze carefully."},
        ],
    }],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

raw = response.choices[0].message.model_dump()
print(f"Thinking: {raw.get('reasoning_content', '')[:200]}")
print(f"Answer: {response.choices[0].message.content}")
```
Tip: Base64 encoding client-side is more reliable than passing remote URLs (avoids server-side fetch issues). Disable thinking for simple image descriptions to get faster, more direct responses.
Adaptive thinking
Chain-of-thought reasoning with per-request control and budget limits.
How it works: Gemma 4 uses special tokens <|channel>thought\n and <channel|> to delimit thinking blocks (prompt formatting reference). llama-server parses these and exposes them via a separate reasoning_content field—never mixed into content.
Adaptive behavior: Setting enable_thinking: True does not guarantee thinking output. Gemma 4 uses adaptive reasoning (Google AI thinking doc) and may skip thinking for trivial questions. enable_thinking: False is deterministic and always suppresses thinking.
```python
# ENABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

msg = response.choices[0].message
raw = msg.model_dump()

thinking = raw.get("reasoning_content") or ""  # chain-of-thought
answer = msg.content or ""                     # final response

print(f"Thinking ({len(thinking)} chars): {thinking[:200]}...")
print(f"Answer: {answer}")
```
```python
# DISABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# reasoning_content will be empty
print(response.choices[0].message.content)
```
```python
# Thinking with token budget: limit thinking to 256 tokens
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,
    },
)

# Anthropic-compatible format also works:
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "thinking": {"type": "enabled", "budget_tokens": 256},
    },
)
```
```python
# Streaming with reasoning_content
stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True,
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    raw = delta.model_dump()

    reasoning = raw.get("reasoning_content") or ""
    content = delta.content or ""

    if reasoning:
        # Thinking tokens stream here first
        print(f"[think] {reasoning}", end="", flush=True)
    if content:
        # Response tokens stream after thinking completes
        print(content, end="", flush=True)
```
Tips for thinking
- ● For math, logic, coding, and multi-step problems: enable thinking and set max_tokens high (2048+)
- ● For simple factual questions, translations, and formatting: disable thinking for faster, more direct responses
- ● If you get empty content with non-empty reasoning_content, the model spent its entire budget on thinking; increase max_tokens (see the retry sketch after this list)
- ● Use thinking_budget_tokens to cap reasoning time for latency-sensitive use cases
- ● System prompts like "Think step by step" can encourage deeper thinking when enabled
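The empty-content case can be handled defensively in client code. A minimal retry sketch, assuming the `client` from the earlier examples; the retry factor is an arbitrary choice, not a server requirement:

```python
# Detect a response where thinking consumed the whole token budget
# (empty content, non-empty reasoning_content) and retry with more room.
def ask_with_retry(prompt: str, max_tokens: int = 1024) -> str:
    msg = None
    for attempt_max in (max_tokens, max_tokens * 4):
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=attempt_max,
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        msg = response.choices[0].message
        if msg.content:
            return msg.content  # got a final answer
        if not msg.model_dump().get("reasoning_content"):
            break  # empty for some other reason; a bigger budget won't help
        # otherwise: all tokens went to thinking, retry with a larger budget
    return msg.content or ""

print(ask_with_retry("Prove that the product of two odd numbers is odd."))
```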
Tool calling
Native Gemma 4 tool parser with OpenAI-format tool_calls.
How it works: With --jinja, llama-server enables Gemma 4's native tool calling parser (upstream Gemma 4 parser PR). The model emits special <|tool_call> tokens, which the server parses and returns as standard OpenAI-format tool_calls.
Interleaved thinking: With the interleaved template, the model preserves its chain of thought between tool calls. Previous thinking is stripped from conversation history automatically to prevent context bloat.
```python
# Define tools in standard OpenAI format
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
]
```
```python
import json

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Step 1: Send query with tools
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=messages,
    tools=tools,
    max_tokens=512,
)

msg = response.choices[0].message

if msg.tool_calls:
    # Step 2: Execute tool(s) locally
    messages.append(msg)
    for tc in msg.tool_calls:
        print(f"Tool call: {tc.function.name}({tc.function.arguments})")

        # Your tool implementation here
        result = {"location": "Tokyo", "temp": 22, "condition": "Sunny"}

        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(result),
        })

    # Step 3: Get final answer
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=messages,
        tools=tools,
        max_tokens=512,
    )
    print(response.choices[0].message.content)
else:
    # Model answered directly without tools
    print(msg.content)
```
Multi-step tool calls: The model can make multiple tool calls in sequence. Wrap the loop above in while msg.tool_calls: to handle chains automatically.
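A hedged sketch of that loop, using the `client` and `tools` from the examples above; the weather result is a stub and the chain-length cap is an arbitrary safety net:

```python
import json

messages = [{"role": "user", "content": "Compare the weather in Tokyo and Oslo."}]

for _ in range(8):  # cap the chain length as a safety net
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=messages,
        tools=tools,
        max_tokens=512,
    )
    msg = response.choices[0].message
    if not msg.tool_calls:
        break  # final answer reached

    messages.append(msg)
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = {"location": args.get("location"), "temp": 22, "condition": "Sunny"}  # stub
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(result),
        })

print(msg.content)
```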
tool_choice limitations
What works, what doesn't, and workarounds for reliable tool calling.
The table below compares this server's behavior against the official tool_choice contract; known gaps are called out explicitly.
| Value | Status | Behavior | Root cause |
|---|---|---|---|
| "auto" | ● Reliable | Model decides whether to call tools | Default server/model path |
| "none" | ● Partial | No parsed tool_calls, but may leak raw tokens into content | Parser suppresses extraction but not generation |
| "required" | ● Fails | Empty tool_calls and empty content | Grammar constraint activates but output can't be parsed |
| {"type":"function",...} | ● Ignored | Falls back to "auto" silently | Server parses tool_choice as string only |
Workaround: Force a specific tool
Only pass that one tool in the tools array. The model reliably calls the only available tool when the query needs it.
Workaround: Force any tool call
Use a system message: "You MUST use one of the provided tools. Do NOT answer directly." with tool_choice="auto".
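Both workarounds in code, as a sketch using the same `client` and `tools` defined earlier:

```python
# Workaround 1: force a specific tool by offering only that tool.
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[t for t in tools if t["function"]["name"] == "get_weather"],
    tool_choice="auto",
    max_tokens=512,
)

# Workaround 2: force some tool call via a strict system message.
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[
        {"role": "system", "content": "You MUST use one of the provided tools. Do NOT answer directly."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)
```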
Test results (19 tests, test_tools.py)
| Category | Tests | Result |
|---|---|---|
| Single tool calls (weather, calculator) | 2 | PASS |
| Parallel tool calls | 1 | PASS |
| Multi-step chains | 1 | PASS |
| No tool needed (answer directly) | 1 | PASS |
| tool_choice="none" | 1 | PASS (partial) |
| tool_choice="required" | 1 | FAIL (known) |
| Specific function object | 1 | FAIL (known) |
| Large tool set (select from 8) | 2 | PASS |
| Argument types (arrays, objects, numbers) | 3 | PASS |
| Thinking + tools combined | 1 | PASS |
| Streaming tool calls | 1 | PASS |
| Error recovery / tool_call_id roundtrip | 2 | PASS |
Cursor IDE integration
Full Agent mode support with tool calling via the Responses API.
This endpoint works as a custom model in Cursor IDE, including full Agent mode with tool calling (file read, file write, code search, terminal commands, etc.).
Setup steps
- Open Cursor Settings (Cmd+, / Ctrl+,)
- Go to the Models section
- Click + Add Model and enter the model name: gemma-4-26b-a4b
- Set Override OpenAI Base URL to your Modal deployment URL (e.g., https://scalewaveai--gemma-4-26b-a4b-gguf-server-serve.modal.run/v1)
- Set the OpenAI API Key to your API_KEY
- Enable the model toggle
| Mode | Works | Notes |
|---|---|---|
| Agent (default) | Yes | Full tool calling—file read/write, code search, terminal. Uses /v1/responses |
| Ask | Yes | Read-only Q&A mode. Uses /v1/chat/completions |
| Issue | Fix |
|---|---|
| "Errored, no charge" | Ensure Base URL ends with /v1. Enable HTTP/1.1 in Cursor Settings → Network → HTTP Compatibility Mode |
| Tools not working | Verify Agent mode (not Ask). Check llama-server built from master |
| Slow responses | First request triggers cold start (~5-15s). Subsequent requests are fast |
| Model not appearing | Toggle the model ON in the Models list after adding it |
Responses API
The OpenAI Responses API format used by Cursor and Codex CLI.
Cursor's Agent mode and Codex CLI send requests using the OpenAI Responses API shape (POST /v1/responses with input instead of messages, flat tool definitions).
Our llama-server build (from master) includes /v1/responses (llama.cpp PR #18486), which converts Responses API requests to Chat Completions internally and translates responses back.
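A sketch of a direct Responses API call against this deployment. It assumes an openai Python SDK recent enough to expose client.responses; how completely every Responses field round-trips depends on the llama.cpp build, as noted below.

```python
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

# Responses API shape: `input` (not `messages`), `max_output_tokens`
response = client.responses.create(
    model="gemma-4-26b-a4b",
    input=[{"role": "user", "content": "Summarize the difference between /v1/responses and /v1/chat/completions."}],
    max_output_tokens=256,
)
print(response.output_text)
```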
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Standard chat completions (streaming and non-streaming) |
| /v1/responses | POST | OpenAI Responses API (Cursor Agent mode, Codex CLI) |
| /v1/models | GET | List available models |
| /health | GET | Server health check ({"status":"ok"}) |
| /metrics | GET | Prometheus metrics |
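A quick smoke test of the auxiliary endpoints, as a sketch; whether /health and /metrics sit behind the same Authorization header depends on how the deployment proxies requests.

```python
import requests

BASE = "https://<app>.modal.run"
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

print(requests.get(f"{BASE}/health", headers=HEADERS, timeout=30).json())     # expect {"status": "ok"}
print(requests.get(f"{BASE}/v1/models", headers=HEADERS, timeout=30).json())  # list of served models
```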
Build requirement: The /v1/responses endpoint landed in PR #18486 (merged Jan 21, 2026). Older tagged builds may not include it or may have incomplete tool-calling round-trips. Build from master for full compatibility.
Framework examples
Integration examples for popular frameworks and languages.
LangChain (Python):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    model="gemma-4-26b-a4b",
    temperature=1.0,
    max_tokens=512,
)

result = llm.invoke("Explain quantum computing.")
print(result.content)
```
LiteLLM (Python):

```python
import litellm

response = litellm.completion(
    model="openai/gemma-4-26b-a4b",
    api_base="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
OpenAI SDK (Node.js / TypeScript):

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://<app>.modal.run/v1",
  apiKey: "YOUR_KEY",
});

const response = await client.chat.completions.create({
  model: "gemma-4-26b-a4b",
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);
```
cURL (streaming):

```bash
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 256
  }'
```
Limits and constraints
Hard limits and known constraints for this deployment.
| Constraint | Value | Notes |
|---|---|---|
| Max context | 256K tokens | Model limit; server configured for 16K per slot |
| Parallel slots | 4 | 4 concurrent requests per container |
| Max containers | 3 | Up to 12 concurrent requests total |
| Image formats | JPEG, PNG, GIF, WebP | Via image_url content blocks |
| Audio | Not supported | Audio is E2B/E4B variants only |
| Video | Not supported | Not available in this deployment |
| Thinking budget | Per-request | thinking_budget_tokens in extra_body |
For operational caveats, monitoring, and upgrade procedures, see Operate & compare.
External references
Sources cited inline in this section.
- OpenAI API: Create chat completion (platform.openai.com)
- OpenAI API: Create response (platform.openai.com)
- OpenAI docs: Images and vision (platform.openai.com)
- OpenAI: Function calling (platform.openai.com)
- Google DeepMind Blog: Gemma 4 (deepmind.google)
- Gemma 4 Prompt Formatting Guide (ai.google.dev)
- Gemma 4 Thinking Capabilities (ai.google.dev)
- HuggingFace: unsloth/gemma-4-26B-A4B-it-GGUF (huggingface.co)
- llama.cpp Gemma 4 parser, PR #21418 (github.com)
- llama.cpp Responses API, PR #18486 (github.com)
- llama.cpp server README (github.com)