
Gemma-4-26B-A4B-it-GGUF on Modal

APIs & clients

Vision, thinking, tool calling, Responses API, Cursor/Codex, frameworks, and limits.

Complete guide to Gemma 4's production feature surface: multimodal vision, adaptive thinking with budget control, native tool calling, Cursor IDE integration, and framework examples for popular clients. Unless noted otherwise, requests use the OpenAI Chat Completions wire format.

01

Vision / multimodal

Image input via the multimodal projector (mmproj-F32.gguf).

How it works: Vision is handled by a separate mmproj-F32.gguf (~2 GB) loaded via --mmproj from the published GGUF bundle. llama-server's internal libmtmd library handles image encoding.

Input formats: Standard OpenAI image_url content blocks work directly (OpenAI vision guide). Both remote URLs and base64 data URLs are supported.

What works

  • Image captioning and description
  • OCR / text extraction from images
  • Object detection and spatial understanding
  • Chart, diagram, and document analysis
  • Multiple images in a single message
  • Vision + thinking combined analysis

What doesn't work

  • Audio input (E2B/E4B variants only)
  • Video input
  • Very large images may exceed context limits

vision-single-image.py
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

# Single image via URL
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

vision-base64.py
import base64

# Base64 encoding (recommended for reliability)
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=1024,
)

vision-multiple-images.py
# Multiple images in one message
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_2}"}},
        {"type": "text", "text": "Compare these two images."},
    ],
}]

vision-with-thinking.py
# Vision + thinking combined
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "What breed is this dog? Analyze carefully."},
        ],
    }],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

raw = response.choices[0].message.model_dump()
print(f"Thinking: {(raw.get('reasoning_content') or '')[:200]}")
print(f"Answer: {response.choices[0].message.content}")

Tip: Base64 encoding client-side is more reliable than passing remote URLs (avoids server-side fetch issues). Disable thinking for simple image descriptions to get faster, more direct responses.

02

Adaptive thinking

Chain-of-thought reasoning with per-request control and budget limits.

How it works: Gemma 4 uses special tokens <|channel>thought\n and <channel|> to delimit thinking blocks (prompt formatting reference). llama-server parses these and exposes them via a separate reasoning_content field—never mixed into content.

Adaptive behavior: Setting enable_thinking: True does not guarantee thinking output. Gemma 4 uses adaptive reasoning (Google AI thinking doc) and may skip thinking for trivial questions. enable_thinking: False is deterministic and always suppresses thinking.

thinking-enable.py
# ENABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

msg = response.choices[0].message
raw = msg.model_dump()

thinking = raw.get("reasoning_content") or ""  # chain-of-thought
answer = msg.content or ""                     # final response

print(f"Thinking ({len(thinking)} chars): {thinking[:200]}...")
print(f"Answer: {answer}")

thinking-disable.py
# DISABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# reasoning_content will be empty
print(response.choices[0].message.content)

thinking-budget.py
# Thinking with a token budget: limit thinking to 256 tokens
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,
    },
)

# Anthropic-compatible format also works:
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "thinking": {"type": "enabled", "budget_tokens": 256},
    },
)

thinking-streaming.py
# Streaming with reasoning_content
stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True,
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    raw = delta.model_dump()

    reasoning = raw.get("reasoning_content") or ""
    content = delta.content or ""

    if reasoning:
        # Thinking tokens stream here first
        print(f"[think] {reasoning}", end="", flush=True)
    if content:
        # Response tokens stream after thinking completes
        print(content, end="", flush=True)

Tips for thinking

  • For math, logic, coding, and multi-step problems: enable thinking and set max_tokens high (2048+)
  • For simple factual questions, translations, and formatting: disable thinking for a faster response
  • If you get empty content with non-empty reasoning_content, the model spent its entire budget on thinking; increase max_tokens
  • Use thinking_budget_tokens to cap reasoning time for latency-sensitive use cases
  • System prompts like "Think step by step" can encourage deeper thinking when enabled (see the sketch below)
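
Putting a couple of these tips together; a minimal sketch, assuming the client from the earlier examples (the system prompt wording and the double-the-budget retry are illustrative choices, not server requirements):

def ask_with_thinking(client, question, max_tokens=2048):
    # Nudge deeper reasoning with a system prompt, then check whether the model
    # spent its whole budget on thinking; if so, retry once with a larger budget.
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[
            {"role": "system", "content": "Think step by step before answering."},
            {"role": "user", "content": question},
        ],
        max_tokens=max_tokens,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    msg = response.choices[0].message
    raw = msg.model_dump()
    # Empty content + non-empty reasoning_content means the token budget ran out mid-thought.
    if (not (msg.content or "").strip()
            and raw.get("reasoning_content")
            and max_tokens < 8192):
        return ask_with_thinking(client, question, max_tokens=max_tokens * 2)
    return msg.content

print(ask_with_thinking(client, "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"))
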
03

Tool calling

Native Gemma 4 tool parser with OpenAI-format tool_calls.

How it works: With --jinja, llama-server enables Gemma 4's native tool-calling parser (upstream Gemma 4 parser PR). The model emits special <|tool_call> tokens, which are parsed and returned as standard OpenAI-format tool_calls.

Interleaved thinking: With the interleaved template, the model preserves its chain of thought between tool calls. Previous thinking is stripped from conversation history automatically to prevent context bloat.

tools-define.py
# Define tools in standard OpenAI format
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
]

tools-loop.py
import json

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Step 1: Send query with tools
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=messages,
    tools=tools,
    max_tokens=512,
)

msg = response.choices[0].message

if msg.tool_calls:
    # Step 2: Execute tool(s) locally
    messages.append(msg)
    for tc in msg.tool_calls:
        print(f"Tool call: {tc.function.name}({tc.function.arguments})")

        # Your tool implementation here
        result = {"location": "Tokyo", "temp": 22, "condition": "Sunny"}

        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(result),
        })

    # Step 3: Get final answer
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=messages,
        tools=tools,
        max_tokens=512,
    )
    print(response.choices[0].message.content)
else:
    # Model answered directly without tools
    print(msg.content)

Multi-step tool calls: The model can make multiple tool calls in sequence. Wrap the loop above in while msg.tool_calls: to handle chains automatically.
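
A minimal sketch of that loop, assuming the client and tools from the examples above and an execute_tool(name, args) function you implement yourself:

import json

def run_agent(client, messages, tools, execute_tool, max_rounds=5):
    # Generic tool-calling loop: keep calling the model until it stops requesting tools.
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=messages,
            tools=tools,
            max_tokens=512,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final answer, no further tool calls
        messages.append(msg)  # keep the assistant turn with its tool_calls
        for tc in msg.tool_calls:
            result = execute_tool(tc.function.name, json.loads(tc.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })
    raise RuntimeError("Tool-call chain exceeded max_rounds")

Capping the number of rounds guards against the model looping on the same tool indefinitely.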

04

tool_choice limitations

What works, what doesn't, and workarounds for reliable tool calling.

Behavior is compared against the official OpenAI tool_choice contract; known gaps in this server are called out in the table below.

Value | Status | Behavior | Root cause
"auto" | Reliable | Model decides whether to call tools | Default server/model path
"none" | Partial | No parsed tool_calls, but may leak raw tokens into content | Parser suppresses extraction but not generation
"required" | Fails | Empty tool_calls and empty content | Grammar constraint activates but output can't be parsed
{"type":"function",...} | Ignored | Falls back to "auto" silently | Server parses tool_choice as a string only

Workaround: Force a specific tool

Only pass that one tool in the tools array. The model reliably calls the only available tool when the query needs it.
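
For example, reusing the get_weather definition from tools-define.py (a sketch; filtering by name is just one way to narrow the list):

# Expose only the tool you want called; with a single tool available the model
# reliably uses it whenever the query needs it.
forced_tools = [t for t in tools if t["function"]["name"] == "get_weather"]

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=forced_tools,
    max_tokens=512,
)
# Inspect response.choices[0].message.tool_calls as in tools-loop.py.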

Workaround: Force any tool call

Use a system message: "You MUST use one of the provided tools. Do NOT answer directly." with tool_choice="auto".
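
A sketch of that prompt pattern (the exact wording of the system message is up to you):

# Strong system instruction + tool_choice="auto" (the only reliably supported value)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[
        {"role": "system", "content": "You MUST use one of the provided tools. Do NOT answer directly."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)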

Test results (19 tests, test_tools.py)

Category | Tests | Result
Single tool calls (weather, calculator) | 2 | PASS
Parallel tool calls | 1 | PASS
Multi-step chains | 1 | PASS
No tool needed (answer directly) | 1 | PASS
tool_choice="none" | 1 | PASS (partial)
tool_choice="required" | 1 | FAIL (known)
Specific function object | 1 | FAIL (known)
Large tool set (select from 8) | 2 | PASS
Argument types (arrays, objects, numbers) | 3 | PASS
Thinking + tools combined | 1 | PASS
Streaming tool calls | 1 | PASS
Error recovery / tool_call_id roundtrip | 2 | PASS

05

Cursor IDE integration

Full Agent mode support with tool calling via the Responses API.

This endpoint works as a custom model in Cursor IDE, including full Agent mode with tool calling (file read, file write, code search, terminal commands, etc.).

Setup steps

  1. Open Cursor Settings (Cmd+, / Ctrl+,)
  2. Go to Models section
  3. Click + Add Model, enter model name: gemma-4-26b-a4b
  4. Set Override OpenAI Base URL to your Modal deployment URL (e.g., https://scalewaveai--gemma-4-26b-a4b-gguf-server-serve.modal.run/v1)
  5. Set the OpenAI API Key to your API_KEY
  6. Enable the model toggle
Mode | Works | Notes
Agent (default) | Yes | Full tool calling: file read/write, code search, terminal. Uses /v1/responses
Ask | Yes | Read-only Q&A mode. Uses /v1/chat/completions

Issue | Fix
"Errored, no charge" | Ensure the Base URL ends with /v1. Enable HTTP/1.1 in Cursor Settings → Network → HTTP Compatibility Mode
Tools not working | Verify Agent mode (not Ask). Check that llama-server is built from master
Slow responses | First request triggers a cold start (~5-15 s). Subsequent requests are fast
Model not appearing | Toggle the model ON in the Models list after adding it

06

Responses API

The OpenAI Responses API format used by Cursor and Codex CLI.

Cursor's Agent mode and Codex CLI send requests using the OpenAI Responses API shape (POST /v1/responses with input instead of messages, flat tool definitions).

Our llama-server build (from master) includes /v1/responses (llama.cpp PR #18486), which converts Responses API requests to Chat Completions internally and translates responses back.
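
For a raw request without a client library, here is a minimal sketch against /v1/responses; the field names (input, max_output_tokens) and the output item structure follow the OpenAI Responses spec, which this build is assumed to mirror:

import requests

resp = requests.post(
    "https://<app>.modal.run/v1/responses",
    headers={"Authorization": "Bearer YOUR_KEY", "Content-Type": "application/json"},
    json={
        "model": "gemma-4-26b-a4b",
        "input": "Say hello in one sentence.",  # Responses API takes `input`, not `messages`
        "max_output_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()

# Walk the output list defensively and print any text items.
for item in resp.json().get("output", []):
    for part in item.get("content") or []:
        if part.get("type") == "output_text":
            print(part.get("text", ""))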

Endpoint | Method | Description
/v1/chat/completions | POST | Standard chat completions (streaming and non-streaming)
/v1/responses | POST | OpenAI Responses API (Cursor Agent mode, Codex CLI)
/v1/models | GET | List available models
/health | GET | Server health check ({"status":"ok"})
/metrics | GET | Prometheus metrics
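
A quick smoke test of these endpoints (a sketch; the API key is sent on every request here, even though /health and /metrics may not require it in your setup):

import requests

BASE = "https://<app>.modal.run"
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

# Liveness: expect {"status": "ok"}
print(requests.get(f"{BASE}/health", headers=HEADERS, timeout=30).json())

# Model listing: confirm the served model id appears
models = requests.get(f"{BASE}/v1/models", headers=HEADERS, timeout=30).json()
print([m.get("id") for m in models.get("data", [])])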

Build requirement: The /v1/responses endpoint landed in PR #18486 (merged Jan 21, 2026). Older tagged builds may not include it or may have incomplete tool-calling round-trips. Build from master for full compatibility.

07

Framework examples

Integration examples for popular frameworks and languages.

langchain.py
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    model="gemma-4-26b-a4b",
    temperature=1.0,
    max_tokens=512,
)

result = llm.invoke("Explain quantum computing.")
print(result.content)

litellm.py
import litellm

response = litellm.completion(
    model="openai/gemma-4-26b-a4b",
    api_base="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

typescript.ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://<app>.modal.run/v1",
  apiKey: "YOUR_KEY",
});

const response = await client.chat.completions.create({
  model: "gemma-4-26b-a4b",
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);

curl.sh
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 256
  }'

08

Limits and constraints

Hard limits and known constraints for this deployment.

Constraint | Value | Notes
Max context | 256K tokens | Model limit; server configured for 16K per slot
Parallel slots | 4 | 4 concurrent requests per container
Max containers | 3 | Up to 12 concurrent requests total
Image formats | JPEG, PNG, GIF, WebP | Via image_url content blocks
Audio | Not supported | Audio is E2B/E4B variants only
Video | Not supported | Not available in this deployment
Thinking budget | Per-request | thinking_budget_tokens in extra_body

For operational caveats, monitoring, and upgrade procedures, see Operate & compare.

09

External references

Sources cited inline on this tab.