
Gemma-4-26B-A4B-it-GGUF on Modal

APIs & clients

Vision, thinking, tool calling, Responses API, Cursor/Codex, frameworks, and limits.

Complete guide to Gemma 4's production feature surface: multimodal vision, adaptive thinking with budget control, native tool calling, Cursor IDE integration, and framework examples for popular clients. Unless noted otherwise, requests use the OpenAI Chat Completions wire format.

01

Vision / multimodal

Image input via the multimodal projector (mmproj-F32.gguf).

How it works: Vision is handled by a separate mmproj-F32.gguf (~2 GB) loaded via --mmproj from the published GGUF bundle. llama-server's internal libmtmd library handles image encoding.

Input formats: Standard OpenAI image_url content blocks work directly (OpenAI vision guide). Both remote URLs and base64 data URLs are supported.

What works

  • Image captioning and description
  • OCR / text extraction from images
  • Object detection and spatial understanding
  • Chart, diagram, and document analysis
  • Multiple images in a single message
  • Vision + thinking combined analysis

What doesn't work

  • Audio input (E2B/E4B variants only)
  • Video input
  • Very large images may exceed context limits

vision-single-image.py
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

# Single image via URL
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

vision-base64.py
import base64

# Base64 encoding (recommended for reliability)
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=1024,
)

vision-multiple-images.py
# Multiple images in one message
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_2}"}},
        {"type": "text", "text": "Compare these two images."},
    ],
}]

vision-with-thinking.py
# Vision + thinking combined
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "What breed is this dog? Analyze carefully."},
        ],
    }],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

raw = response.choices[0].message.model_dump()
print(f"Thinking: {(raw.get('reasoning_content') or '')[:200]}")
print(f"Answer: {response.choices[0].message.content}")

Tip: Base64 encoding client-side is more reliable than passing remote URLs (avoids server-side fetch issues). Disable thinking for simple image descriptions to get faster, more direct responses.

02

Adaptive thinking

Chain-of-thought reasoning with per-request control and budget limits.

How it works: Gemma 4 uses special tokens <|channel>thought\n and <channel|> to delimit thinking blocks (prompt formatting reference). llama-server parses these and exposes them via a separate reasoning_content field—never mixed into content.

Adaptive behavior: Setting enable_thinking: True does not guarantee thinking output. Gemma 4 uses adaptive reasoning (Google AI thinking doc) and may skip thinking for trivial questions. enable_thinking: False is deterministic and always suppresses thinking.

thinking-enable.py
# ENABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

msg = response.choices[0].message
raw = msg.model_dump()

thinking = raw.get("reasoning_content") or ""  # chain-of-thought
answer = msg.content or ""                     # final response

print(f"Thinking ({len(thinking)} chars): {thinking[:200]}...")
print(f"Answer: {answer}")

thinking-disable.py
# DISABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# reasoning_content will be empty
print(response.choices[0].message.content)

thinking-budget.py
# Thinking with a token budget: limit thinking to 256 tokens
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,
    },
)

# Anthropic-compatible format also works:
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "thinking": {"type": "enabled", "budget_tokens": 256},
    },
)

thinking-streaming.py
# Streaming with reasoning_content
stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True,
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    raw = delta.model_dump()

    reasoning = raw.get("reasoning_content") or ""
    content = delta.content or ""

    if reasoning:
        # Thinking tokens stream here first
        print(f"[think] {reasoning}", end="", flush=True)
    if content:
        # Response tokens stream after thinking completes
        print(content, end="", flush=True)

Tips for thinking

  • For math, logic, coding, and multi-step problems: enable thinking and set max_tokens high (2048+)
  • For simple factual questions, translations, and formatting: disable thinking for a faster response
  • If you get empty content with non-empty reasoning_content, the model spent its entire budget on thinking; increase max_tokens
  • Use thinking_budget_tokens to cap reasoning time for latency-sensitive use cases
  • System prompts like "Think step by step" can encourage deeper thinking when enabled (see the sketch below)
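
Putting a couple of these tips together; a minimal sketch, assuming the client from the earlier examples (the system prompt wording and the double-the-budget retry are illustrative choices, not server requirements):

def ask_with_thinking(client, question, max_tokens=2048):
    # Nudge deeper reasoning with a system prompt, then check whether the model
    # spent its whole budget on thinking; if so, retry once with a larger budget.
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=[
            {"role": "system", "content": "Think step by step before answering."},
            {"role": "user", "content": question},
        ],
        max_tokens=max_tokens,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    msg = response.choices[0].message
    raw = msg.model_dump()
    # Empty content + non-empty reasoning_content means the token budget ran out mid-thought.
    if (not (msg.content or "").strip()
            and raw.get("reasoning_content")
            and max_tokens < 8192):
        return ask_with_thinking(client, question, max_tokens=max_tokens * 2)
    return msg.content

print(ask_with_thinking(client, "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"))
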
03

Tool calling

Native Gemma 4 tool parser with OpenAI-format tool_calls.

How it works: With --jinja, llama-server enables Gemma 4's native tool-calling parser (upstream Gemma 4 parser PR). The model emits special <|tool_call> tokens, which are parsed and returned as standard OpenAI-format tool_calls.

Interleaved thinking: With the interleaved template, the model preserves its chain of thought between tool calls. Previous thinking is stripped from conversation history automatically to prevent context bloat.

tools-define.py
# Define tools in standard OpenAI format
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
]

tools-loop.py
import json

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Step 1: Send query with tools
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=messages,
    tools=tools,
    max_tokens=512,
)

msg = response.choices[0].message

if msg.tool_calls:
    # Step 2: Execute tool(s) locally
    messages.append(msg)
    for tc in msg.tool_calls:
        print(f"Tool call: {tc.function.name}({tc.function.arguments})")

        # Your tool implementation here
        result = {"location": "Tokyo", "temp": 22, "condition": "Sunny"}

        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(result),
        })

    # Step 3: Get final answer
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=messages,
        tools=tools,
        max_tokens=512,
    )
    print(response.choices[0].message.content)
else:
    # Model answered directly without tools
    print(msg.content)

Multi-step tool calls: The model can make multiple tool calls in sequence. Wrap the loop above in while msg.tool_calls: to handle chains automatically.
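
A minimal sketch of that loop, assuming the client and tools from the examples above and an execute_tool(name, args) function you implement yourself:

import json

def run_agent(client, messages, tools, execute_tool, max_rounds=5):
    # Generic tool-calling loop: keep calling the model until it stops requesting tools.
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=messages,
            tools=tools,
            max_tokens=512,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final answer, no further tool calls
        messages.append(msg)  # keep the assistant turn with its tool_calls
        for tc in msg.tool_calls:
            result = execute_tool(tc.function.name, json.loads(tc.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })
    raise RuntimeError("Tool-call chain exceeded max_rounds")

Capping the number of rounds guards against the model looping on the same tool indefinitely.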

04

tool_choice limitations

What works, what doesn't, and workarounds for reliable tool calling.

Behavior is compared against the official OpenAI tool_choice contract; known gaps in this server are called out in the table below.

Value | Status | Behavior | Root cause
"auto" | Reliable | Model decides whether to call tools | Default server/model path
"none" | Partial | No parsed tool_calls, but may leak raw tokens into content | Parser suppresses extraction but not generation
"required" | Fails | Empty tool_calls and empty content | Grammar constraint activates but output can't be parsed
{"type":"function",...} | Ignored | Falls back to "auto" silently | Server parses tool_choice as a string only

Workaround: Force a specific tool

Only pass that one tool in the tools array. The model reliably calls the only available tool when the query needs it.
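
For example, reusing the get_weather definition from tools-define.py (a sketch; filtering by name is just one way to narrow the list):

# Expose only the tool you want called; with a single tool available the model
# reliably uses it whenever the query needs it.
forced_tools = [t for t in tools if t["function"]["name"] == "get_weather"]

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=forced_tools,
    max_tokens=512,
)
# Inspect response.choices[0].message.tool_calls as in tools-loop.py.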

Workaround: Force any tool call

Use a system message: "You MUST use one of the provided tools. Do NOT answer directly." with tool_choice="auto".
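
A sketch of that prompt pattern (the exact wording of the system message is up to you):

# Strong system instruction + tool_choice="auto" (the only reliably supported value)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[
        {"role": "system", "content": "You MUST use one of the provided tools. Do NOT answer directly."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)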

Test results (19 tests, test_tools.py)

Category | Tests | Result
Single tool calls (weather, calculator) | 2 | PASS
Parallel tool calls | 1 | PASS
Multi-step chains | 1 | PASS
No tool needed (answer directly) | 1 | PASS
tool_choice="none" | 1 | PASS (partial)
tool_choice="required" | 1 | FAIL (known)
Specific function object | 1 | FAIL (known)
Large tool set (select from 8) | 2 | PASS
Argument types (arrays, objects, numbers) | 3 | PASS
Thinking + tools combined | 1 | PASS
Streaming tool calls | 1 | PASS
Error recovery / tool_call_id roundtrip | 2 | PASS

05

Cursor IDE integration

Full Agent mode support with tool calling via the Responses API.

This endpoint works as a custom model in Cursor IDE, including full Agent mode with tool calling (file read, file write, code search, terminal commands, etc.).

Setup steps

  1. Open Cursor Settings (Cmd+, / Ctrl+,)
  2. Go to Models section
  3. Click + Add Model, enter model name: gemma-4-26b-a4b
  4. Set Override OpenAI Base URL to your Modal deployment URL (e.g., https://scalewaveai--gemma-4-26b-a4b-gguf-server-serve.modal.run/v1)
  5. Set the OpenAI API Key to your API_KEY
  6. Enable the model toggle
Mode | Works | Notes
Agent (default) | Yes | Full tool calling: file read/write, code search, terminal. Uses /v1/responses
Ask | Yes | Read-only Q&A mode. Uses /v1/chat/completions

Issue | Fix
"Errored, no charge" | Ensure the Base URL ends with /v1. Enable HTTP/1.1 in Cursor Settings → Network → HTTP Compatibility Mode
Tools not working | Verify Agent mode (not Ask). Check that llama-server is built from master
Slow responses | First request triggers a cold start (~5-15 s). Subsequent requests are fast
Model not appearing | Toggle the model ON in the Models list after adding it

06

Responses API

The OpenAI Responses API format used by Cursor and Codex CLI.

Cursor's Agent mode and Codex CLI send requests using the OpenAI Responses API shape (POST /v1/responses with input instead of messages, flat tool definitions).

Our llama-server build (from master) includes /v1/responses (llama.cpp PR #18486), which converts Responses API requests to Chat Completions internally and translates responses back.
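
For a raw request without a client library, here is a minimal sketch against /v1/responses; the field names (input, max_output_tokens) and the output item structure follow the OpenAI Responses spec, which this build is assumed to mirror:

import requests

resp = requests.post(
    "https://<app>.modal.run/v1/responses",
    headers={"Authorization": "Bearer YOUR_KEY", "Content-Type": "application/json"},
    json={
        "model": "gemma-4-26b-a4b",
        "input": "Say hello in one sentence.",  # Responses API takes `input`, not `messages`
        "max_output_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()

# Walk the output list defensively and print any text items.
for item in resp.json().get("output", []):
    for part in item.get("content") or []:
        if part.get("type") == "output_text":
            print(part.get("text", ""))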

Endpoint | Method | Description
/v1/chat/completions | POST | Standard chat completions (streaming and non-streaming)
/v1/responses | POST | OpenAI Responses API (Cursor Agent mode, Codex CLI)
/v1/models | GET | List available models
/health | GET | Server health check ({"status":"ok"})
/metrics | GET | Prometheus metrics
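
A quick smoke test of these endpoints (a sketch; the API key is sent on every request here, even though /health and /metrics may not require it in your setup):

import requests

BASE = "https://<app>.modal.run"
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

# Liveness: expect {"status": "ok"}
print(requests.get(f"{BASE}/health", headers=HEADERS, timeout=30).json())

# Model listing: confirm the served model id appears
models = requests.get(f"{BASE}/v1/models", headers=HEADERS, timeout=30).json()
print([m.get("id") for m in models.get("data", [])])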

Build requirement: The /v1/responses endpoint landed in PR #18486 (merged Jan 21, 2026). Older tagged builds may not include it or may have incomplete tool-calling round-trips. Build from master for full compatibility.

07

Framework examples

Integration examples for popular frameworks and languages.

langchain.py
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    model="gemma-4-26b-a4b",
    temperature=1.0,
    max_tokens=512,
)

result = llm.invoke("Explain quantum computing.")
print(result.content)

litellm.py
import litellm

response = litellm.completion(
    model="openai/gemma-4-26b-a4b",
    api_base="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

typescript.ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://<app>.modal.run/v1",
  apiKey: "YOUR_KEY",
});

const response = await client.chat.completions.create({
  model: "gemma-4-26b-a4b",
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);

curl.sh
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 256
  }'

08

Limits and constraints

Hard limits and known constraints for this deployment.

Constraint | Value | Notes
Max context | 256K tokens | Model limit; server configured for 16K per slot
Parallel slots | 4 | 4 concurrent requests per container
Max containers | 3 | Up to 12 concurrent requests total
Image formats | JPEG, PNG, GIF, WebP | Via image_url content blocks
Audio | Not supported | Audio is E2B/E4B variants only
Video | Not supported | Not available in this deployment
Thinking budget | Per-request | thinking_budget_tokens in extra_body

For operational caveats, monitoring, and upgrade procedures, see Operate & compare.

09

External references

Sources cited inline on this tab.