> **Note:** The canonical experience is the interactive HTML tab: [APIs & clients](https://www.quantml.org/guides/gemma-4-gguf/features). This file is a text mirror for search engines and AI tools.

# Gemma 4 GGUF — APIs & clients

Complete guide to [Gemma 4](https://deepmind.google/blog/gemma-4-byte-for-byte-the-most-capable-open-models/)'s production feature surface: multimodal vision, adaptive thinking with budget control, native tool calling, Cursor IDE integration, and integration examples for popular frameworks and languages. Client wire formats follow [OpenAI Chat Completions](https://platform.openai.com/docs/api-reference/chat/create) where noted.

## 1. Vision / multimodal {#vision-stack}

_Image input via the multimodal projector (`mmproj-F32.gguf`)._

**How it works:** Vision is handled by a separate `mmproj-F32.gguf` (~2 GB) loaded via `--mmproj` from the [published GGUF bundle](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF). [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)'s internal `libmtmd` library handles image encoding.

**Input formats:** Standard OpenAI `image_url` content blocks work directly ([OpenAI vision guide](https://platform.openai.com/docs/guides/images-vision)). Both remote URLs and base64 data URLs are supported.

**What works**
- Image captioning and description
- OCR / text extraction from images
- Object detection and spatial understanding
- Chart, diagram, and document analysis
- Multiple images in a single message
- Vision + thinking combined analysis

**What doesn't work**
- Audio input (supported only by the E2B/E4B variants, not this model)
- Video input
- Very large images may exceed context limits (a downscaling sketch follows this list)
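
If oversized images are a concern, downscaling client-side before encoding keeps the payload and the resulting image tokens in check. A minimal sketch, assuming Pillow is installed; the 1024 px cap and JPEG re-encode are illustrative choices, not documented limits:

```python
# Minimal sketch: shrink an image before base64-encoding it so the request
# stays well inside the context limit. Assumes Pillow (`pip install pillow`);
# the 1024 px cap is an arbitrary, illustrative choice.
import base64
import io

from PIL import Image

def encode_image_capped(path: str, max_side: int = 1024) -> str:
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=90)
    return base64.b64encode(buf.getvalue()).decode()

data_url = f"data:image/jpeg;base64,{encode_image_capped('photo.jpg')}"
```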

```python
from openai import OpenAI

client = OpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")

# Single image via URL
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

```python
import base64

# Base64 encoding (recommended for reliability)
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=1024,
)
```

```python
# Multiple images in one message
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_2}"}},
        {"type": "text", "text": "Compare these two images."},
    ],
}]
```

```python
# Vision + thinking combined
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "What breed is this dog? Analyze carefully."},
        ],
    }],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

raw = response.choices[0].message.model_dump()
print(f"Thinking: {raw.get('reasoning_content', '')[:200]}")
print(f"Answer: {response.choices[0].message.content}")
```

> **Tip:** Base64 encoding client-side is more reliable than passing remote URLs (avoids server-side fetch issues). Disable thinking for simple image descriptions to get faster, more direct responses.

## 2. Adaptive thinking {#thinking}

_Chain-of-thought reasoning with per-request control and budget limits._

**How it works:** Gemma 4 uses special tokens `<|channel>thought\n` and `<channel|>` to delimit thinking blocks ([prompt formatting reference](https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4)). [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) parses these and exposes them via a separate `reasoning_content` field—never mixed into `content`.

**Adaptive behavior:** Setting `enable_thinking: True` does not _guarantee_ thinking output. Gemma 4 uses adaptive reasoning ([Google AI thinking doc](https://ai.google.dev/gemma/docs/capabilities/thinking)) and may skip thinking for trivial questions. `enable_thinking: False` is deterministic and always suppresses thinking.

```python
# ENABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

msg = response.choices[0].message
raw = msg.model_dump()

thinking = raw.get("reasoning_content") or ""  # chain-of-thought
answer = msg.content or ""                      # final response

print(f"Thinking ({len(thinking)} chars): {thinking[:200]}...")
print(f"Answer: {answer}")
```

```python
# DISABLE thinking (explicit)
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# reasoning_content will be empty
print(response.choices[0].message.content)
```

```python
# Thinking with token budget — limit thinking to 256 tokens
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking_budget_tokens": 256,
    },
)

# Anthropic-compatible format also works:
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Solve 23 * 47"}],
    max_tokens=1024,
    extra_body={
        "thinking": {"type": "enabled", "budget_tokens": 256},
    },
)
```

```python
# Streaming with reasoning_content
stream = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True,
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    raw = delta.model_dump()

    reasoning = raw.get("reasoning_content") or ""
    content = delta.content or ""

    if reasoning:
        # Thinking tokens stream here first
        print(f"[think] {reasoning}", end="", flush=True)
    if content:
        # Response tokens stream after thinking completes
        print(content, end="", flush=True)
```

**Tips for thinking**
- For math, logic, coding, and multi-step problems: **enable thinking** and set `max_tokens` high (2048+)
- For simple factual questions, translations, formatting: **disable thinking** for faster response
- If you get empty `content` with non-empty `reasoning_content`, the model spent its entire budget on thinking; increase `max_tokens` (see the retry sketch after this list)
- Use `thinking_budget_tokens` to cap reasoning length for latency-sensitive use cases
- System prompts like "Think step by step" can encourage deeper thinking when enabled
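
The empty-`content` case in the third tip can also be handled programmatically. A minimal sketch, reusing the `client` from above; the single-retry, doubling policy is an illustrative choice, not part of the API:

```python
# Minimal sketch: if the whole budget went to thinking (non-empty
# reasoning_content, empty content), retry once with a larger max_tokens.
def ask_with_retry(client, prompt: str, max_tokens: int = 1024):
    for _ in range(2):
        response = client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        msg = response.choices[0].message
        raw = msg.model_dump()
        if (msg.content or "").strip():
            return msg.content, raw.get("reasoning_content") or ""
        max_tokens *= 2  # all tokens went to thinking; give the answer more room
    return "", raw.get("reasoning_content") or ""
```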

## 3. Tool calling {#tools}

_Native Gemma 4 tool parser with OpenAI-format `tool_calls`._

**How it works:** With `--jinja`, [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) enables Gemma 4's native tool-calling parser ([upstream Gemma 4 parser PR](https://github.com/ggml-org/llama.cpp/pull/21418)). The model emits special `<|tool_call|>` tokens, which are parsed and returned as standard [OpenAI-format](https://platform.openai.com/docs/guides/function-calling) `tool_calls`.

**Interleaved thinking:** With the interleaved template, the model preserves its chain of thought between tool calls. Previous thinking is stripped from conversation history automatically to prevent context bloat.

```python
# Define tools in standard OpenAI format
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
]
```

```python
import json

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Step 1: Send query with tools
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=messages,
    tools=tools,
    max_tokens=512,
)

msg = response.choices[0].message

if msg.tool_calls:
    # Step 2: Execute tool(s) locally
    messages.append(msg)
    for tc in msg.tool_calls:
        print(f"Tool call: {tc.function.name}({tc.function.arguments})")

        # Your tool implementation here
        result = {"location": "Tokyo", "temp": 22, "condition": "Sunny"}

        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(result),
        })

    # Step 3: Get final answer
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=messages,
        tools=tools,
        max_tokens=512,
    )
    print(response.choices[0].message.content)
else:
    # Model answered directly without tools
    print(msg.content)
```

> **Multi-step tool calls:** The model can make multiple tool calls in sequence. Wrap the loop above in `while msg.tool_calls:` to handle chains automatically.
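
A minimal sketch of that loop, reusing the `client` and `tools` objects from the examples above. `execute_tool` is a hypothetical dispatcher standing in for your own implementations, and the cap of 8 rounds is an arbitrary safety limit:

```python
import json

def execute_tool(name: str, arguments: str) -> dict:
    # Placeholder dispatcher: wire in your real tool implementations here.
    args = json.loads(arguments or "{}")
    if name == "get_weather":
        return {"location": args.get("location"), "temp": 22, "condition": "Sunny"}
    return {"error": f"unknown tool: {name}"}

messages = [{"role": "user", "content": "What's the weather in Tokyo and Paris?"}]

for _ in range(8):  # hard cap on rounds to avoid a runaway tool loop
    response = client.chat.completions.create(
        model="gemma-4-26b-a4b",
        messages=messages,
        tools=tools,
        max_tokens=512,
    )
    msg = response.choices[0].message
    if not msg.tool_calls:
        break  # model produced a final answer
    messages.append(msg)
    for tc in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": json.dumps(execute_tool(tc.function.name, tc.function.arguments)),
        })

print(msg.content)
```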

## 4. tool_choice limitations {#tool-choice}

_What works, what doesn't, and workarounds for reliable tool calling._

Compare behavior against the official [tool_choice](https://platform.openai.com/docs/guides/function-calling) contract; this server has known gaps called out in the table.

| Value | Status | Behavior | Root cause |
|---|---|---|---|
| `"auto"` | Reliable | Model decides whether to call tools | Default server/model path |
| `"none"` | Partial | No parsed `tool_calls`, but may leak raw tokens into content | Parser suppresses extraction but not generation |
| `"required"` | Fails | Empty `tool_calls` and empty content | Grammar constraint activates but output can't be parsed |
| `{"type":"function",...}` | Ignored | Falls back to `"auto"` silently | Server parses `tool_choice` as string only |

**Workaround: Force a specific tool**  
Only pass that one tool in the `tools` array. The model reliably calls the only available tool when the query needs it.

**Workaround: Force any tool call**  
Use a system message: `"You MUST use one of the provided tools. Do NOT answer directly."` with `tool_choice="auto"`.
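
Both workarounds in code, a minimal sketch reusing the `client` and `tools` objects from section 3 (the queries are placeholders):

```python
# Workaround 1: force a specific tool by passing only that tool.
weather_only = [t for t in tools if t["function"]["name"] == "get_weather"]
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=weather_only,
    tool_choice="auto",  # avoid the unreliable "required" / function-object forms
    max_tokens=512,
)

# Workaround 2: force *some* tool call via a strongly worded system message.
response = client.chat.completions.create(
    model="gemma-4-26b-a4b",
    messages=[
        {"role": "system", "content": "You MUST use one of the provided tools. Do NOT answer directly."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
    ],
    tools=tools,
    tool_choice="auto",
    max_tokens=512,
)
```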

**Test results (19 tests, `test_tools.py`)**

| Category | Tests | Result |
|---|---:|---|
| Single tool calls (weather, calculator) | 2 | PASS |
| Parallel tool calls | 1 | PASS |
| Multi-step chains | 1 | PASS |
| No tool needed (answer directly) | 1 | PASS |
| `tool_choice="none"` | 1 | PASS (partial) |
| `tool_choice="required"` | 1 | FAIL (known) |
| Specific function object | 1 | FAIL (known) |
| Large tool set (select from 8) | 2 | PASS |
| Argument types (arrays, objects, numbers) | 3 | PASS |
| Thinking + tools combined | 1 | PASS |
| Streaming tool calls | 1 | PASS |
| Error recovery / `tool_call_id` roundtrip | 2 | PASS |

## 5. Cursor IDE integration {#cursor-ide}

_Full Agent mode support with tool calling via the Responses API._

This endpoint works as a **custom model in Cursor IDE**, including full Agent mode with tool calling (file read, file write, code search, terminal commands, etc.).

**Setup steps**
1. Open **Cursor Settings** (`Cmd+,` / `Ctrl+,`)
2. Go to **Models** section
3. Click **+ Add Model**, enter model name: `gemma-4-26b-a4b`
4. Set **Override OpenAI Base URL** to your Modal deployment URL (e.g., `https://scalewaveai--gemma-4-26b-a4b-gguf-server-serve.modal.run/v1`)
5. Set the **OpenAI API Key** to your `API_KEY`
6. Enable the model toggle

| Mode | Works | Notes |
|---|---|---|
| Agent (default) | Yes | Full tool calling—file read/write, code search, terminal. Uses `/v1/responses` |
| Ask | Yes | Read-only Q&A mode. Uses `/v1/chat/completions` |

| Issue | Fix |
|---|---|
| "Errored, no charge" | Ensure Base URL ends with `/v1`. Enable HTTP/1.1 in Cursor Settings -> Network -> HTTP Compatibility Mode |
| Tools not working | Verify Agent mode (not Ask). Check llama-server built from `master` |
| Slow responses | First request triggers cold start (~5-15s). Subsequent requests are fast |
| Model not appearing | Toggle the model ON in the Models list after adding it |

## 6. Responses API {#responses-api}

_The OpenAI Responses API format used by Cursor and Codex CLI._

Cursor's Agent mode and Codex CLI send requests using the OpenAI [Responses API](https://platform.openai.com/docs/api-reference/responses/create) shape (`POST /v1/responses` with `input` instead of `messages`, flat tool definitions).

Our [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) build (from `master`) includes `/v1/responses` ([llama.cpp PR #18486](https://github.com/ggml-org/llama.cpp/pull/18486)), which converts Responses API requests to Chat Completions internally and translates responses back.
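
You can also hit `/v1/responses` directly outside Cursor or Codex CLI. A minimal sketch, assuming a recent OpenAI Python SDK that ships the Responses client and the same `client` configuration as above; parameter names follow the upstream Responses API, and support on this server may vary:

```python
# Minimal sketch: call /v1/responses directly with `input` instead of `messages`.
# Assumes the openai SDK version in use exposes client.responses.create.
response = client.responses.create(
    model="gemma-4-26b-a4b",
    input="Summarize the difference between /v1/responses and /v1/chat/completions.",
    max_output_tokens=256,
)
print(response.output_text)
```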

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Standard chat completions (streaming and non-streaming) |
| `/v1/responses` | POST | OpenAI Responses API (Cursor Agent mode, Codex CLI) |
| `/v1/models` | GET | List available models |
| `/health` | GET | Server health check (`{"status":"ok"}`); see the probe sketch below the table |
| `/metrics` | GET | Prometheus metrics |
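
A quick readiness probe against `/health` before sending traffic; a minimal sketch using `requests` with the same placeholder deployment URL as above (depending on your deployment, the proxy may also require the API key header):

```python
# Minimal sketch: poll /health until the server reports ready.
# Note /health sits at the root, outside the /v1 prefix.
import time

import requests

BASE = "https://<app>.modal.run"

for _ in range(30):
    try:
        r = requests.get(f"{BASE}/health", timeout=5)
        if r.ok and r.json().get("status") == "ok":
            print("server ready")
            break
    except requests.RequestException:
        pass
    time.sleep(2)  # cold starts can take several seconds
```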

> **Build requirement:** The `/v1/responses` endpoint landed in [PR #18486](https://github.com/ggml-org/llama.cpp/pull/18486) (merged Jan 21, 2026). Older tagged builds may not include it or may have incomplete tool-calling round-trips. Build from `master` for full compatibility.

## 7. Framework examples {#frameworks}

_Integration examples for popular frameworks and languages._

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    model="gemma-4-26b-a4b",
    temperature=1.0,
    max_tokens=512,
)

result = llm.invoke("Explain quantum computing.")
print(result.content)
```

```python
import litellm

response = litellm.completion(
    model="openai/gemma-4-26b-a4b",
    api_base="https://<app>.modal.run/v1",
    api_key="YOUR_KEY",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://<app>.modal.run/v1",
  apiKey: "YOUR_KEY",
});

const response = await client.chat.completions.create({
  model: "gemma-4-26b-a4b",
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);
```

```bash
curl -X POST https://<app>.modal.run/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 256
  }'
```

## 8. Limits and constraints {#limits}

_Hard limits and known constraints for this deployment._

| Constraint | Value | Notes |
|---|---|---|
| Max context | 256K tokens | Model limit; server configured for 65K per slot |
| Parallel slots | 4 | 4 concurrent requests per container |
| Max containers | 3 | Up to 12 concurrent requests total (see the concurrency sketch below) |
| Image formats | JPEG, PNG, GIF, WebP | Via `image_url` content blocks |
| Audio | Not supported | Audio is E2B/E4B variants only |
| Video | Not supported | Not available in this deployment |
| Thinking budget | Per-request | `thinking_budget_tokens` in `extra_body` |
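
To respect the 12-request ceiling client-side, a minimal sketch using an `asyncio.Semaphore` with the async OpenAI client; the cap of 12 mirrors the table above, and you should pick a lower value if other clients share the deployment:

```python
# Minimal sketch: bound client-side concurrency to the deployment ceiling
# (3 containers x 4 slots = 12) so extra requests queue locally instead of
# piling up on the server.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://<app>.modal.run/v1", api_key="YOUR_KEY")
limit = asyncio.Semaphore(12)

async def ask(prompt: str) -> str:
    async with limit:
        response = await client.chat.completions.create(
            model="gemma-4-26b-a4b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        return response.choices[0].message.content

async def main():
    answers = await asyncio.gather(*(ask(f"Question {i}") for i in range(40)))
    print(len(answers), "answers received")

asyncio.run(main())
```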

For operational caveats, monitoring, and upgrade procedures, see [Operate & compare](https://www.quantml.org/guides/gemma-4-gguf/operations).

## 9. External references {#references}

_Sources cited inline on this tab. Same URLs as the inline links above._

1. [OpenAI API: Create chat completion](https://platform.openai.com/docs/api-reference/chat/create)
2. [OpenAI API: Create response](https://platform.openai.com/docs/api-reference/responses/create)
3. [OpenAI docs: Images and vision](https://platform.openai.com/docs/guides/images-vision)
4. [OpenAI: Function calling](https://platform.openai.com/docs/guides/function-calling)
5. [DeepMind: Gemma 4 launch](https://deepmind.google/blog/gemma-4-byte-for-byte-the-most-capable-open-models/)
6. [Gemma prompt formatting](https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4)
7. [Gemma thinking docs](https://ai.google.dev/gemma/docs/capabilities/thinking)
8. [Hugging Face GGUF bundle](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF)
9. [llama.cpp Gemma parser PR #21418](https://github.com/ggml-org/llama.cpp/pull/21418)
10. [llama.cpp Responses API PR #18486](https://github.com/ggml-org/llama.cpp/pull/18486)
11. [llama.cpp server README](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)

## Related sections

- [Stack overview](https://www.quantml.org/guides/gemma-4-gguf)
- [Modal deployment](https://www.quantml.org/guides/gemma-4-gguf/deployment)
- [Runtime tuning](https://www.quantml.org/guides/gemma-4-gguf/configuration)
- [Operate & compare](https://www.quantml.org/guides/gemma-4-gguf/operations)
