[GH-ISSUE #15260] think=false breaks format (structured output) for gemma4 — format constraint silently ignored #71819

Closed
opened 2026-05-05 02:37:31 -05:00 by GiteaMirror · 15 comments

Originally created by @AIVTDevPKevin on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15260

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

Description

When using gemma4:26b-a4b-it-q4_K_M with the format parameter (JSON schema structured output), setting think=false causes the format constraint to be completely ignored. The model outputs plain text instead of the requested JSON structure.

If think is omitted (not sent at all), the format works correctly — but the model then defaults to thinking mode, adding unwanted latency.

This is the same class of bug as #14645 (qwen3.5 series), but confirmed to also affect gemma4. gemma4 uses <|think|> tokens in its chat template for thinking control, similar to how qwen3.5 models handle thinking.

Environment

  • Ollama version: 0.20.0
  • Model: gemma4:26b-a4b-it-q4_K_M (SHA: 7121486771cb)
  • OS: Windows 11 (10.0.26200)
  • GPU: NVIDIA GeForce RTX 4090 (Driver 582.32)
  • CPU: Intel Core i9-14900
  • Tested via: Direct HTTP API calls (curl / requests.post) — not SDK-specific

Minimal Reproduction (via HTTP)

# ❌ FAIL: think=false + format → format is silently IGNORED, outputs plain text
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-a4b-it-q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "think": false,
  "format": {
    "type": "object",
    "properties": {
      "emotion": {"type": "string", "enum": ["happy","sad","neutral"]},
      "response_text": {"type": "string"}
    },
    "required": ["emotion", "response_text"]
  }
}' | python -m json.tool
# → message.content = plain text (NOT JSON), format completely ignored

# ✅ OK: think omitted + format → format works, but model defaults to thinking (extra latency)
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-a4b-it-q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "emotion": {"type": "string", "enum": ["happy","sad","neutral"]},
      "response_text": {"type": "string"}
    },
    "required": ["emotion", "response_text"]
  }
}' | python -m json.tool
# → message.content = valid JSON: {"emotion": "happy", "response_text": "..."}
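
The same failure reproduces from Python over raw HTTP. Below is a minimal harness (a sketch, assuming a local server on the default port and the model tag above) that runs both variants and reports whether the returned content parses as JSON:

import json
import requests

SCHEMA = {
    "type": "object",
    "properties": {
        "emotion": {"type": "string", "enum": ["happy", "sad", "neutral"]},
        "response_text": {"type": "string"},
    },
    "required": ["emotion", "response_text"],
}

def run(think):
    body = {
        "model": "gemma4:26b-a4b-it-q4_K_M",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "stream": False,
        "format": SCHEMA,
    }
    if think is not None:  # omit the key entirely to get the default (thinking) path
        body["think"] = think
    r = requests.post("http://localhost:11434/api/chat", json=body, timeout=120)
    r.raise_for_status()
    content = r.json()["message"]["content"]
    try:
        json.loads(content)
        print(f"think={think!r}: valid JSON")
    except json.JSONDecodeError:
        print(f"think={think!r}: plain text -- format silently ignored")

run(think=False)  # reproduces the bug: plain text
run(think=None)   # format honored, but the model thinks first (~4s)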

Test Results (4 scenarios, all via HTTP)

#  Mode        think      format       Result
1  non-stream  false      JSON schema  ❌ Plain text — format ignored
2  non-stream  (omitted)  JSON schema  ✅ Valid JSON (emotion=happy)
3  stream      false      JSON schema  ❌ Plain text — format ignored
4  stream      (omitted)  JSON schema  ✅ Valid JSON (emotion=happy)

Expected Behavior

think=false + format should produce valid JSON matching the schema (same as when think is omitted, but without the thinking overhead).

Actual Behavior

When think=false is sent, the format constraint is silently dropped. The model generates unconstrained plain text as if format was never specified.

Root Cause Analysis

Same as described in #14645: Ollama appears to defer format probability masking until it sees the end-of-thinking token in the model's output. When think=false is set, the chat template pre-closes the thinking tags in the prompt, so the model never emits the end-of-thinking token and the masking is never applied.
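
For illustration only: this is not Ollama's actual Go code, just a Python sketch of the deferred-masking pattern hypothesized above. If the constraint is armed solely by seeing the end-of-thinking token in the output stream, then a prompt-side think=false means the trigger never fires:

END_OF_THINK = "<|/think|>"  # placeholder name; the real closing tag is model-specific

def sample_loop(output_tokens, constrain):
    """Sketch: apply the format constraint only after the end-of-thinking
    token has been observed in the model's own output stream."""
    masking_active = False
    result = []
    for tok in output_tokens:
        if tok == END_OF_THINK:
            masking_active = True  # armed only by model output, never by the prompt
            continue
        result.append(constrain(tok) if masking_active else tok)
    return result

# str.upper stands in for the real grammar mask in this toy demo.
# think omitted: the model emits the closing tag itself, so masking activates.
print(sample_loop(["thinking...", END_OF_THINK, '{"emotion": ...}'], str.upper))
# think=false: the template already closed the thinking block in the PROMPT;
# the tag never appears in the output, so every token is emitted unconstrained.
print(sample_loop(["Hello!", " How can I help?"], str.upper))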

Notes

  • Not SDK-specific: Tested with both ollama Python SDK and raw HTTP POST to /api/chat — identical behavior.
  • Not model-specific to qwen3.5: This affects gemma4 as well. Other models without thinking templates (e.g., gpt-oss:20b) work correctly with think=false + format.
  • Related: #14645 (qwen3.5), #14850 (qwen3.5:27b, closed as dup), #10929 (invalid JSON with think=true), #10538 (feature request for thinking + structured output)
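
Until a fix ships, a possible client-side mitigation (a sketch, not an official workaround): try the fast think=false path, validate the content, and fall back to a request with think omitted when the constraint was dropped:

import json
import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"

def chat_structured(body):
    """Fast path first (think=false); if the format constraint was silently
    dropped and the content is not valid JSON, retry with think omitted."""
    fast = dict(body, think=False, stream=False)
    r = requests.post(OLLAMA_CHAT, json=fast, timeout=120)
    r.raise_for_status()
    try:
        return json.loads(r.json()["message"]["content"])
    except json.JSONDecodeError:
        slow = {k: v for k, v in fast.items() if k != "think"}  # omit think entirely
        r = requests.post(OLLAMA_CHAT, json=slow, timeout=120)
        r.raise_for_status()
        return json.loads(r.json()["message"]["content"])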

Relevant log output

Server log shows no errors — all 4 requests returned HTTP 200:

[GIN] 2026/04/03 - 14:27:27 | 200 | 787.3221ms | 127.0.0.1 | POST "/api/chat"  ← Test 1 (think=false, ~0.8s, format IGNORED)
[GIN] 2026/04/03 - 14:27:33 | 200 | 3.8770242s | 127.0.0.1 | POST "/api/chat"  ← Test 2 (think omitted -> default: true, ~3.9s, format OK)
[GIN] 2026/04/03 - 14:27:36 | 200 | 645.5903ms | 127.0.0.1 | POST "/api/chat"  ← Test 3 (think=false, ~0.6s, format IGNORED)
[GIN] 2026/04/03 - 14:27:42 | 200 | 4.0513075s | 127.0.0.1 | POST "/api/chat"  ← Test 4 (think omitted -> default: true, ~4.1s, format OK)

Note: think=false requests complete much faster (~0.7s vs ~4s) because the model skips
thinking — but the format constraint is silently ignored, producing plain text instead of JSON.
No warnings or errors are logged server-side when format is ignored.

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.20.0

GiteaMirror added the bug label 2026-05-05 02:37:31 -05:00

@johnnyxwan commented on GitHub (Apr 3, 2026):

The qwen3.5 series has a similar problem (#14645); not sure if they are related.


@toutjavascript commented on GitHub (Apr 4, 2026):

Thank you for describing this bug. I thought I was going crazy.


@VladimirGav commented on GitHub (Apr 4, 2026):

I have the same problem. JSON breaks when using VladimirGav/gemma4-26b-16GB-VRAM


@thiswillbeyourgithub commented on GitHub (Apr 6, 2026):

I think it's really bad that this kind of error is not caught by tests. Structured output is both a major selling point of LLMs and often broken.
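
For what it's worth, this failure mode is mechanically checkable. A sketch of the kind of integration test that could catch it (hypothetical, not from Ollama's test suite; assumes a running server with the model pulled):

import json
import pytest
import requests

@pytest.mark.parametrize("think", [False, None])
def test_format_enforced_regardless_of_think(think):
    body = {
        "model": "gemma4:26b-a4b-it-q4_K_M",  # any thinking-capable model
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
        "format": {
            "type": "object",
            "properties": {"emotion": {"type": "string"}},
            "required": ["emotion"],
        },
    }
    if think is not None:
        body["think"] = think
    r = requests.post("http://localhost:11434/api/chat", json=body, timeout=300)
    r.raise_for_status()
    # Structured-output contract: content must parse as JSON in both cases.
    json.loads(r.json()["message"]["content"])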


@MMaturax commented on GitHub (Apr 7, 2026):

In both models 26b and 31b, structured output doesn't work correctly. This isn't related to Ollama; the same problem exists in Google AI Studio. It's impossible to use the Gemma 4 family for structured output and agentic tool calls in this way.


@thiswillbeyourgithub commented on GitHub (Apr 7, 2026):

But the model I'm using is gemma4:e4b-it-q8_0.


@AIVTDevPKevin commented on GitHub (Apr 8, 2026):

> In both models 26b and 31b, structured output doesn't work correctly. This isn't related to Ollama; the same problem exists in Google AI Studio. It's impossible to use the Gemma 4 family for structured output and agentic tool calls in this way.

Actually, I deployed gemma-4-26B-A4B-it using vLLM to test this out. When interacting with it via the OpenAI-compatible API—strictly enforcing structured outputs while disabling the thinking phase—it successfully returned the correct structured response without any issues.

Given this behavior, the root cause most likely lies within Ollama's implementation rather than in an upstream issue with the model itself.
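
For reference, the vLLM cross-check looked roughly like the following (a sketch: the served model name and the flag for disabling the thinking phase are assumptions here, since thinking toggles are template-specific):

import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # vLLM's OpenAI-compatible endpoint
    json={
        "model": "google/gemma-4-26B-A4B-it",  # served model name: assumption
        "messages": [{"role": "user", "content": "Hello!"}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "emotion_reply",
                "schema": {
                    "type": "object",
                    "properties": {
                        "emotion": {"type": "string", "enum": ["happy", "sad", "neutral"]},
                        "response_text": {"type": "string"},
                    },
                    "required": ["emotion", "response_text"],
                },
            },
        },
        # Flag name is an assumption; the thinking toggle depends on the chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    },
    timeout=120,
)
resp.raise_for_status()
print(json.loads(resp.json()["choices"][0]["message"]["content"]))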


@johnnyxwan commented on GitHub (Apr 9, 2026):

It has been over a month since the problem was reported in #14645, with at least 2 pull request candidates (#14660, #14923) that could potentially fix this issue. I would appreciate some insight into why this system-breaking, clearly stated, and easily fixable bug has remained unresolved for so long.


@MingStar commented on GitHub (Apr 12, 2026):

> It has been over a month since the problem was reported in #14645, with at least 2 pull request candidates (#14660, #14923) that could potentially fix this issue. I would appreciate some insight into why this system-breaking, clearly stated, and easily fixable bug has remained unresolved for so long.

Was pulling my hair out for a day with Ollama v0.20.5.

Could we fix this issue sooner for at least 2 popular model series (qwen and gemma)?


@johnnyxwan commented on GitHub (Apr 18, 2026):

The new qwen3.6 is also affected; the problem persists in 0.21.0, as expected.


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15260
Analyzed: 2026-04-18T18:22:49.467732

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


@johnnyxwan commented on GitHub (Apr 19, 2026):

@ParthSareen Thank you


@johnnyxwan commented on GitHub (Apr 23, 2026):

The fix shipped in v0.21.1 and it works; however, it is scoped only to gemma4. @ParthSareen, is there hope that it can be extended to qwen3.5 and qwen3.6? The very same problem is described in #14645, and I am quite sure it can be fixed by simply adding qwen3.5 and qwen3.6 to the scope. Thank you.


@Shedletsky commented on GitHub (Apr 23, 2026):

Just came to this thread from Google after wondering if structured output was working with thinking yet for Qwen3:*b in Ollama.

I hacked around this limitation for a project 4 months ago and was hoping to be able to fix it.

It's not clear to me whether this is an Ollama issue or a model limitation.


@johnnyxwan commented on GitHub (Apr 24, 2026):

> Just came to this thread from Google after wondering if structured output was working with thinking yet for Qwen3:*b in Ollama.
>
> I hacked around this limitation for a project 4 months ago and was hoping to be able to fix it.
>
> It's not clear to me whether this is an Ollama issue or a model limitation.

gemma4 is fixed with the merge. For qwen3, models without separate instruct/thinking variants, such as 8b and latest, are also affected. For affected qwen3, qwen3.5, or qwen3.6 models, please follow #14645.

Reference: github-starred/ollama#71819