[GH-ISSUE #15539] [Bug] gemma4 parser fails to extract tool_calls when combining system prompt + think:false + tools #71988

Closed
opened 2026-05-05 03:15:19 -05:00 by GiteaMirror · 6 comments

Originally created by @vfreysz on GitHub (Apr 13, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15539

Originally assigned to: @drifkin on GitHub.

What is the issue?

The gemma4 parser in Ollama 0.20.6 fails to extract tool calls from the model response when a system prompt is combined with think: false and tools. The model correctly generates the tool call JSON, but the parser does not intercept it — the raw JSON leaks into the content field instead of being placed in the tool_calls field.

This breaks Home Assistant's Ollama integration, which always sends a system prompt (containing assistant instructions and exposed entity definitions) along with tool definitions.

Environment

  • Ollama version: 0.20.6
  • Model: gemma4:e4b (official, pulled via ollama pull gemma4:e4b)
  • OS: Ubuntu 24.04 (LXC container on Proxmox VE)
  • Hardware: AMD Ryzen 7 8745HS, Radeon 780M iGPU (ROCm), 32 GB RAM
  • Client: Home Assistant OS (Core 2026.4.2, Supervisor 2026.03.3, OS 17.2, Frontend 20260325.7) Ollama integration + direct curl testing

Reproduction steps

Run the following three curl commands against a fresh gemma4:e4b model. They demonstrate that the bug only occurs with a specific combination.

Test 1 — No system prompt + think: false → ✅ WORKS

curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [{"role": "user", "content": "What is the weather in Talence?"}],
  "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather info", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}}}],
  "stream": false,
  "think": false
}' | python3 -m json.tool

Result: content is empty, tool_calls is correctly populated:

{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "id": "call_g76u5xbz",
        "function": {
          "index": 0,
          "name": "get_weather",
          "arguments": {"location": "Talence"}
        }
      }
    ]
  }
}

Test 2 — System prompt + thinking active (default) → ⚠️ tool_calls OK but thinking leaks

curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    {"role": "system", "content": "Tu es Jarvis, assistant domotique. Réponds en français."},
    {"role": "user", "content": "Je veux la météo"}
  ],
  "tools": [{"type": "function", "function": {"name": "GetLiveContext", "description": "Get live context", "parameters": {"type": "object", "properties": {}, "required": []}}}],
  "stream": false
}' | python3 -m json.tool

Result: tool_calls is correctly populated, but the thinking field contains a long reasoning chain (~14 seconds of generation). The tool calling itself works:

{
  "message": {
    "role": "assistant",
    "content": "",
    "thinking": "1. **Analyze the Request:** ... (long reasoning) ... 6. **Generate the tool call:** Call GetLiveContext.",
    "tool_calls": [
      {
        "id": "call_bgl0bmz2",
        "function": {
          "index": 0,
          "name": "GetLiveContext",
          "arguments": {}
        }
      }
    ]
  }
}

Test 3 — System prompt + think: false → ❌ BUG — tool_calls not parsed

curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    {"role": "system", "content": "Tu es Jarvis, assistant domotique. Réponds en français."},
    {"role": "user", "content": "Je veux la météo"}
  ],
  "tools": [{"type": "function", "function": {"name": "GetLiveContext", "description": "Get live context", "parameters": {"type": "object", "properties": {}, "required": []}}}],
  "stream": false,
  "think": false
}' | python3 -m json.tool

Result: The model generates the correct tool call JSON, but the parser does NOT intercept it. The raw JSON leaks into content with a trailing <channel|> token:

{
  "message": {
    "role": "assistant",
    "content": "{\n  \"tool_calls\": [\n    {\n      \"function\": \"GetLiveContext\",\n      \"args\": {}\n    }\n  ]\n}\n<channel|>"
  }
}

No tool_calls field is present. No thinking field.
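
For anyone scripting around this, the two failure modes can be told apart client-side. A minimal sketch, assuming jq is installed and the Test 3 request body has been saved to a file named test3.json (a hypothetical file name, not part of the report):

# Sketch: distinguish a parsed tool call from the leaked-JSON failure mode.
resp=$(curl -s http://localhost:11434/api/chat -d @test3.json)

if echo "$resp" | jq -e '.message.tool_calls | length > 0' > /dev/null; then
  echo "PASS: tool_calls parsed"
elif echo "$resp" | jq -e '.message.content | contains("<channel|>")' > /dev/null; then
  echo "FAIL: raw tool-call JSON leaked into content"
else
  echo "FAIL: model made no tool call at all"
fi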

Summary

| Test | System prompt | think: false | tool_calls parsed           | Duration |
| ---- | ------------- | ------------ | --------------------------- | -------- |
| 1    | ❌ No         | ✅ Yes       | ✅ Yes                      | ~2s      |
| 2    | ✅ Yes        | ❌ No        | ✅ Yes (but thinking leaks) | ~14s     |
| 3    | ✅ Yes        | ✅ Yes       | ❌ No — JSON in content     | ~2s      |
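
To re-check the whole matrix after an upgrade, the three requests can be replayed in a loop. A sketch assuming jq and hypothetical files test1.json, test2.json, test3.json holding the three request bodies above:

# Sketch: replay the three requests and report whether tool_calls was parsed.
for t in test1 test2 test3; do
  parsed=$(curl -s http://localhost:11434/api/chat -d @"$t.json" \
    | jq -r 'if (.message.tool_calls | length) > 0 then "yes" else "no" end')
  echo "$t: tool_calls parsed = $parsed"
done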

Expected behavior

Test 3 should produce the same structured tool_calls output as Test 1, since the only difference is the addition of a system prompt. The think: false flag should disable thinking without breaking tool call parsing.

Impact

This bug makes gemma4:e4b unusable with any client that sends a system prompt alongside tools and think: false, including:

  • Home Assistant Ollama integration (always sends a system prompt with entity definitions)
  • Any OpenAI-compatible client using system prompts with tool definitions

The workaround of leaving thinking enabled (Test 2) works for tool calling but adds 10+ seconds of latency and causes thinking tokens to leak into streaming clients.
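
Streaming clients that cannot wait for a fix can drop the thinking deltas themselves. A minimal sketch, assuming jq and that each streamed line is a JSON chunk whose message carries separate thinking and content fields, as in the /api/chat streaming format:

# Sketch: print only content deltas, silently discarding thinking tokens.
# Assumes test2-streaming.json is the Test 2 body with "stream": true
# (hypothetical file name).
curl -sN http://localhost:11434/api/chat -d @test2-streaming.json \
  | jq -rj '.message.content // empty'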

Possibly related issues

  • #15241 — gemma4 tool call parsing fails
  • #15315 — gemma4:e4b tool parsing errors persist in 0.20.1
  • #15254 — fix gemma4 arg parsing with quoted strings
  • #15306 — rework gemma4 tool call handling

@drifkin commented on GitHub (Apr 13, 2026):

I suspect this is related to https://github.com/ollama/ollama/issues/15536, working on a fix for that right now.

I think what's going on is I started passing an empty think block to the model when think is false, but that should only be done for the larger two gemma4 models.

Did you happen to notice if this is a regression on v0.20.6 and it worked previously?

(also for Test 2, isn't that correct? Is the warning sign because waiting for thinking is especially inconvenient for this use case?)


@vfreysz commented on GitHub (Apr 13, 2026):

Thanks for the fast response!

Regarding the regression: kind of. On v0.20.5 with gemma4:e4b, weather tool calls worked roughly 2 out of 4 times (intermittent). On v0.20.6 the behavior changed: Test 3 (system prompt + think: false) now consistently fails with raw JSON in content, whereas before it would sometimes work. So the consistent failure in Test 3 appears to be a regression introduced in v0.20.6.

Regarding Test 2: Yes, the tool call itself works correctly in Test 2 — the tool_calls field is properly populated. The ⚠️ is because in my use case (Home Assistant voice assistant with Piper TTS), the thinking tokens leak into the streaming response and get read aloud by the text-to-speech engine before the actual answer. The workaround I found is enabling "Think before responding" in the Home Assistant Ollama config, which tells HA to filter thinking tokens from the output. So Test 2 is technically functional, just slow (~4s vs ~2s) due to the thinking overhead. But it makes the experience much less smooth.

Current workaround: Using gemma4:e2b with "Think before responding" enabled in Home Assistant, which filters the <channel|> tokens properly. Would love to get back to e4b once #15536 is fixed.


@drifkin commented on GitHub (Apr 13, 2026):

https://github.com/ollama/ollama/releases/tag/v0.20.7-rc1 is up with the fix for #15536, if you're able to give that a try I'm curious if it fixes it.


@vfreysz commented on GitHub (Apr 14, 2026):

v0.20.7-rc1 fixes the JSON-in-content bug (Test 3 no longer leaks raw JSON). However, with think: false, the E4B model doesn't make tool calls at all — it asks clarifying questions instead of calling the tool. With thinking enabled (no think parameter), tool calls work perfectly: clean tool_calls, empty content, 3 seconds.
It seems the E4B model needs thinking to reason about when to use tools. So for my use case, the working config is: thinking enabled + Home Assistant filtering the thinking tokens from the TTS output. The critical fix in rc1 is that the <channel|> tokens no longer leak into content.
Thanks for the quick turnaround!


@drifkin commented on GitHub (Apr 14, 2026):

awesome, thanks so much for testing! Small models often do better tool calling with reasoning, so that makes sense to me. You might be able to get it to tool call even without thinking if you're more prescriptive in telling it explicitly to make a tool call, but the quality of the tool call might be worse.


@vfreysz commented on GitHub (Apr 14, 2026):

You were right! I managed to get tool calling working with think: false by being more prescriptive in the system prompt. Here's what made the difference:
Adding this at the very top of the prompt, before any other instructions:

IMPORTANT: NEVER generate text before calling a tool. When a question requires a tool, call it immediately without saying anything. After receiving the result, give your answer directly.
Forbidden examples: "Let me check", "I'll look that up", "One moment"
Correct example: call the tool silently then say "It's 14 degrees in Talence, cloudy skies."

The key was providing explicit positive/negative examples of expected behavior. Without these examples, the E4B with think: false would say "Let me check" instead of calling the tool. With the examples, it calls GetLiveContext immediately — clean content: "", proper tool_calls, 9 tokens in 0.3s.
So the final working config is: v0.20.7-rc1 + think: false + prescriptive prompt with examples. Best of both worlds — fast responses without thinking overhead, and reliable tool calls.
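
For reference, a condensed version of that final request (the system-prompt wording is abbreviated from the excerpt above and is illustrative, not the exact production prompt):

curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    {"role": "system", "content": "IMPORTANT: NEVER generate text before calling a tool. When a question requires a tool, call it immediately without saying anything. After receiving the result, give your answer directly.\n\nTu es Jarvis, assistant domotique. Réponds en français."},
    {"role": "user", "content": "Je veux la météo"}
  ],
  "tools": [{"type": "function", "function": {"name": "GetLiveContext", "description": "Get live context", "parameters": {"type": "object", "properties": {}, "required": []}}}],
  "stream": false,
  "think": false
}' | python3 -m json.tool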
Thanks again for the quick fix!


Reference: github-starred/ollama#71988