[GH-ISSUE #15461] gemma4:26b thinking mode causes passive/incomplete tool-calling behavior on CUDA via /v1/chat/completions #56397

Closed
opened 2026-04-29 10:46:10 -05:00 by GiteaMirror · 2 comments

Originally created by @mikejbuckingham on GitHub (Apr 9, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15461

Description

When using gemma4:26b with thinking enabled (chat_template_kwargs: {"thinking": true}) via the OpenAI-compatible /v1/chat/completions endpoint on a Linux CUDA backend, the model reads files correctly but fails to follow through with subsequent tool calls (e.g. writing code). Instead it produces a brief conversational summary and asks "How can I help?" — effectively becoming passive mid-task.

Environment

  • OS: Linux (WSL2)
  • GPU: RTX 5090
  • Ollama version: 0.20.4
  • Model: gemma4:26b (Q4_K_M)
  • Endpoint: /v1/chat/completions
  • Thinking: enabled, reasoning_effort: low

Steps to Reproduce

  1. Pull gemma4:26b
  2. Enable thinking mode via chat_template_kwargs: {"thinking": true}
  3. Use the /v1/chat/completions endpoint in an agentic loop with tools defined (a minimal loop is sketched after this list)
  4. Ask the model to perform a multi-step task (e.g. clone a repo, then add a new API endpoint to a file)
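
For concreteness, here is a minimal sketch of such a loop, using the OpenAI Python client pointed at Ollama's OpenAI-compatible endpoint. Assumptions not stated in the report: Ollama's default localhost:11434 address, a placeholder API key (Ollama ignores it), and read_file/write_file stub tools named after the behavior described above.

```python
# Minimal agentic loop matching the repro steps (a sketch, not the
# reporter's actual harness).
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write contents to a file.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "contents": {"type": "string"},
                },
                "required": ["path", "contents"],
            },
        },
    },
]


def run_tool(name: str, args: dict) -> str:
    """Stub executor; a real harness dispatches to sandboxed file I/O."""
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    if name == "write_file":
        with open(args["path"], "w") as f:
            f.write(args["contents"])
        return "ok"
    return f"unknown tool: {name}"


messages = [{"role": "user", "content": "Read app.py, then add a /health endpoint to it."}]
while True:
    resp = client.chat.completions.create(
        model="gemma4:26b",
        messages=messages,
        tools=TOOLS,
        # Thinking enabled exactly as described in the report.
        extra_body={"chat_template_kwargs": {"thinking": True}},
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        # On CUDA this branch fired after the first read_file round,
        # with a short summary instead of a write_file call.
        print(msg.content)
        break
    for call in msg.tool_calls:
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```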

Expected Behavior

Model completes agentic tasks end-to-end — reads files, writes code, uses tools — as observed on macOS Metal with identical Ollama version, model, and thinking settings.

Actual Behavior

Model reads a file via tool call, summarizes the contents, then responds conversationally asking for further instructions instead of proceeding with the requested task. Subsequent tool calls (e.g. write_file) are never made.

Workaround

Disabling thinking ("thinking": false) restores correct agentic behavior on CUDA.
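
In request form, the workaround is just flipping that flag; this sketch assumes the client, messages, and TOOLS from the loop above:

```python
# Same request with thinking disabled, which the reporter found
# restored tool-calling on CUDA.
resp = client.chat.completions.create(
    model="gemma4:26b",
    messages=messages,
    tools=TOOLS,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
```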

Notes

The same setup (identical Ollama version 0.20.4, same model tag, same thinking settings, same harness and prompts) works correctly on macOS Metal (Apple M-series, Mac Mini 64GB RAM). This points to a platform-specific difference in how thinking output is handled on CUDA vs Metal.

Related to #15288 (thinking output routed to reasoning field on /v1/chat/completions).
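
For harness authors, a quick way to check where the thinking output actually lands in the response. The reasoning field name is taken from #15288, not from a documented schema, and the model_extra lookup relies on the OpenAI SDK's pydantic models allowing undeclared fields:

```python
# Inspect where thinking output is routed on /v1/chat/completions.
# Assumes the same local Ollama defaults as in the sketches above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="gemma4:26b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    extra_body={"chat_template_kwargs": {"thinking": True}},
)
msg = resp.choices[0].message
# Per #15288, thinking text shows up in a `reasoning` field rather than
# in `content`; the SDK surfaces undeclared fields via model_extra.
print((msg.model_extra or {}).get("reasoning"))
print(msg.content)
```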


@rick-github commented on GitHub (Apr 9, 2026):

Server logs (https://docs.ollama.com/troubleshooting) will aid in debugging.


@mikejbuckingham commented on GitHub (Apr 9, 2026):

Apologies for the noise — after extensive testing we've identified the root cause as a misconfigured context window on our end, not an Ollama CUDA bug.

Our agent harness never passes num_ctx in API requests, so Ollama was defaulting to a KV cache size (KvSize) of 32768 tokens. Our task involves reading a large file (~54KB) plus a recursive directory listing, which pushes the actual token count to or above that limit (at a rough 4 characters per token, the ~54KB file alone is on the order of 13K tokens, before counting the listing, tool schemas, and conversation history). When the context is truncated, the model silently loses the task instructions and stops generating useful output.

Our colleague's machine was working because their UI client had a context length slider set to 256k, which was being sent as num_ctx in every request. We weren't aware of this difference.

Setting OLLAMA_CONTEXT_LENGTH=65536 before starting ollama serve resolved the issue consistently across 3/3 runs. We'll fix our harness to pass num_ctx explicitly going forward.
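
For reference, both fixes in code form. OLLAMA_CONTEXT_LENGTH is Ollama's server-wide default; the per-request example uses Ollama's native /api/chat, where options.num_ctx is documented (whether the OpenAI-compatible /v1 endpoint forwards it is not confirmed here):

```python
# Option 1: raise the server-wide default before starting the server:
#   OLLAMA_CONTEXT_LENGTH=65536 ollama serve
#
# Option 2: pass num_ctx per request via Ollama's native API.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b",
        "messages": [{"role": "user", "content": "hello"}],
        "options": {"num_ctx": 65536},  # instead of the 32768 default seen here
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```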

Sorry for the wild goose chase. Feel free to close this.
