[GH-ISSUE #14793] generate API ignores think=false for qwen3.5 (chat API works) #35315

Closed
opened 2026-04-22 19:45:20 -05:00 by GiteaMirror · 4 comments

Originally created by @andreamoro on GitHub (Mar 12, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14793

What is the issue?

The generate API completely ignores think: false passed in options for qwen3.5:9b. The model still produces thinking tokens that consume the entire num_predict budget, resulting in an empty response field. The chat API with think=False as a top-level parameter works correctly.

This asymmetry causes silent failures for applications using the generate endpoint with thinking models — they get empty responses with no error.

OS / Ollama version

  • OS: Linux 6.17.0-14-generic (Ubuntu)
  • Ollama: 0.17.7
  • Model: qwen3.5:9b

Steps to reproduce

import ollama

# Test 1: generate API — think=false IGNORED
r1 = ollama.generate(
    model='qwen3.5:9b',
    prompt='Say hello in 2 words. Reply with ONLY the words.',
    options={'num_predict': 30, 'think': False}
)
print('generate think=false:')
print('  response:', repr(r1['response']))        # '' (empty!)
print('  thinking:', repr(r1['thinking'][:80]))    # still populated
print('  eval_count:', r1['eval_count'])           # 30 (all tokens burned on thinking)

# Test 2: chat API — think=False WORKS
r2 = ollama.chat(
    model='qwen3.5:9b',
    messages=[{'role': 'user', 'content': 'Say hello in 2 words. Reply with ONLY the words.'}],
    think=False,
    options={'num_predict': 30}
)
print('chat think=False:')
print('  content:', repr(r2['message']['content']))  # 'Hello there!' (actual output)
print('  eval_count:', r2['eval_count'])              # 3 (no wasted tokens)

Actual output

generate think=false:
  response: ''
  thinking: 'Thinking Process:\n\n1.  **Analyze the Request:**\n    *   Task: Say hello.\n    *   Constraint 1:'
  eval_count: 30

chat think=False:
  content: 'Hello there!'
  eval_count: 3

Expected behavior

think: false should disable thinking on BOTH generate and chat endpoints. When thinking is disabled, all num_predict tokens should be available for the visible response.
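
For reference, a minimal sketch of what the raw request should look like: think belongs at the top level of the /api/generate JSON body, next to model and prompt, while sampler parameters such as num_predict stay inside options. This sketch uses only the standard library and assumes the server is listening on the default localhost:11434:

import json
import urllib.request

# `think` is a top-level request field; `options` holds model/sampler
# parameters such as num_predict.
body = json.dumps({
    'model': 'qwen3.5:9b',
    'prompt': 'Say hello in 2 words. Reply with ONLY the words.',
    'think': False,
    'options': {'num_predict': 30},
    'stream': False,
}).encode()

req = urllib.request.Request(
    'http://localhost:11434/api/generate',
    data=body,
    headers={'Content-Type': 'application/json'},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)['response'])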

Additional context

  • Non-thinking models (e.g. gemma2:9b) are unaffected; think: false is harmlessly ignored
  • The model's Modelfile uses RENDERER qwen3.5 / PARSER qwen3.5, so thinking is handled by Ollama's built-in renderer, not the template
  • Increasing num_predict to 200+ doesn't help on generate; the model just thinks longer and response stays empty
  • Related: #14502, #14612, #14645

Workaround

Switch from the generate API to the chat API and pass think=False as a top-level parameter (not inside options).
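
For applications already built on generate, a drop-in wrapper is easy to sketch (the helper name below is illustrative, not an existing API):

import ollama

def generate_no_think(model, prompt, **options):
    # Illustrative helper: route a generate-style call through chat,
    # passing think=False as a top-level parameter rather than in options.
    r = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        think=False,
        options=options,
    )
    return r['message']['content']

print(generate_no_think('qwen3.5:9b', 'Say hello in 2 words.', num_predict=30))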

GiteaMirror added the thinking label 2026-04-22 19:45:20 -05:00

@rick-github commented on GitHub (Mar 12, 2026):

--- 14793.py.orig	2026-03-12 10:35:55.053215178 +0100
+++ 14793.py	2026-03-12 10:41:33.575422728 +0100
@@ -4,11 +4,12 @@
 r1 = ollama.generate(
     model='qwen3.5:9b',
     prompt='Say hello in 2 words. Reply with ONLY the words.',
-    options={'num_predict': 30, 'think': False}
+    think=False,
+    options={'num_predict': 30}
 )
 print('generate think=false:')
 print('  response:', repr(r1['response']))        # '' (empty!)
-print('  thinking:', repr(r1['thinking'][:80]))    # still populated
+print('  thinking:', repr(r1.get('thinking')))    # still populated
 print('  eval_count:', r1['eval_count'])           # 30 (all tokens burned on thinking)
 
 # Test 2: chat API — think=False WORKS

@andreamoro commented on GitHub (Mar 12, 2026):

Thanks @rick-github — confirmed! think=False as a top-level parameter works correctly on both generate and chat.

The actual issue is that think inside options={} is silently ignored. Updated test:

import ollama

# BROKEN: think inside options — silently ignored, empty response
r1 = ollama.generate(
    model='qwen3.5:9b',
    prompt='Say hello in 2 words.',
    options={'num_predict': 30, 'think': False}
)
print(r1['response'])  # '' (empty!)

# WORKS: think as top-level parameter
r2 = ollama.generate(
    model='qwen3.5:9b',
    prompt='Say hello in 2 words.',
    think=False,
    options={'num_predict': 30}
)
print(r2['response'])  # 'Hello there'

This is a silent failure that's easy to hit — think looks like a model parameter (similar to temperature, num_predict) so placing it in options feels natural. Perhaps Ollama could warn when unrecognized keys appear in options?
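
In the meantime a client-side guard is easy to sketch; the allow-list below is an illustrative subset, not Ollama's authoritative option set:

# Illustrative subset of valid `options` keys; not exhaustive.
KNOWN_OPTIONS = {'num_predict', 'num_ctx', 'temperature', 'top_k',
                 'top_p', 'seed', 'stop', 'repeat_penalty'}

def check_options(options):
    # Reject keys that aren't model options, e.g. `think`, which must be
    # passed as a top-level parameter instead.
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f'not model options (pass top-level?): {sorted(unknown)}')
    return options

check_options({'num_predict': 30, 'think': False})  # raises ValueError: ['think']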

@rick-github commented on GitHub (Mar 12, 2026):

time=2026-03-12T14:03:45.429Z level=WARN source=types.go:976 msg="invalid option provided" option=think

@duaneking commented on GitHub (Mar 28, 2026):

I'm still running into this issue; why not allow us to manage the thinking parameter ourselves?
