[GH-ISSUE #13154] think Parameter Not Suppressing Reasoning in qwen3:4b When Set to False #70759

Closed
opened 2026-05-04 22:51:55 -05:00 by GiteaMirror · 0 comments

Originally created by @mfaizanhassan on GitHub (Nov 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13154

What is the issue?

The think parameter in Ollama's chat API does not properly suppress reasoning content when explicitly set to False. When think=False is used, the model still generates extensive reasoning/thinking content instead of providing a direct answer.

Environment

  • Model: qwen3:4b
  • Ollama API: Python SDK (ollama.chat)
  • Expected Behavior: Based on the Qwen3 documentation (https://arxiv.org/pdf/2505.09388), the model supports /think and /no_think directives to control reasoning output

Steps to Reproduce

The issue can be reproduced by comparing the raw model behavior with Ollama's think parameter implementation.


Experiment 1: Raw Qwen Model with /think and /no_think (Expected Behavior)

Using the Hugging Face transformers library directly with the Qwen3-4B model demonstrates the correct behavior:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    torch_dtype=torch.float16,
    device_map="auto"
)

base_prompt = "What is 5+5?"

# Test with /think
user_input_think = base_prompt + " /think"
messages_think = [{"role": "user", "content": user_input_think}]
text_think = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True
)
inputs_think = tokenizer(text_think, return_tensors="pt").to(device)
outputs_think = model.generate(**inputs_think, max_new_tokens=512)
response_think = tokenizer.decode(outputs_think[0][len(inputs_think.input_ids[0]):], skip_special_tokens=False)

print(f"With /think: {response_think}")

# Test with /no_think
user_input_no_think = base_prompt + " /no_think"
messages_no_think = [{"role": "user", "content": user_input_no_think}]
text_no_think = tokenizer.apply_chat_template(
    messages_no_think,
    tokenize=False,
    add_generation_prompt=True
)
inputs_no_think = tokenizer(text_no_think, return_tensors="pt").to(device)
outputs_no_think = model.generate(**inputs_no_think, max_new_tokens=512)
response_no_think = tokenizer.decode(outputs_no_think[0][len(inputs_no_think.input_ids[0]):], skip_special_tokens=False)

print(f"With /no_think: {response_no_think}")

Output:

With /think:

<think>
Okay, let's see. The user is asking "What is 5+5?" That seems straightforward...
[extensive reasoning omitted for brevity]
I think that's all. There's no ambiguity here. The answer is definitely 10.
</think>

The sum of 5 and 5 is **10**.

With /no_think:

<think>

</think>

5 + 5 equals 10.

Result

The raw model correctly responds to /think and /no_think directives:

  • /think: Generates extensive reasoning inside <think> tags
  • /no_think: Generates empty <think></think> tags with direct answer only
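
In addition to these prompt-level soft switches, Qwen's model card documents an enable_thinking argument on the chat template. The sketch below is not part of the original report; it assumes the installed tokenizer's template supports that flag and reuses the tokenizer, model, device, and base_prompt defined above:

# Sketch (assumption): hard-disable thinking via the chat template rather than
# the /no_think directive; reuses tokenizer/model/device/base_prompt from above.
text_template_off = tokenizer.apply_chat_template(
    [{"role": "user", "content": base_prompt}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs_template_off = tokenizer(text_template_off, return_tensors="pt").to(device)
outputs_template_off = model.generate(**inputs_template_off, max_new_tokens=512)
print(tokenizer.decode(
    outputs_template_off[0][len(inputs_template_off.input_ids[0]):],
    skip_special_tokens=False
))

If Ollama's think=False is intended to map to this template-level switch, the expected output would match the /no_think case above: empty <think></think> tags followed by the direct answer.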

Experiment 2: Ollama Chat API with think Parameter (Buggy Behavior)

Testing Ollama's think parameter with the same model shows incorrect behavior:

from ollama import chat, ChatResponse

test_cases = [False, None, True]
query = "What is 5+5?"

for think_value in test_cases:
    print(f"\nTesting with think={think_value}")

    response: ChatResponse = chat(**{
        'messages': [{'role': 'user', 'content': query}],
        'model': 'qwen3:4b',
        'think': think_value,
        'format': None,
        'options': {},
        'keep_alive': None
    })

    print(f"Content: {response.message.content}")
    print(f"Thinking: {response.message.thinking}")

Output:

1. think=False (BUGGY)

Content: Okay, the user asked "What is 5+5?" That seems straightforward. Let me think about how to approach this.

First, I should confirm what they're really asking. It's a basic math question, so they might be a kid learning addition, or maybe someone testing if I can do simple math...

[extensive reasoning continues in content field]

The result of **5 + 5** is **10**.

Thinking: None

Problem: The model generates full reasoning content despite think=False. The reasoning is embedded in the content field instead of being suppressed.

2. think=None (Default behavior)

Content: The answer to **5 + 5** is **10**.

Thinking: Okay, the user asked "What is 5+5?" Hmm, this seems like a very basic arithmetic question...
[reasoning properly separated]

Result: Thinking content is properly separated into the thinking field.

3. think=True (Explicit enable)

Content: The result of **5 + 5** is **10**.

Thinking: Okay, the user asked "What is 5+5?" Hmm, this seems like a super basic math question...
[reasoning properly separated]

Result: Same as think=None, thinking content properly separated.


Expected vs Actual Behavior

Expected Behavior for think=False

Based on how the raw model responds to /no_think, Ollama should:

  • Suppress all reasoning/thinking content
  • Return only a direct answer (similar to /no_think output)
  • content field: Direct answer only
  • thinking field: None or empty (a REST-level sketch of this expectation follows below)
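
For reference, the same expectation can be expressed against the REST endpoint that the Python SDK wraps. This sketch is not from the original report; it assumes a local Ollama server on the default port and that /api/chat accepts a top-level think field, as the SDK parameter suggests:

import requests

# Sketch (assumption): expected behavior of think=False via the raw /api/chat
# endpoint; assumes a local Ollama server at the default address.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": "What is 5+5?"}],
        "think": False,
        "stream": False,
    },
    timeout=120,
)
body = resp.json()
print(body["message"]["content"])       # expected: the direct answer only
print(body["message"].get("thinking"))  # expected: None / absent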

Actual Behavior for think=False

Currently, Ollama:

  • Does NOT suppress reasoning content
  • Generates full thinking/reasoning process
  • Embeds reasoning in content field instead of separating it
  • Sets thinking field to None (but this doesn't help, since the reasoning is in content); a possible interim workaround is sketched below
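
Until this is fixed, one possible interim workaround is to fall back to the model's own /no_think soft switch, which Experiment 1 shows the raw model honors. Whether Ollama's chat template passes the directive through unchanged has not been verified here, so treat this as a sketch rather than a confirmed fix:

from ollama import chat

# Possible workaround (unverified): use the model's /no_think soft switch
# instead of think=False; assumes the directive survives Ollama's prompt templating.
response = chat(
    model='qwen3:4b',
    messages=[{'role': 'user', 'content': 'What is 5+5? /no_think'}],
)
print(response.message.content)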

Additional Context

  • Reference: Qwen3 Technical Report, section on reasoning control (https://arxiv.org/pdf/2505.09388)
  • The issue is reproducible with qwen3:4b
  • This bug affects downstream libraries, such as LangChain, that rely on Ollama's think parameter to control reasoning output

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.12.6

GiteaMirror added the bug label 2026-05-04 22:51:55 -05:00

Reference: github-starred/ollama#70759