[GH-ISSUE #14645] format is ignored when think is disabled for qwen3.5 series #56001

Open
opened 2026-04-29 10:07:35 -05:00 by GiteaMirror · 13 comments

Originally created by @johnnyxwan on GitHub (Mar 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14645

### What is the issue?

Format is ignored when think is disabled for qwen3.5 series

I put an example below, with temperature set to 0, so that anyone can try to reproduce it.
Ollama version: 0.17.6
Model: qwen3.5:35b-a3b (3460ffeede54)

I believe this can be fixed with 1) proper output-token probability masking, and 2) an empty thinking tag `<think>\n\n</think>\n\n` in the template when thinking is disabled.
https://huggingface.co/Qwen/Qwen3.5-35B-A3B/blob/main/chat_template.jinja#L149

It appears that Ollama expects the end-of-thinking token before it engages the probability masking for formatting. But since the tag is already closed in the template, the model never outputs that token. As a result, the masking is never applied.
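
To make the hypothesis concrete, here is a toy sketch of the suspected gating. All names here are hypothetical illustrations, not Ollama's actual code: the point is that if the format gate arms only once `</think>` appears in the *generated* stream, a template that pre-closes the tag means it never arms.

```python
# Toy sketch of the suspected gating -- hypothetical names, not Ollama's code.

def mask_to_grammar(token: str, fmt: str) -> None:
    """Stub: a real sampler would constrain next-token logits to `fmt` here."""

def sample_with_format(generated_tokens: list[str], fmt: str) -> None:
    thinking_done = False
    for tok in generated_tokens:
        if not thinking_done:
            # Masking is armed only once '</think>' shows up in the
            # *generated* stream.
            thinking_done = tok == "</think>"
            continue
        mask_to_grammar(tok, fmt)

# think=True: the model generates '</think>' itself, so masking engages.
sample_with_format(["some thoughts...", "</think>", '{"answer":'], "json")

# think=False: the template already placed '<think>\n\n</think>\n\n' in the
# prompt, so '</think>' never appears in the generated stream,
# mask_to_grammar() is never reached, and the format is silently ignored.
sample_with_format(["The sky appears blue..."], "json")
```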

### Relevant output

[think = True, format = None]
Normal since format is not enabled.

```python
response = client.chat(
    model = 'qwen3.5:35b-a3b',
    messages=[{'role': 'user', 'content': 'why is the sky blue'}],
    think=True,
    options={
        'temperature': 0
    }
)

print('Thinking exists?', 'thinking' in response['message'])
print('===')
print(response['message']['content'])
```

```shell
Thinking exists? True
===
The sky is blue due to a phenomenon called **Rayleigh scattering**. Here is a simple breakdown of how it works:

**1. Sunlight looks white, but isn't**
...
```

[think = False, format = None]
Again, normal since format is not enabled.

```python
response = client.chat(
    model = 'qwen3.5:35b-a3b',
    messages=[{'role': 'user', 'content': 'why is the sky blue'}],
    think=False,
    options={
        'temperature': 0
    }
)

print('Thinking exists?', 'thinking' in response['message'])
print('===')
print(response['message']['content'])
```

```shell
Thinking exists? False
===
The sky appears blue due to a phenomenon called **Rayleigh scattering**.

Here is how it works:
...
```

[think = True, format = 'json']
Normal, which shows that format on its own works when thinking is enabled.

```python
response = client.chat(
    model = 'qwen3.5:35b-a3b',
    messages=[{'role': 'user', 'content': 'why is the sky blue'}],
    think=True,
    format='json',
    options={
        'temperature': 0
    }
)

print('Thinking exists?', 'thinking' in response['message'])
print('===')
print(response['message']['content'])
```

```shell
Thinking exists? True
===
{"answer":"The sky is blue due to a phenomenon called
...
```

[think = False, format = 'json']
JSON is not returned in this case, which shows that format is ignored only when thinking is disabled.

```python
response = client.chat(
    model = 'qwen3.5:35b-a3b',
    messages=[{'role': 'user', 'content': 'why is the sky blue'}],
    think=False,
    format='json',
    options={
        'temperature': 0
    }
)

print('Thinking exists?', 'thinking' in response['message'])
print('===')
print(response['message']['content'])
```

```shell
Thinking exists? False
===
The sky appears blue due to a phenomenon called **Rayleigh scattering**.

Here is how it works:
...
```
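
For convenience, the four combinations above can be checked in one loop. A minimal sketch, assuming the same local client setup as in the examples above:

```python
import json

from ollama import Client

client = Client()  # assumes a local `ollama serve`, as in the examples above

for think in (True, False):
    for fmt in (None, 'json'):
        response = client.chat(
            model='qwen3.5:35b-a3b',
            messages=[{'role': 'user', 'content': 'why is the sky blue'}],
            think=think,
            format=fmt,
            options={'temperature': 0},
        )
        content = response['message']['content']
        if fmt == 'json':
            try:
                json.loads(content)
                verdict = 'valid JSON'
            except json.JSONDecodeError:
                verdict = 'NOT JSON (the bug)'
        else:
            verdict = 'plain text, as expected'
        print(f'think={think} format={fmt}: {verdict}')
```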

### Ollama version

0.17.6

GiteaMirror added the bug label 2026-04-29 10:07:35 -05:00

@johnnyxwan commented on GitHub (Mar 6, 2026):

@majiayu000 thank you, this exactly solves the problem; looking forward to seeing it merged.


@majiayu000 commented on GitHub (Mar 6, 2026):

Happy to help


@arnoudius commented on GitHub (Mar 7, 2026):

Nice, experiencing the same issue


@BigArty commented on GitHub (Mar 18, 2026):

Why is this fix not included in 0.18, 0.18.1 and 0.18.2-rc? It seems like it was available two weeks ago already.


@johnnyxwan commented on GitHub (Mar 19, 2026):

Just ran some tests, the problem persists as expected in v0.18.2.


@johnnyxwan commented on GitHub (Mar 31, 2026):

Confirmed that the problem persists as expected in v0.19.0.


@BigArty commented on GitHub (Apr 7, 2026):

Still does not work in v0.20.2


@BigArty commented on GitHub (Apr 10, 2026):

Moreover, on 0.20.2 this combination of arguments ignores format completely even with think=True:

```python
response = client_llm.chat(
    model="qwen3.5:35b",
    messages=messages,
    context_length=40000,
    top_p=0.95,
    top_k=20,
    temperature=0.7,
    # repeat_penalty=1.5,
    max_tokens=10000,
    think=True,
    stream=True,
    format=ProjectOverview,
    tools=tool_list,
)
```

It also seems that tools are sometimes not called correctly: I see the <> tags in the output. So the format is not working for them either.
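
As an aside, the documented structured-outputs usage in ollama-python passes a JSON schema rather than the Pydantic class itself. A minimal sketch (the fields of ProjectOverview are invented here for illustration, since the real model is not shown in the comment):

```python
from pydantic import BaseModel

from ollama import Client

class ProjectOverview(BaseModel):
    # Invented fields for illustration; the real schema isn't in the issue.
    name: str
    summary: str

client = Client()
response = client.chat(
    model='qwen3.5:35b',
    messages=[{'role': 'user', 'content': 'Summarize the project.'}],
    think=True,
    format=ProjectOverview.model_json_schema(),  # documented: pass the schema
)
print(ProjectOverview.model_validate_json(response['message']['content']))
```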


@BigArty commented on GitHub (Apr 10, 2026):

Just confirmed that all problems are still present on v0.20.5


@johnnyxwan commented on GitHub (Apr 18, 2026):

qwen3.6 is here, and this issue persists for qwen3.5, qwen3.6 and gemma4 in v0.21.0


@Orbiter commented on GitHub (Apr 18, 2026):

> qwen3.6 is here, and this issue persists for qwen3.5, qwen3.6 and gemma4 in v0.21.0

yes, I measure the same; I have details about format testing in this benchmark file: https://github.com/Orbiter/project-euler-llm-benchmark/blob/main/benchmark.json

This shows that there must be a direct connection to the thinking ability of the model, because the frob models:

- frob/qwen3.5-instruct:35b
- frob/qwen3.5-instruct:122b
- frob/qwen3.5-instruct:27b
- frob/qwen3.5-instruct:9b

...are all format-enabled. So the problem is not tied to qwen3.5 itself, only to the act of disabling thinking during API access. Btw: testing of the models was done using the OpenAI API.

However, this thinking model, with thinking likewise disabled via the API, also works, which is a bit confusing:

- hf.co/mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF:Q4_K_M-no_think

@johnnyxwan commented on GitHub (Apr 23, 2026):

The same problem in gemma4 is fixed by #15678 in v0.21.1. However, the fix is scoped to gemma4 only; qwen3.5 and qwen3.6 are still affected, as expected.


@johnnyxwan commented on GitHub (Apr 24, 2026):

From my understanding, there are 3 common ways to "toggle" thinking of a model:

1. Separate model weights: separate checkpoints are created in the post-training phase
2. Triggering tokens `/think` and `/nothink`: model behaviour changes with the triggering token in the system or user prompt
3. Template control: inject an empty thinking section `<think></think>` at the start of the response to inform the model that thinking has ended / been skipped

For example, gemma4 is suggested to use method 3 alongside method 2 to control thinking.
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#integration-notes

> Model Behavior: Larger models (e.g., gemma-4-26B-A4B-it, gemma-4-31B-it) may occasionally generate a thought channel even when thinking mode is explicitly turned off. To stabilize model behavior in these edge cases, consider adding an empty thinking token to the prompt.

https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat_template.jinja#L343

```
{{- '<|turn>model\n' -}}
{%- if not enable_thinking | default(false) -%}
    {{- '<|channel>thought\n<channel|>' -}}
{%- endif -%}
```

For qwen3.5/3.6, method 3 alone is used:
https://huggingface.co/Qwen/Qwen3.5-35B-A3B/blob/main/chat_template.jinja#L149

```
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
    {{- '<think>\n\n</think>\n\n' }}
{%- else %}
    {{- '<think>\n' }}
{%- endif %}
```
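
The effect of this branch is easy to see by rendering it directly. A small sketch using jinja2, covering only the assistant-turn snippet quoted above rather than the full template (the same reasoning applies to the gemma4 template):

```python
from jinja2 import Template

# The assistant-turn branch quoted above, rendered with thinking on and off.
snippet = Template(
    "{{- '<|im_start|>assistant\n' }}"
    "{%- if enable_thinking is defined and enable_thinking is false %}"
    "{{- '<think>\n\n</think>\n\n' }}"
    "{%- else %}"
    "{{- '<think>\n' }}"
    "{%- endif %}"
)

# Thinking on: the tag is left open, so the model itself must emit '</think>'.
print(repr(snippet.render(enable_thinking=True)))
# '<|im_start|>assistant\n<think>\n'

# Thinking off: '</think>' is already part of the prompt, so the model never
# generates it -- which is exactly what defeats the masking gate.
print(repr(snippet.render(enable_thinking=False)))
# '<|im_start|>assistant\n<think>\n\n</think>\n\n'
```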

The core of a model is just a bunch of weights; it always comes down to how the code base handles the output. The current issue is caused by Ollama waiting for `</think>` before it applies the format masking. But gemma4, qwen3.5 and qwen3.6 all use or are assisted by method 3, where `</think>` is injected by the TEMPLATE and never emitted by the model.

The reason why some custom models (such as "frob/qwen3.5-instruct") work is that this `</think>` anticipation behaviour in Ollama is only enabled for models with the thinking capability flag: `slices.Contains(m.Capabilities(), model.CapabilityThinking)`
https://github.com/ollama/ollama/blob/21883571b746d9d965cf7747d5f09a5c53f389fb/server/routes.go#L2414

A quick workaround would be to create a separate custom model with the same model weights but without the thinking capability flag. But I would be happy to see Ollama support formatting without thinking for the official qwen3.5/3.6 model releases. I believe this is not just about gemma4/qwen3.5/3.6, but an essential fix for future model releases as well.
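
A sketch of that workaround, under the assumption that Ollama infers the thinking capability from the template, so a clone whose template hard-codes an already-closed `<think>` block should not get the flag. The simplified TEMPLATE below and the `capabilities` field of `/api/show` are assumptions to verify on your install:

```python
# Sketch of the workaround: same weights, but a template with no thinking
# branch. Assumption: Ollama derives the thinking capability from the
# template, so this clone should not get the flag -- verify via /api/show.
import subprocess

import requests

modelfile = '''FROM qwen3.5:35b-a3b
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
<think>

</think>

"""
'''

with open('Modelfile', 'w') as f:
    f.write(modelfile)

subprocess.run(['ollama', 'create', 'qwen3.5-nothink', '-f', 'Modelfile'],
               check=True)

# Recent versions report capabilities here; 'thinking' should be absent.
info = requests.post('http://localhost:11434/api/show',
                     json={'model': 'qwen3.5-nothink'}).json()
print(info.get('capabilities'))
```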
