[GH-ISSUE #14418] qwen3.5:35b issue #55872

Closed
opened 2026-04-29 09:50:09 -05:00 by GiteaMirror · 5 comments

Originally created by @gemlincong-dotcom on GitHub (Feb 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14418

What is the issue?

I have encountered the following issues when using qwen3.5:35b as a replacement for the qwen3:30b model in a RAG setup.
Environment: `ollama -v` reports:
ollama version is 0.17.1-rc1

  1. Thinking cannot be disabled, even though I send the API request with `think=false`.
  2. Although I instruct the model in the prompt to answer in Chinese, it still answers in English.
  3. `GGML_ASSERT(ctx->mem_buffer != NULL) failed`
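
A minimal sketch of the kind of request item 1 describes, assuming the standard Ollama `/api/chat` endpoint on a local server (the exact payload below is illustrative, not taken from the reporter's code):

```python
import json

# Build the /api/chat payload that is expected to disable thinking.
# "think": False is the documented way to ask a thinking-capable
# model to skip its reasoning block.
payload = {
    "model": "qwen3.5:35b",
    "messages": [
        {"role": "user", "content": "请用中文回答：什么是 RAG？"}
    ],
    "think": False,
    "stream": False,
}

# Serialized request body, e.g. for POST http://localhost:11434/api/chat
body = json.dumps(payload, ensure_ascii=False)
print(body)
```

The issue reported here is that, despite sending `think=false` this way, the model still emitted a thinking block.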

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-29 09:50:09 -05:00

@rick-github commented on GitHub (Feb 25, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@gemlincong-dotcom commented on GitHub (Feb 25, 2026):

> [Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.

Setting issues 2 and 3 aside, for issue 1 it seems that disabling thinking works differently in qwen3.5 than in qwen3.

(screenshot attached in the original GitHub issue)

@rocaltair commented on GitHub (Feb 26, 2026):

Have you noticed that, with the same activation parameters in thinking mode, Qwen3.5:35B-A3B runs at
only half the speed of Qwen3:30B-A3B? @gemlincong-dotcom
Is this due to the model architecture or to Ollama?

Qwen3:30B-A3B (quantization: Q4_K_M): about 61 tokens/s
Qwen3.5:35B-A3B (quantization: Q4_K_M): about 27 tokens/s

OS

Mac

Hardware

Apple silicon M4Pro with 48GB unified RAM

Ollama version

0.17.1-rc2


@gemlincong-dotcom commented on GitHub (Feb 26, 2026):

> Have you noticed that, with the same activation parameters in thinking mode, Qwen3.5:35B-A3B runs at only half the speed of Qwen3:30B-A3B? @gemlincong-dotcom Is this due to the model architecture or to Ollama?
>
> Qwen3:30B-A3B (quantization: Q4_K_M): about 61 tokens/s
> Qwen3.5:35B-A3B (quantization: Q4_K_M): about 27 tokens/s
>
> OS: Mac
>
> Hardware: Apple silicon M4 Pro with 48GB unified RAM
>
> Ollama version: 0.17.1-rc2

I have not compared generation speed, but qwen3.5:35b is 4 GB larger than qwen3:30b; maybe that affects the speed.


@ddarmon commented on GitHub (Mar 1, 2026):

I had the same issue 1 as you (i.e., `think=false` did not always work):

The root cause appears to be that the `qwen3.5:35b-a3b` model in the registry as of last week had its config set to `RENDERER qwen3-vl-thinking` / `PARSER qwen3-vl-thinking` instead of `RENDERER qwen3.5` / `PARSER qwen3.5`.

The `qwen3-vl-thinking` renderer does **not** have `emitEmptyThinkOnNoThink` enabled, so when `think=false` is set, the assistant prefill is just a bare `<|im_start|>assistant\n` with no empty `<think>\n\n</think>\n\n` block to suppress thinking. The model, since it was trained to think by default, starts thinking on its own.

The `qwen3.5` renderer/parser handles this correctly: it emits the empty think block as a prefill to tell the model to skip thinking.

**Fix:** The registry has already been updated with the correct config (`"renderer": "qwen3.5"`, `"parser": "qwen3.5"`). Re-pulling the model picks up the fix:

```
ollama pull qwen3.5:35b-a3b
```

I have confirmed that this works on v0.17.4 after re-pull.
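
The prefill difference described in that comment can be sketched as plain strings. This is a hypothetical reconstruction for illustration: the actual renderers live in Ollama's Go code, and these templates are an assumption, not the real implementation.

```python
# Hypothetical reconstruction of the two assistant prefills described
# above. The real templates are in Ollama's Go renderers; these strings
# only illustrate the behavioral difference.

# qwen3-vl-thinking-style prefill: no empty think block is emitted, so
# a model trained to think by default starts a <think> section anyway.
prefill_vl_thinking = "<|im_start|>assistant\n"

# qwen3.5-style prefill with think=false: an empty think block is
# emitted up front, signalling the model to skip its reasoning.
prefill_qwen35 = "<|im_start|>assistant\n<think>\n\n</think>\n\n"

print(repr(prefill_vl_thinking))
print(repr(prefill_qwen35))
```

Under this reading, re-pulling the model only swaps which of these prefills the server renders when `think=false` is set.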

Reference: github-starred/ollama#55872