[GH-ISSUE #14662] why the ollama get response very slow? (QWEN3.5 35B A3B) #35254

Open
opened 2026-04-22 19:38:28 -05:00 by GiteaMirror · 18 comments

Originally created by @acer1204 on GitHub (Mar 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14662

What is the issue?

My hardware is an NVIDIA DGX Spark, running Ollama 0.17.7 and the latest version of OpenWebUI.

During testing, I noticed that the first prompt responds quickly. OpenWebUI immediately shows “think” and then generates the response.

However, when I ask a second question, it takes 30 to 60 seconds before the frontend webpage shows “think” and the response begins to appear.

This issue does not occur when I run GPT-OSS 120B.

Image: https://github.com/user-attachments/assets/32e1f49c-98e7-437f-b9f8-f38d8a165e8c

I tested this on my RTX 3090 and hit the same problem.

Watching nvidia-smi, CUDA utilization shows 100%, but nothing is being generated.

If I restart Ollama, the first question responds quickly again.
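To narrow down whether the delay comes from Ollama itself or from Open WebUI, it can help to time two consecutive requests against the Ollama API directly, bypassing the frontend. A minimal sketch, assuming the default endpoint at http://localhost:11434 and a locally pulled Qwen3.5 tag (the model name below is a placeholder; use whatever `ollama list` shows):

```python
# Time two consecutive /api/generate calls directly against Ollama,
# bypassing Open WebUI, to see whether the second-request delay reproduces.
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen3.5:35b-a3b"  # placeholder tag

def timed_request(prompt: str) -> float:
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

print(f"first prompt:  {timed_request('Hello'):.1f}s")
print(f"second prompt: {timed_request('What is 2 + 2?'):.1f}s")
```

If both calls return quickly here but the second message still stalls in the browser, the delay is most likely added by the frontend's background requests rather than by Ollama itself.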

Relevant log output


OS

Windows, Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.7

GiteaMirror added the bug label 2026-04-22 19:38:28 -05:00

@BeatWolf commented on GitHub (Mar 6, 2026):

I'm currently trying to debug the same thing with qwen3.5:2b. The time to first token is huge, but I don't see any model loading in the logs. What is also strange: this is with Open WebUI, and even the second message in the chat is slow, not only the first. So something seems wrong.

@VistritPandey commented on GitHub (Mar 6, 2026):

+1, same issue on M3 Ultra. It takes about twice as long to generate compared to qwen3-vl:8b, and it generates tokens at roughly 40~50 tokens/sec (for 35b:a3b, 9b, and 4b alike), which ideally shouldn't be the case considering the size difference.

@SquareSphere commented on GitHub (Mar 7, 2026):

I have the same issue with the combination of Open WebUI and Ollama using Qwen 3.5 9b, but if I run it locally through the Ollama UI it doesn't have the same issue.

@leegimblett commented on GitHub (Mar 9, 2026):

I have a similar experience with qwen3.5:27b. It works great with the Ollama CLI, with GPU use falling to 0 soon after the output is finished. When using the API (Open WebUI), the first prompt is answered quickly but the second one hangs for a very long time; the GPU flatlines at around 50% after the prompt is finished and stays there for a long time afterwards.

@leegimblett commented on GitHub (Mar 9, 2026):

I think my experience is Open WebUI related. I don't see it with other API clients, and when I turn off thinking in WebUI the issue disappears. Open WebUI sends requests in the background after the LLM completes its main response (title generation etc.). The Qwen3.5 27B model does a lot of thinking, so those requests take a long time. You can change the model used by the background tasks (Admin Settings → Interface), which can be a lot smaller than the main model. I used a Qwen3.5 0.8B and it still took a long time until I also turned off that model's thinking; then the problem went away.

@luke2023 commented on GitHub (Mar 9, 2026):

I am experiencing the same issue on a 5090 + Windows 11 Pro + Open WebUI.

@rawflecat commented on GitHub (Mar 9, 2026):

> I think my experience is Open WebUI related. I don't see it with other API clients, and when I turn off thinking in WebUI the issue disappears. Open WebUI sends requests in the background after the LLM completes its main response (title generation etc.). The Qwen3.5 27B model does a lot of thinking, so those requests take a long time. You can change the model used by the background tasks (Admin Settings → Interface), which can be a lot smaller than the main model. I used a Qwen3.5 0.8B and it still took a long time until I also turned off that model's thinking; then the problem went away.

Just went through this over the weekend with Open WebUI. Ended up spinning up a VM to run a small version of Gemma for the background tasks. It helped tremendously, but the overhead of having to ship those requests to another model is a pain. Doing it on the same box is out of the question; I'm not going to unload and reload a 40 GB+ model just to generate a chat title in under 60 seconds.

@BeatWolf commented on GitHub (Mar 9, 2026):

I think it's multiple issues: WebUI making unnecessary thinking calls on large models, but also likely something wrong with qwen3.5 and/or Ollama that produces a very high number of thinking tokens for even the simplest requests.
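One way to sanity-check the "too many thinking tokens" half of this is to send a trivial prompt straight to Ollama's chat endpoint with thinking enabled and look at how much of the output is reasoning. A rough sketch, assuming a thinking-capable Qwen3.5 tag is installed and that the non-streaming /api/chat response exposes the reasoning text in `message.thinking` plus the usual `eval_count`/`eval_duration` counters (field names may vary between Ollama versions):

```python
# Rough measurement of how much a thinking model reasons for a trivial prompt.
# The "think" request flag and the "thinking" response field follow Ollama's
# documented chat API for thinking models; adjust the tag to your install.
import json
import urllib.request

body = json.dumps({
    "model": "qwen3.5:35b-a3b",  # placeholder tag
    "messages": [{"role": "user", "content": "Hi"}],
    "think": True,               # explicitly request the reasoning phase
    "stream": False,
}).encode()

req = urllib.request.Request("http://localhost:11434/api/chat", data=body,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read())

msg = data.get("message", {})
print("thinking chars:", len(msg.get("thinking", "")))
print("answer chars:  ", len(msg.get("content", "")))
print("output tokens: ", data.get("eval_count"))
print("generation s:  ", data.get("eval_duration", 0) / 1e9)  # ns -> seconds
```

If the thinking text dwarfs the answer for a prompt like "Hi", the slowdown is mostly the model's reasoning behaviour rather than the server mishandling the request.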

@luke2023 commented on GitHub (Mar 9, 2026):

I am experiencing an abnormally long time to first token on all Qwen 3.5 models (the second and third requests are even worse); other models are working fine.

@luke2023 commented on GitHub (Mar 9, 2026):

Can confirm it only happens on the Qwen3.5 series. Qwen3 and GPT-OSS work fine.

@Megara18 commented on GitHub (Apr 3, 2026):

Ollama + Open WebUI: can confirm the same problem with the new Gemma4. Slow responses and long thinking times even for a simple "Hi".

@beykansen commented on GitHub (Apr 5, 2026):

Same issue with my ASUS GX10. qwen3.5 and gemma4 both do this.

@Joni1717 commented on GitHub (Apr 6, 2026):

Same issue with qwen3.5 35b and 27b on 2x RTX 4090.

@Megara18 commented on GitHub (Apr 16, 2026):

Image: https://github.com/user-attachments/assets/d08967a7-4d69-477b-a937-a05707463302

It happens with Qwen3.6 too. 40 seconds of thinking just to say "Hola" (Hello) is pretty insane. It prevents me from trying frontier models with Ollama; nearly all new models have this problem.

And it's not about hardware; my server can run 200B+ models at high speed.

Using the latest Ollama on Docker + Vulkan.

@Lemondsky commented on GitHub (Apr 21, 2026):

Same issue. Ollama + Open WebUI + Qwen (no matter the model) takes around 8 minutes before it even starts thinking. Prompt processing and token generation are extremely fast, but every message takes 8-9 minutes before it even shows up in the Ollama logs and starts. Without Open WebUI, calling the Ollama API directly, the response is fast with nearly no processing time. Something seems extremely bugged in this combination with Qwen, which makes Open WebUI useless for me at the moment.

@Joni1717 commented on GitHub (Apr 22, 2026):

For me, the issue wasn’t related to Ollama. It’s caused by qwen3.5/3.6 sometimes taking a very long time to “think.” Open WebUI generates tags, titles, and follow-up questions after the initial response, and each of these is a separate request using the same model. Since the model spends a lot of time on each request, the delays quickly add up (sometimes around a minute per request).

Here’s how I fixed it:
Go to Admin Settings → Models and select qwen3.5/3.6. Create a clone of the model. In the clone’s settings, disable “Thinking” (Ollama Thinking).
Then go to Admin Settings → Interface and select the cloned qwen3.6 model (with Thinking disabled).

Result:
Your main responses are still generated with the original qwen3.6 model, while tags, titles, and follow-up questions are handled by the faster version without Thinking—making the whole process much quicker.
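For anyone scripting against Ollama directly instead of going through Open WebUI's clone feature, the same effect should be achievable by passing the think flag on the chat endpoint for the background-style requests. A small sketch under that assumption (placeholder model tag and prompt):

```python
# API-level equivalent of the "Thinking disabled" clone: send the lightweight
# background-style request (e.g. title generation) with thinking turned off.
# Model tag and prompt are placeholders; "think" follows Ollama's chat API.
import json
import urllib.request

body = json.dumps({
    "model": "qwen3.6",  # placeholder task-model tag
    "messages": [{"role": "user",
                  "content": "Write a short title for the previous chat."}],
    "think": False,      # skip the reasoning phase for quick auxiliary calls
    "stream": False,
}).encode()

req = urllib.request.Request("http://localhost:11434/api/chat", data=body,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])
```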

@Megara18 commented on GitHub (Apr 22, 2026):

> For me, the issue wasn’t related to Ollama. It’s caused by qwen3.5/3.6 sometimes taking a very long time to “think.” Open WebUI generates tags, titles, and follow-up questions after the initial response, and each of these is a separate request using the same model. Since the model spends a lot of time on each request, the delays quickly add up (sometimes around a minute per request).
>
> Here’s how I fixed it: Go to Admin Settings → Models and select qwen3.5/3.6. Create a clone of the model. In the clone’s settings, disable “Thinking” (Ollama Thinking). Then go to Admin Settings → Interface and select the cloned qwen3.6 model (with Thinking disabled).
>
> Result: Your main responses are still generated with the original qwen3.6 model, while tags, titles, and follow-up questions are handled by the faster version without Thinking—making the whole process much quicker.

No, it's not that. I already have specific models set up for those minor tasks, and this is happening specifically with frontier models like Gemma and Qwen.

Other tested 120B and 220B models do not have this problem.

@Lemondsky commented on GitHub (Apr 22, 2026):

> For me, the issue wasn’t related to Ollama. It’s caused by qwen3.5/3.6 sometimes taking a very long time to “think.” Open WebUI generates tags, titles, and follow-up questions after the initial response, and each of these is a separate request using the same model. Since the model spends a lot of time on each request, the delays quickly add up (sometimes around a minute per request).
>
> Here’s how I fixed it:
> Go to Admin Settings → Models and select qwen3.5/3.6. Create a clone of the model. In the clone’s settings, disable “Thinking” (Ollama Thinking).
> Then go to Admin Settings → Interface and select the cloned qwen3.6 model (with Thinking disabled).
>
> Result:
> Your main responses are still generated with the original qwen3.6 model, while tags, titles, and follow-up questions are handled by the faster version without Thinking—making the whole process much quicker.

To me it's definitely related to Open WebUI. I tried some other tools and none has this issue; it only happens when I use Open WebUI with Qwen and Ollama.
There is no token generation or anything happening for the ~8 minutes. I can't see any communication between Qwen/Ollama and Open WebUI in the logs, and the graphics cards don't do anything during that time, hovering around 28 watts each. After the ~8 minutes they all go up to ~160 watts and start prompt processing and generating/thinking.
Gemma and everything else works fine.
I just use another interface for Qwen now until it's fixed.
Otherwise, my hardware is pretty exotic; maybe something is colliding or buggy on my system with this specific combination. I can't tell for sure.

Reference: github-starred/ollama#35254