[GH-ISSUE #11320] issue: Text streaming will stop if thinking takes longer than 5 minutes #16183

Closed
opened 2026-04-19 22:11:41 -05:00 by GiteaMirror · 24 comments

Originally created by @knguyen298 on GitHub (Mar 6, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/11320

Check Existing Issues

  • I have searched the existing issues and discussions.

Installation Method

Docker

Open WebUI Version

v0.5.20

Ollama Version (if applicable)

No response

Operating System

Ubuntu 22.04

Browser (if applicable)

Firefox 135.0.1

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have checked the browser console logs.
  • I have checked the Docker container logs.
  • I have listed steps to reproduce the bug in detail.

Expected Behavior

Thinking should still continue, and then proceed to actual generation.

Actual Behavior

Thinking will stop streaming after 5 minutes. GPU utilization indicates that generation is still occurring.

Steps to Reproduce

  1. Use a reasoning model (I used QwQ 32B, q6_k_l).
  2. Enter a prompt that will require the model to think for an extended period of time. I used:

> Create a Flappy Bird game in Python. You must include these things:
> You must use pygame.
> The background color should be randomly chosen and is a light shade. Start with a light blue color.
> Pressing SPACE multiple times will accelerate the bird.
> The bird’s shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
> Place on the bottom some land colored as dark brown or yellow chosen randomly.
> Make a score shown on the top right side. Increment if you pass pipes and don’t hit them.
> Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
> When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.

  3. Wait 5 minutes. Thinking will stop after 5 minutes.

Logs & Screenshots

![Screenshot](https://github.com/user-attachments/assets/5f1de96c-7d8a-41f4-85bf-963e34b95e9d)

Screenshot from shortly after generation stopped. Generation started at 12:53 PM, and stopped at 12:58 PM. GPU usage was still 70%+, indicating the LLM was still generating data.

No messages showed up in the logs for Open-WebUI.

Additional Information

  • I am using llama.cpp through llama-swap. This is added to Open-WebUI as an OpenAI-compatible endpoint.
  • I verified that generation is still working by running the same prompt through llama.cpp's built-in web GUI. The model completed thinking and generated code, and did not stop at the 5-minute mark.
  • Thinking works as intended if it does not take 5 minutes: for a less complex prompt, thinking completes and generation starts and finishes as expected.
  • `AIOHTTP_CLIENT_TIMEOUT` is configured in my Docker Compose environment for Open-WebUI. I initially had it set to `''`, but I also tested it with `' '`. I also confirmed `Keep Alive` in the GUI was set to `-1`, and I tested with `Keep Alive` set to `1h` with the same result. Interestingly, I don't see `AIOHTTP_CLIENT_TIMEOUT` being set in the logs during startup.
  • I tested with ENV set to both dev and prod.
  • Context is set to `32768` and `num_predict` is set to `-1`, so it does not seem to be the model stopping generation on its own. I tested over half a dozen times, and every run stopped at 5 minutes.
  • As indicated by the "Stop" icon, Open-WebUI knows generation was not completed.
  • The 5-minute timer starts from the moment I hit the `Send` button. If the model needs time to load, that time is included in the 5 minutes.
GiteaMirror added the bug label 2026-04-19 22:11:41 -05:00

@JulianSchwabCommits commented on GitHub (Mar 6, 2025):

I have the same issue


@rgaricano commented on GitHub (Mar 6, 2025):

It seems to be a timeout in llama-swap; maybe you can try setting llama-swap with a bigger ttl (600 or more). Link to the llama-swap README & configuration: https://github.com/mostlygeek/llama-swap/blob/62275e078dfce6c7ad7322dcd8b14d1c343f28d6/README.md?plain=1#L89


@knguyen298 commented on GitHub (Mar 6, 2025):

> It seems to be a timeout in llama-swap; maybe you can try setting llama-swap with a bigger ttl (600 or more). Link to the llama-swap README & configuration: https://github.com/mostlygeek/llama-swap/blob/62275e078dfce6c7ad7322dcd8b14d1c343f28d6/README.md?plain=1#L89

`ttl` is not configured, meaning it falls back to the default value of `0` (never unload). It also doesn't have this issue when using the built-in llama.cpp GUI, even when loaded through llama-swap.


@rgaricano commented on GitHub (Mar 6, 2025):

I only see a timeout of 300s at https://github.com/open-webui/open-webui/blob/3b70cd64d7fa6902e8c79cf8dcbf3c7e84cf704b/backend/open_webui/env.py#L398
It's set to 300 by default if the env var is non-empty but not a valid number. I don't think that is the case here, but you can try setting it bigger, e.g. 600, to safely rule out that this is the problem.

Another timeout env is `AIOHTTP_CLIENT_TIMEOUT_OPENAI_MODEL_LIST`, for requests to OpenAI & Ollama; it is set to None (or 5 on error), but... aiohttp's client default seems to be 300! (https://docs.aiohttp.org/en/stable/client_reference.html)

I would try configuring those env variables, to see how it reacts... and if you manage to solve it, please let us know.

(Sorry I can't help more; I don't know the code well enough, and I can't reproduce your problem.)
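[Editor's note] A minimal sketch of the parsing pattern at that env.py line, paraphrased from the linked revision rather than quoted verbatim:

```python
import os

# Paraphrase of the AIOHTTP_CLIENT_TIMEOUT handling in backend/open_webui/env.py:
# an unset variable defaults to "", which maps to None (no timeout); anything
# non-empty that fails int() falls back to 300 seconds -- the 5-minute cutoff.
raw = os.environ.get("AIOHTTP_CLIENT_TIMEOUT", "")
if raw == "":
    AIOHTTP_CLIENT_TIMEOUT = None  # no total timeout
else:
    try:
        AIOHTTP_CLIENT_TIMEOUT = int(raw)  # e.g. "1200" -> 1200 seconds
    except ValueError:
        AIOHTTP_CLIENT_TIMEOUT = 300  # unparseable values fall back to 5 minutes
```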


@knguyen298 commented on GitHub (Mar 6, 2025):

So I did some further testing:

  • Setting `AIOHTTP_CLIENT_TIMEOUT` to `1200` seems to fix the issue. Streaming continues past the 5-minute mark, but I stopped it before it finished.
  • Setting it to `""` causes it to stop after 5 minutes again. I confirmed via Portainer that the value is set to `""` in the container.

Seems to me that the blank value is not being interpreted correctly.


@knguyen298 commented on GitHub (Mar 6, 2025):

I took a closer look at the environment-variable Python code and saw that `AIOHTTP_CLIENT_TIMEOUT` is set to `""` if the environment variable wasn't defined in the OS. So I removed it from the environment section in my Docker Compose file.

Opening a shell into the container and starting a Python interpreter, I confirmed that `AIOHTTP_CLIENT_TIMEOUT == ""` now evaluates to true. The connection no longer closes after 5 minutes.

Either the documentation needs to be updated to state "only define the value if you want a timeout", and/or the code needs to be updated to handle a defined empty string properly.
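[Editor's note] A quick way to reproduce that check, assuming a shell inside the container:

```python
# Run inside the container, e.g. via `docker exec -it open-webui python3`.
import os

# If the variable was never defined in the compose file, get() returns the default:
print(repr(os.environ.get("AIOHTTP_CLIENT_TIMEOUT", "")))
# -> ''     (unset: parsed as None, i.e. no timeout)
#
# If the compose file passed AIOHTTP_CLIENT_TIMEOUT='' with literal quotes, you
# would instead see "''" -- a two-character string that fails int() and
# triggers the 300-second fallback.
```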


@rgaricano commented on GitHub (Mar 6, 2025):

Yes, and an entry in Troubleshooting about this sort of thing would help.

For reference:
https://docs.aiohttp.org/en/stable/client_quickstart.html#aiohttp-client-timeouts

"...
aiohttp.client Timeouts
Timeout settings are stored in ClientTimeout data structure.

By default aiohttp uses a total 300 seconds (5min) timeout, it means that the whole operation should finish in 5 minutes. In order to allow time for DNS fallback, the default sock_connect timeout is 30 seconds.
..."


@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):

But does it ever stop thinking? Mine after a while has 0% GPU usage, but openwebui reports "thinking" indefinitely (without ever getting out of the "thinking" section)...


@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):

Maybe it has something to do with QwQ-32B and how it handles thinking? I see a whole special section on Hugging Face on how to run it properly: https://huggingface.co/Qwen/QwQ-32B#usage-guidelines


@knguyen298 commented on GitHub (Mar 6, 2025):

@rgaricano I don't think the issue is with how the variable is being passed to aiohttp: this seems fine to me.
https://github.com/open-webui/open-webui/blob/3b70cd64d7fa6902e8c79cf8dcbf3c7e84cf704b/backend/open_webui/routers/openai.py#L678

I think the issue is how the Python variable is set when the OS environment variable uses single or double quotes to indicate an empty string. After some further testing:

  • `AIOHTTP_CLIENT_TIMEOUT=''` returns `AIOHTTP_CLIENT_TIMEOUT="''"` in Python.
  • `AIOHTTP_CLIENT_TIMEOUT=""` returns `AIOHTTP_CLIENT_TIMEOUT='""'` in Python.

Only by not defining `AIOHTTP_CLIENT_TIMEOUT` will it be set to `""` and then correctly set to `None`.
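[Editor's note] A small demonstration of why those values misbehave, feeding each observed string through the same parsing logic sketched above (illustrative, not the project's code):

```python
# Each raw string below is what Python actually receives for the corresponding
# compose setting; only the truly-unset case maps to "no timeout".
for raw in ("", "''", '""', "1200"):
    if raw == "":
        timeout = None             # unset -> no total timeout
    else:
        try:
            timeout = int(raw)     # numeric string -> seconds
        except ValueError:
            timeout = 300          # literal quotes -> 5-minute fallback
    print(f"{raw!r:8} -> {timeout}")
# ''       -> None
# "''"     -> 300
# '""'     -> 300
# '1200'   -> 1200
```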

@knguyen298 commented on GitHub (Mar 6, 2025):

@AlbertoSinigaglia

> But does it ever stop thinking? Mine after a while has 0% GPU usage, but openwebui reports "thinking" indefinitely (without ever getting out of the "thinking" section)...
>
> Maybe it has something to do with QwQ-32B and how it handles thinking? I see a whole special section on Hugging Face on how to run it properly: https://huggingface.co/Qwen/QwQ-32B#usage-guidelines

Mine runs fine, using the Q6_K_L quant from bartowski via llama.cpp. Check your sampling parameters?


@rgaricano commented on GitHub (Mar 6, 2025):

Alberto, I don't think your problem is the <think> issue; this problem gives you an incomplete response, but you still get a response.
If you get no response at all, it's something else: a closed connection that wasn't notified, a proxy timeout, ... Do you have any logs of the error? What is your system config?


@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):

@rgaricano no error, nothing displayed... If I read the "thinking", it clearly gets to a final point of the reasoning chain but never spits out a response.


@knguyen298 commented on GitHub (Mar 6, 2025):

@AlbertoSinigaglia this looks to be a problem with the model published in the Ollama library; it is unrelated to this issue and is not a problem with Open-WebUI.
https://github.com/ollama/ollama/issues/9523#issuecomment-2703880818

Use a different GGUF; you can download and import HuggingFace GGUFs that are not in the Ollama library.


@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):

@knguyen298

> @AlbertoSinigaglia this looks to be a problem with the model published in the Ollama library; it is unrelated to this issue and is not a problem with Open-WebUI. ollama/ollama#9523 (comment)
>
> Use a different GGUF; you can download and import HuggingFace GGUFs that are not in the Ollama library.

uhhhh that's a nice catch, thanks

if you don't mind, can you give me a pointer for that "download and import HuggingFace GGUFs that are not in the Ollama library"? I've never done it (EDIT: see my newer comment below)


@rgaricano commented on GitHub (Mar 6, 2025):

OK, yes, I was reading about this before; there were some other issues reported (https://github.com/open-webui/open-webui/issues/11259), but it's not the Ollama model, it's the model provider itself,
and the solution is on HF: https://huggingface.co/Qwen/QwQ-32B/discussions/4

By the way, check the Max Tokens (num_predict) param of the model, in case you have a small one and it is cutting off your response.


@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):

OK, never mind; it seems like `ollama pull hf.co/bartowski/QwQ-32B-Preview-GGUF:Q8_0` did the trick... Any suggestion on the quantization? I have an A6000, so 48 GB of VRAM, but I'm not sure a Q8 that uses 32 GB is worth it over a Q4 that uses half of that.

EDIT: this model instead doesn't think at all lol, it just answers straight away


@knguyen298 commented on GitHub (Mar 6, 2025):

@AlbertoSinigaglia Did you set the sampling parameters? As for quant, I used Q6_K_L from bartowski.


@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):

@knguyen298 yup, but I feel I messed up something, because the Q8 still doesn't want to "think"... I'm downloading the Q6_K_L version to see if anything changes, but I'm pretty sure I'm the one who messed up some sampling parameter.

![Screenshot](https://github.com/user-attachments/assets/4db3e6d4-2440-4282-a2bb-5d68b41057d6)

@knguyen298 commented on GitHub (Mar 6, 2025):

@AlbertoSinigaglia try `Context Length = 32768` and `num_predict = -1` (you'll have to drag the slider to the left to get to `-2`, and then change the 2 to a 1).
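[Editor's note] For reference, a hedged sketch of the API-level equivalent of those sliders, sent straight to Ollama's /api/generate endpoint; the model tag and prompt are placeholders, and the `requests` package plus a local Ollama on its default port are assumed:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwq:32b",        # placeholder tag -- use whichever quant you pulled
        "prompt": "Why is the sky blue?",
        "options": {
            "num_ctx": 32768,      # Context Length
            "num_predict": -1,     # -1 = generate until the model stops on its own
        },
        "stream": False,           # return one JSON object instead of a stream
    },
)
print(resp.json()["response"])
```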


@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):

@knguyen298 this made me laugh

![Screenshot](https://github.com/user-attachments/assets/120e9ebc-e38e-4794-916b-fe3bbe3b66c1)

...using these settings:

![Screenshot](https://github.com/user-attachments/assets/64aff0fb-8730-4fd5-a194-f43b47675d67)

So now, I have the original QwQ-32B that only reasons, and the quantized versions that do not want to reason at all lol (not even with your prompt...)


@rgaricano commented on GitHub (Mar 6, 2025):

@AlbertoSinigaglia, that model (https://huggingface.co/unsloth/QwQ-32B-Preview-GGUF) has 64 layers & I think your GPU can fit all of them; you can set num_gpu (Ollama) to 64
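[Editor's note] For illustration, the same option expressed through the `ollama` Python client (assuming that package is installed and an Ollama daemon is running locally; the model tag is the one pulled earlier in the thread):

```python
import ollama  # assumes the `ollama` Python package and a local Ollama daemon

resp = ollama.generate(
    model="hf.co/bartowski/QwQ-32B-Preview-GGUF:Q8_0",  # tag pulled earlier in the thread
    prompt="hello",
    options={"num_gpu": 64},  # number of layers to offload to the GPU
)
print(resp["response"])
```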


@AlbertoSinigaglia commented on GitHub (Mar 7, 2025):

@rgaricano

> @AlbertoSinigaglia, that model (https://huggingface.co/unsloth/QwQ-32B-Preview-GGUF) has 64 layers & I think your GPU can fit all of them; you can set num_gpu (Ollama) to 64

Thanks, done, though it seems to be ignored (I guess): https://www.reddit.com/r/ollama/comments/1d29wdx/what_happen_with_parameter_num_gpu/


@rgaricano commented on GitHub (Mar 7, 2025):

OK, yes, I see they are now managed dynamically; the vars gpu_num & NumGPU are there, but assigned -1 on the runner (https://github.com/ollama/ollama/blob/e2252d0fc6ea5c410b1ac4fa0a722beda78b3431/api/types.go#L616) and estimated here (https://github.com/ollama/ollama/blob/e2252d0fc6ea5c410b1ac4fa0a722beda78b3431/llm/memory.go#L23).

Reference: github-starred/open-webui#16183