[GH-ISSUE #11320] issue: Text streaming will stop if thinking takes longer than 5 minutes #31712
Originally created by @knguyen298 on GitHub (Mar 6, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/11320
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.5.20
Ollama Version (if applicable)
No response
Operating System
Ubuntu 22.04
Browser (if applicable)
Firefox 135.0.1
Confirmation
Expected Behavior
Thinking should still continue, and then proceed to actual generation.
Actual Behavior
Thinking will stop streaming after 5 minutes. GPU utilization indicates that generation is still occurring.
Steps to Reproduce
Logs & Screenshots
Screenshot from shortly after generation stopped. Generation started at 12:53 PM, and stopped at 12:58 PM. GPU usage was still 70%+, indicating the LLM was still generating data.
No messages showed up in the logs for Open-WebUI.
Additional Information
- AIOHTTP_CLIENT_TIMEOUT is configured in the Docker Compose environment for Open-WebUI. I initially had it set to '', but I tested it with ' '. I also confirmed Keep Alive in the GUI was set to -1; I have also tested with Keep Alive set to 1h with the same result. Interestingly, I don't see AIOHTTP_CLIENT_TIMEOUT being set in the logs during startup.
- ENV set to both dev and prod.
- Context length is set to 32768, and num_predict is set to -1, so it does not seem to be an issue with the model stopping generation. I tested over half a dozen times, and they all stopped generation at 5 minutes.
- The 5 minutes is measured from when I press the Send button. If the model needs time to load, that time will be included in the 5 minutes.
@JulianSchwabCommits commented on GitHub (Mar 6, 2025):
I have the same issue
@rgaricano commented on GitHub (Mar 6, 2025):
It seems that it's a timeout in llama-swap; maybe you can try setting llama-swap with a bigger ttl (600 or more). Link to the llama-swap readme & configuration:
62275e078d/README.md (L89)
@knguyen298 commented on GitHub (Mar 6, 2025):
ttl is not configured, meaning it goes to the default value of 0 (never unload). It also doesn't have this issue when using the built-in llama.cpp GUI, even when loaded through llama-swap.
@rgaricano commented on GitHub (Mar 6, 2025):
I only see a timeout of 300s on
3b70cd64d7/backend/open_webui/env.py (L398)
It falls back to 300 by default if the env var is set to something other than "" that isn't a number. I don't think that is the case here, but you can try setting it higher, e.g. 600, to narrow things down and safely rule out that this is the problem.
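For reference, a minimal sketch of the parsing behaviour described here, reconstructed from this thread rather than copied from env.py, so names and branches are assumptions:

```python
import os

# Reconstructed sketch (not verbatim env.py): how AIOHTTP_CLIENT_TIMEOUT is
# reportedly parsed. Undefined/empty -> None (no timeout); a defined value
# that is not a valid number -> fall back to 300 seconds (the 5-minute cutoff).
AIOHTTP_CLIENT_TIMEOUT = os.environ.get("AIOHTTP_CLIENT_TIMEOUT", "")

if AIOHTTP_CLIENT_TIMEOUT == "":
    AIOHTTP_CLIENT_TIMEOUT = None
else:
    try:
        AIOHTTP_CLIENT_TIMEOUT = int(AIOHTTP_CLIENT_TIMEOUT)
    except ValueError:
        AIOHTTP_CLIENT_TIMEOUT = 300
```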
Another timeout env var is AIOHTTP_CLIENT_TIMEOUT_OPENAI_MODEL_LIST, for requests to OpenAI & Ollama; it's set to None (or 5 on error), but... the aiohttp client default seems to be 300! ( https://docs.aiohttp.org/en/stable/client_reference.html )
I would try configuring those env variables, to see how it reacts... and if you manage to solve it, please let us know.
(sorry I can't help more, I don't know the code well enough and I can't reproduce your problem)
@knguyen298 commented on GitHub (Mar 6, 2025):
So I did some further testing:
- Setting AIOHTTP_CLIENT_TIMEOUT to 1200 seems to fix the issue. Streaming continues past the 5 minute mark, but I stopped it before it finished.
- Setting it to "" causes it to stop after 5 minutes again. I confirmed via Portainer that the value is set to "" in the container.
Seems to me that the blank value is not being interpreted correctly.
@knguyen298 commented on GitHub (Mar 6, 2025):
I took a closer look at the environment variable Python code, and saw that AIOHTTP_CLIENT_TIMEOUT is set to "" if the environment variable wasn't defined in the OS. So I removed it from the environment section in my Docker Compose file.
Opening a shell into the container and starting a Python terminal, I confirmed that AIOHTTP_CLIENT_TIMEOUT == "" now evaluates to true. The connection no longer closes after 5 minutes.
Either the documentation needs to be updated to state "Only define the value if you want a timeout", and/or the code needs to be updated to work properly with a defined empty string.
@rgaricano commented on GitHub (Mar 6, 2025):
Yes, and an entry in Troubleshooting about this sort of thing would help too.
For reference:
https://docs.aiohttp.org/en/stable/client_quickstart.html#aiohttp-client-timeouts
"...
aiohttp.client Timeouts
Timeout settings are stored in ClientTimeout data structure.
By default aiohttp uses a total 300 seconds (5min) timeout, it means that the whole operation should finish in 5 minutes. In order to allow time for DNS fallback, the default sock_connect timeout is 30 seconds.
..."
@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):
But does it ever stop thinking? Mine after a while has 0% GPU usage, but Open WebUI reports "thinking" indefinitely (without ever getting out of the "thinking" section)...
@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):
Maybe it has something to do with QwQ-32B and how it handles thinking? I see a whole special section on Hugging Face about how to run it properly: https://huggingface.co/Qwen/QwQ-32B#usage-guidelines
@knguyen298 commented on GitHub (Mar 6, 2025):
@rgaricano I don't think the issue is with how the variable is being passed to aiohttp: this seems fine to me.
3b70cd64d7/backend/open_webui/routers/openai.py (L678)
I think the issue is how the Python variable is set from the OS environment variable when that variable is given single quotes or double quotes to indicate an empty string. After some further testing:
- AIOHTTP_CLIENT_TIMEOUT='' returns AIOHTTP_CLIENT_TIMEOUT = "''" in Python.
- AIOHTTP_CLIENT_TIMEOUT="" returns AIOHTTP_CLIENT_TIMEOUT = '""' in Python.
Only by not defining AIOHTTP_CLIENT_TIMEOUT will it be set to "" and then correctly set to None.
@knguyen298 commented on GitHub (Mar 6, 2025):
@AlbertoSinigaglia
Mine runs fine, using the Q6_K_L quant from bartowski via llama.cpp. Check your sampling parameters?
@rgaricano commented on GitHub (Mar 6, 2025):
Alberto, I don't think your problem is the <think> issue; that problem gives you an incomplete response, but you still get a response.
If you get no response at all, it's something else: a closed connection that wasn't notified, a proxy timeout, ... Do you have any logs of the error? What is your system config?
@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):
@rgaricano no error, nothing displayed... If I read the "thinking", it clearly gets to a final point of the reasoning chain, but it never spits out a response
@knguyen298 commented on GitHub (Mar 6, 2025):
@AlbertoSinigaglia this looks like an issue with the model published in the Ollama library; it's not related to this issue and isn't a problem with Open-WebUI.
https://github.com/ollama/ollama/issues/9523#issuecomment-2703880818
Use a different GGUF; you can download and import HuggingFace GGUFs that are not in the Ollama library.
@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):
@knguyen298
uhhhh that's a nice catch, thanks
if you don't mind, can you give me a pointer on how to "download and import HuggingFace GGUFs that are not in the Ollama library"? I've never done it (EDIT: see the new comment down below)
@rgaricano commented on GitHub (Mar 6, 2025):
OK, yes, I was reading about this before; there were some other issues reported (https://github.com/open-webui/open-webui/issues/11259), but it's not specific to the Ollama model, it's the model provider itself,
and the solution is on HF https://huggingface.co/Qwen/QwQ-32B/discussions/4
By the way, check the Max Tokens (num_predict) param of the model, in case you have a small value and it is cutting off your response
@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):
ok nvm, it seems like
ollama pull hf.co/bartowski/QwQ-32B-Preview-GGUF:Q8_0 did the trick... Any suggestion on the quantization? I have an A6000, so I have 48 GB of VRAM, but I'm not sure that using a Q_8 that takes 32 GB is worth it over the Q_4 that takes half of it.
EDIT: this model instead doesn't think at all lol, it just answers straight away
@knguyen298 commented on GitHub (Mar 6, 2025):
@AlbertoSinigaglia Did you set the sampling parameters? As for quant, I used Q6_K_L from bartowski.
@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):
@knguyen298 yup, but I feel I messed something up, because the Q_8 still doesn't want to "think"... I'm downloading the Q6_K_L version to see if anything changes, but I'm pretty sure I'm the one who messed up some sampling parameter
@knguyen298 commented on GitHub (Mar 6, 2025):
@AlbertoSinigaglia try
Context Length = 32768 and num_predict = -1 (you'll have to drag the slider to the left to get to -2, and then change the 2 to a 1).
@AlbertoSinigaglia commented on GitHub (Mar 6, 2025):
@knguyen298 this made me laugh
...using these settings:
So now, I have the original QwQ-32B that only reasons, and the quantized versions that do not want to reason at all lol (not even with your prompt...)
@rgaricano commented on GitHub (Mar 6, 2025):
@AlbertoSinigaglia, that model has 64 layers and I think your GPU can fit all of them; you can set num_gpu (Ollama) to 64
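A minimal sketch of passing these parameters with the ollama Python client; the model tag is only an example based on the quants mentioned in this thread, and num_ctx, num_predict and num_gpu mirror the values suggested above:

```python
import ollama

# Example request with the parameters discussed in this thread; the model tag
# is an assumption, not a recommendation.
response = ollama.chat(
    model="hf.co/bartowski/QwQ-32B-GGUF:Q6_K_L",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    options={
        "num_ctx": 32768,    # context length suggested above
        "num_predict": -1,   # no cap on generated tokens
        "num_gpu": 64,       # offload all 64 layers to the GPU
    },
)
print(response["message"]["content"])
```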
@AlbertoSinigaglia commented on GitHub (Mar 7, 2025):
@rgaricano
Thanks, done, though it seems to be ignored (I guess) https://www.reddit.com/r/ollama/comments/1d29wdx/what_happen_with_parameter_num_gpu/
@rgaricano commented on GitHub (Mar 7, 2025):
OK, yes, I see that these are now managed dynamically: the vars gpu_num & NumGPU are there, but they are assigned -1 on the runner and estimated here.