[GH-ISSUE #8254] Streaming mode is slow even with fast LLM providers (groq.com for example) #30582
Originally created by @vlebert on GitHub (Jan 1, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/8254
Bug Report
I found out that streaming mode in Open WebUI is actually a bottleneck: the token rate is significantly lower than what the LLM provider can deliver.
Take for example https://groq.com/, which provides extremely fast inference for Llama models (around 250 tokens/second for Llama 3.3).
When used in Open WebUI, the word/token rate is "normal" (comparable to 4o or 3.5 Sonnet, for example).
If you disable stream response mode, the model responds in under a second even for long responses.
In fact, I feel that whatever model is used (4o or 4o-mini, for example), the token rate in Open WebUI is similar, while it should be much higher for 4o-mini.
Is there something wrong with the way streams are handled by Open WebUI?
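One rough way to separate provider speed from UI overhead is to time the raw stream from Groq's OpenAI-compatible endpoint directly, bypassing Open WebUI. A minimal sketch (the model id and the chunks-per-token approximation are assumptions, not exact measurements):

```python
# Rough benchmark of raw streaming speed from Groq's OpenAI-compatible API,
# bypassing Open WebUI entirely. Requires the `openai` package and a
# GROQ_API_KEY environment variable.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

start = time.monotonic()
chunks = 0
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model id; substitute any Groq model
    messages=[{"role": "user", "content": "Write a long text."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.monotonic() - start

# Each content chunk is roughly one token; this is an approximation.
print(f"{chunks} chunks in {elapsed:.2f}s (~{chunks / elapsed:.0f} chunks/s)")
```

If this prints hundreds of chunks per second while the UI renders far fewer, the bottleneck is on the Open WebUI side rather than at the provider.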
Installation Method
Cloudron
Environment
Open WebUI Version: v0.5.2
Operating System: Ubuntu
Browser (if applicable): Chrome
@tjbck commented on GitHub (Jan 1, 2025):
Please read the changelogs: https://docs.openwebui.com/getting-started/advanced-topics/env-configuration#enable_realtime_chat_save
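For context: per the linked docs, ENABLE_REALTIME_CHAT_SAVE persists the chat to the database as chunks stream in; when disabled, the chat is saved only after completion. A simplified sketch of why per-chunk persistence can cap streaming throughput (an illustration only, not Open WebUI's actual code):

```python
# Simplified illustration (NOT Open WebUI's actual implementation) of why
# saving the chat on every streamed chunk can throttle token throughput.
import time


def save_chat_to_db(text: str) -> None:
    """Stand-in for a real database write; assume roughly 10 ms per write."""
    time.sleep(0.01)


def relay_stream(chunks, realtime_save: bool) -> str:
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        if realtime_save:
            # One blocking DB write per chunk: at ~10 ms per write, the relay
            # can never exceed ~100 chunks/s, no matter how fast the provider
            # streams (e.g. Groq's ~250 tokens/s).
            save_chat_to_db("".join(buffer))
    text = "".join(buffer)
    save_chat_to_db(text)  # final save happens either way
    return text
```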
@vlebert commented on GitHub (Jan 1, 2025):
Hi @tjbck
I just tried setting this value to false, but it did not solve my issue.
For example, chatting with a Llama 8B model on Groq should be almost instantaneous. It is not with Open WebUI.
Can you reopen the issue?
@vlebert commented on GitHub (Jan 2, 2025):
@tjbck
I also compared with another chat client (Msty). Streaming with Groq models is really abnormally slow in Open WebUI.
@gamesgao commented on GitHub (Jan 3, 2025):
Hi, @vlebert
I hit the same issue, and the environment variable really did solve it for me.
But what I found (maybe wrong, since I did not double-check) is that you need to set
ENABLE_REALTIME_CHAT_SAVE = False rather than
ENABLE_REALTIME_CHAT_SAVE = false
It seems odd, but it works for me only with False.
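One kind of parsing that would explain a difference between the two spellings is an exact string comparison. A purely hypothetical sketch (not verified against Open WebUI's source):

```python
# Hypothetical parsing (NOT confirmed against Open WebUI's code) under which
# "False" and "false" behave differently: an exact string comparison treats
# only the literal "False" as off, so "false" leaves the flag enabled.
import os

realtime_save = os.environ.get("ENABLE_REALTIME_CHAT_SAVE", "True") != "False"

# A case-insensitive parse would treat both spellings the same:
realtime_save = os.environ.get("ENABLE_REALTIME_CHAT_SAVE", "true").lower() == "true"
```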
@vlebert commented on GitHub (Jan 3, 2025):
Hi @gamesgao
Thanks for your answer. I just tried both `False` and `false` values, but it did not change anything. When I send a simple prompt like "write a long text" to a Groq Llama model, the stream is really slow (approximately one line per second). It should be much faster with this provider.
I am a bit disappointed that this issue was closed before being actually solved, @tjbck.
Should I open a new one with more context?
@vlebert commented on GitHub (Jan 3, 2025):
Hmm, I see from the changelog that this variable was introduced in v0.5.3.
I am still on v0.5.2, which is certainly the reason. I'll post an update after switching to v0.5.3.
@i-iooi-i commented on GitHub (Feb 9, 2025):
@gamesgao Thank you very much, your method works, and the response speed has become much faster. It was a terrible experience; I don't know why the author hasn't fixed this, but I found the right answer in your comment.