[GH-ISSUE #8254] Streaming mode is slow even with fast LLM providers (groq.com for example) #15054

Closed
opened 2026-04-19 21:20:23 -05:00 by GiteaMirror · 7 comments

Originally created by @vlebert on GitHub (Jan 1, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/8254

Bug Report

I found that streaming mode in Open WebUI is actually a bottleneck: the token rate is much lower than what the LLM provider can deliver.

Take https://groq.com/ for example, which provides extremely fast inference for Llama models (250 tokens/second for Llama 3.3).

When used in Open WebUI, the word/token rate is "normal" (comparable to 4o or 3.5 Sonnet, for example).
If you disable stream response mode, the full response arrives in under a second, even for long responses.

Actually, I feel that whatever model is used (4o or 4o-mini, for example), the token rate in Open WebUI is similar, while it should be a lot higher for 4o-mini.

Is there something wrong with the way streams are handled by Open WebUI?


Installation Method

Cloudron

Environment

  • Open WebUI Version: v0.5.2

  • Operating System: Ubuntu

  • Browser (if applicable): Chrome


@tjbck commented on GitHub (Jan 1, 2025):

Please read the changelogs: https://docs.openwebui.com/getting-started/advanced-topics/env-configuration#enable_realtime_chat_save
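For anyone looking for how to apply this, a minimal sketch for a plain Docker install (Cloudron users would set it through the app's environment configuration instead). The variable name is from the linked docs; the image tag and other flags follow the standard Open WebUI Docker run command and may need adjusting for your setup:

```bash
# Disable per-chunk database writes during streaming.
# Variable name from the linked docs; ports, volume, and image tag
# follow the standard Open WebUI install command.
docker run -d \
  -p 3000:8080 \
  -e ENABLE_REALTIME_CHAT_SAVE=False \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```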


@vlebert commented on GitHub (Jan 1, 2025):

Hi @tjbck

I just tried setting this value to false. It did not solve my issue.

For example, chatting with a Llama 8B on Groq should be almost instantaneous; it is not with Open WebUI.

Can you reopen the issue?
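One way to localize a bottleneck like this is to stream the same prompt directly from the provider and measure the chunk rate: if the raw rate is high while the UI renders slowly, the slowdown is client-side. A rough sketch, assuming Groq's OpenAI-compatible endpoint and an illustrative model id (swap in whatever you have access to):

```python
# Rough streaming benchmark against an OpenAI-compatible endpoint.
# Assumes Groq's API and an illustrative model id; both are easy to swap.
import json
import os
import time

import requests

URL = "https://api.groq.com/openai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}
PAYLOAD = {
    "model": "llama-3.1-8b-instant",  # illustrative model id
    "messages": [{"role": "user", "content": "Write a long text."}],
    "stream": True,
}

start = time.monotonic()
chunks, chars = 0, 0
with requests.post(URL, headers=HEADERS, json=PAYLOAD,
                   stream=True, timeout=60) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        # SSE frames look like: data: {...json...}  or  data: [DONE]
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"].get("content") or ""
        chunks += 1
        chars += len(delta)

elapsed = time.monotonic() - start
print(f"{chunks} chunks / {chars} chars in {elapsed:.2f}s "
      f"({chars / elapsed:.0f} chars/s)")
```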


@vlebert commented on GitHub (Jan 2, 2025):

@tjbck

I also compared with another chat client (Msty). Streaming with Groq models is really abnormally slow in Open WebUI.


@gamesgao commented on GitHub (Jan 3, 2025):

Hi @vlebert,

I also hit the same issue, and the env variable really does solve it.
But what I found (maybe wrong, since I did not double-check) is that you need to set
`ENABLE_REALTIME_CHAT_SAVE = False` rather than
`ENABLE_REALTIME_CHAT_SAVE = false`.

It sounds odd, but it works for me when I use `False`.
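How Open WebUI actually parses this variable isn't confirmed in this thread, but a hypothetical, purely illustrative parse shows how such case sensitivity could arise:

```python
# Hypothetical illustration (not Open WebUI's actual parsing code):
# how a boolean env var can end up case-sensitive.
import os

raw = os.environ.get("ENABLE_REALTIME_CHAT_SAVE", "True")

# Literal comparison: only the exact spelling "False" disables the
# feature; "false" leaves it enabled.
enabled_strict = raw != "False"

# Case-insensitive parse: "false", "False", "FALSE" all disable it.
enabled_lenient = raw.lower() != "false"

print(f"strict: {enabled_strict}, lenient: {enabled_lenient}")
```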


@vlebert commented on GitHub (Jan 3, 2025):

Hi @gamesgao

Thanks for your answer. I just tried both `False` and `false` values, but it did not change anything. When I give a Groq Llama model a simple prompt like "write a long text", the stream is really slow (approximately one line per second).

It should be a lot quicker with this AI provider.

I am a bit disappointed that this issue was closed before being actually solved, @tjbck.
Should I open a new one with more context?


@vlebert commented on GitHub (Jan 3, 2025):

Hmm, I see from the changelog that this value was introduced in version 0.5.3.
I am still on 0.5.2, which is certainly the reason. I'll post an update after switching to 0.5.3.


@i-iooi-i commented on GitHub (Feb 9, 2025):

@gamesgao Thank you very much, your method works, and the response speed has become much faster. It was a terrible experience before; I don't know why the author didn't resolve this directly, but I found the right answer in your comment.

Reference: github-starred/open-webui#15054