[GH-ISSUE #8254] Streaming mode is slow even with fast LLM providers (groq.com for example) #15054

Closed
opened 2026-04-19 21:20:23 -05:00 by GiteaMirror · 7 comments

Originally created by @vlebert on GitHub (Jan 1, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/8254

Bug Report

I found that streaming mode in Open WebUI is actually a bottleneck: the token rate is much lower than what the LLM provider can deliver.

Take https://groq.com/ for example, which provides extremely fast inference for Llama models (250 tokens/second for Llama 3.3).

When used in Open WebUI, the word/token rate is "normal" (comparable to 4o or 3.5 Sonnet, for example).
If you disable stream response mode, the full response arrives in under a second, even for long responses.

Actually, I feel that whatever model is used (4o or 4o-mini, for example), the token rate in Open WebUI is similar, while it should be a lot higher for 4o-mini.

Is there something wrong with the way streams are handled by Open WebUI?


Installation Method

Cloudron

Environment

  • Open WebUI Version: v0.5.2

  • Operating System: Ubuntu

  • Browser (if applicable): Chrome


@tjbck commented on GitHub (Jan 1, 2025):

Please read the changelogs: https://docs.openwebui.com/getting-started/advanced-topics/env-configuration#enable_realtime_chat_save
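For anyone looking for how to apply this, a minimal sketch for a plain Docker install (Cloudron users would set it through the app's environment configuration instead). The variable name is from the linked docs; the image tag and other flags follow the standard Open WebUI Docker run command and may need adjusting for your setup:

```bash
# Disable per-chunk database writes during streaming.
# Variable name from the linked docs; ports, volume, and image tag
# follow the standard Open WebUI install command.
docker run -d \
  -p 3000:8080 \
  -e ENABLE_REALTIME_CHAT_SAVE=False \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```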


@vlebert commented on GitHub (Jan 1, 2025):

Hi @tjbck

I just tried setting this value to false. It did not solve my issue.

For example, chatting with a Llama 8B on Groq should be almost instantaneous; it is not with Open WebUI.

Can you reopen the issue?
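One way to localize a bottleneck like this is to stream the same prompt directly from the provider and measure the chunk rate: if the raw rate is high while the UI renders slowly, the slowdown is client-side. A rough sketch, assuming Groq's OpenAI-compatible endpoint and an illustrative model id (swap in whatever you have access to):

```python
# Rough streaming benchmark against an OpenAI-compatible endpoint.
# Assumes Groq's API and an illustrative model id; both are easy to swap.
import json
import os
import time

import requests

URL = "https://api.groq.com/openai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}
PAYLOAD = {
    "model": "llama-3.1-8b-instant",  # illustrative model id
    "messages": [{"role": "user", "content": "Write a long text."}],
    "stream": True,
}

start = time.monotonic()
chunks, chars = 0, 0
with requests.post(URL, headers=HEADERS, json=PAYLOAD,
                   stream=True, timeout=60) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        # SSE frames look like: data: {...json...}  or  data: [DONE]
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"].get("content") or ""
        chunks += 1
        chars += len(delta)

elapsed = time.monotonic() - start
print(f"{chunks} chunks / {chars} chars in {elapsed:.2f}s "
      f"({chars / elapsed:.0f} chars/s)")
```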


@vlebert commented on GitHub (Jan 2, 2025):

@tjbck

I also compared with another chat client (Msty). Streaming with Groq models is really abnormally slow in Open WebUI.


@gamesgao commented on GitHub (Jan 3, 2025):

Hi @vlebert,

I also hit the same issue, and the env variable really does solve it.
But what I found (maybe wrong, since I did not double-check) is that you need to set
`ENABLE_REALTIME_CHAT_SAVE = False` rather than
`ENABLE_REALTIME_CHAT_SAVE = false`.

It sounds odd, but it works for me when I use `False`.
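How Open WebUI actually parses this variable isn't confirmed in this thread, but a hypothetical, purely illustrative parse shows how such case sensitivity could arise:

```python
# Hypothetical illustration (not Open WebUI's actual parsing code):
# how a boolean env var can end up case-sensitive.
import os

raw = os.environ.get("ENABLE_REALTIME_CHAT_SAVE", "True")

# Literal comparison: only the exact spelling "False" disables the
# feature; "false" leaves it enabled.
enabled_strict = raw != "False"

# Case-insensitive parse: "false", "False", "FALSE" all disable it.
enabled_lenient = raw.lower() != "false"

print(f"strict: {enabled_strict}, lenient: {enabled_lenient}")
```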


@vlebert commented on GitHub (Jan 3, 2025):

Hi @gamesgao

Thanks for your answer. I just tried both `False` and `false` values, but it did not change anything. When I give a Groq Llama model a simple prompt like "write a long text", the stream is really slow (approximately one line per second).

It should be a lot quicker with this AI provider.

I am a bit disappointed that this issue was closed before being actually solved, @tjbck.
Should I open a new one with more context?


@vlebert commented on GitHub (Jan 3, 2025):

Hmm, I see from the changelog that this value was introduced in version 0.5.3.
I am still on 0.5.2, which is certainly the reason. I'll post an update after switching to 0.5.3.


@i-iooi-i commented on GitHub (Feb 9, 2025):

@gamesgao Thank you very much, your method works, and the response speed has become much faster. It was a terrible experience before; I don't know why the author didn't resolve this directly, but I found the right answer in your comment.

Reference: github-starred/open-webui#15054