[GH-ISSUE #16650] issue: Streaming output from local Ollama in OpenWebUI is extremely slow (40–50‑token bursts) when WebSocket is disabled #17995
Originally created by @yuliang615 on GitHub (Aug 15, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/16650
Check Existing Issues
Installation Method
Docker
Open WebUI Version
0.6.22
Ollama Version (if applicable)
No response
Operating System
Ubuntu
Browser (if applicable)
Chrome
Confirmation
Expected Behavior
When the streaming option is enabled, the UI should display the model’s output token‑by‑token (or at least in very small chunks, e.g. 1–2 tokens) as the LLM generates it, regardless of whether the model is accessed via the OpenAI API key or via a local Ollama instance.
Actual Behavior
- Using a remote LLM (OpenAI API key) through OpenWebUI → tokens appear smoothly, 1–2 at a time.
- Using a local Ollama instance → the UI lags and only shows roughly 40–50 tokens at a time before the next burst.
- The local Ollama CLI (ollama run … --stream) behaves correctly (1–2 tokens per second); a direct API check is sketched below.
- The problem occurs when the WebSocket setting in OpenWebUI is disabled, which suggests the streaming mechanism is affected.
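For reference, one way to confirm that Ollama itself streams in small chunks is to call its HTTP API directly and watch the chunked JSON arrive. This is a minimal sketch, assuming Ollama listens on its default port 11434 and that a model (the name below is a placeholder) has already been pulled:

```bash
# -N disables curl's output buffering so each streamed JSON chunk prints as it arrives
curl -N http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain quantum entanglement", "stream": true}'
```

If the chunks arrive every token or two here but only in large bursts in the web UI, the delay is introduced somewhere between Open WebUI and the browser rather than in Ollama.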
Steps to Reproduce
1. Install OpenWebUI and Ollama locally (Docker or native); a minimal Docker invocation is sketched after this list.
2. Configure OpenWebUI to use the local Ollama model: in config.json (or via the UI) set model: "ollama" and point it to the local Ollama URL.
3. Disable WebSocket in the OpenWebUI settings (or set websocket: false in the config).
4. Start OpenWebUI.
5. Send a prompt (e.g., “Explain quantum entanglement”) through the web UI.
6. Observe that the output only updates after a large batch of ~40–50 tokens has been generated; the UI feels sluggish.
7. Re‑enable WebSocket (or set websocket: true) and repeat step 5 – the output now streams smoothly, 1–2 tokens at a time.
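A Docker invocation matching these steps might look like the following. The image tag, internal port 8080, and OLLAMA_BASE_URL are the usual Open WebUI conventions, and ENABLE_WEBSOCKET_SUPPORT=false mirrors the “disable WebSocket” step mentioned later in this thread; adjust to your environment:

```bash
# Open WebUI on port 3000, pointed at a host-local Ollama, with WebSocket support disabled
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e ENABLE_WEBSOCKET_SUPPORT=false \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```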
Logs & Screenshots
https://github.com/user-attachments/assets/1df21ac4-bd39-4ee3-b928-1faf47963d53
Additional Information
What we haven’t tested yet:
- Whether the lag disappears when WebSocket is enabled again (i.e., is it strictly a WebSocket issue?).
- Whether the same problem appears if we use the HTTP‑fallback route but keep WebSocket enabled, or if we switch to a different reverse proxy (NGINX, Caddy, etc.).
@yuliang615 commented on GitHub (Aug 15, 2025):
Solved:
Nginx by default buffers HTTP responses – unless told otherwise, it will keep the entire response in memory until the backend (Ollama) finishes sending it.
How to fix it:
Add Nginx settings:
```nginx
location / {
    proxy_pass http://127.0.0.1:3000;
}
```
Adding ENABLE_WEBSOCKET_SUPPORT=false to the Docker startup parameters is only a temporary workaround. The correct fix is to enable WebSocket support in Nginx.
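A fuller sketch of what such a location block might look like, combining the WebSocket upgrade with unbuffered responses. The upstream address and port are taken from the snippet above; values such as the read timeout are illustrative:

```nginx
location / {
    proxy_pass http://127.0.0.1:3000;

    # Stream each chunk to the browser instead of buffering the whole response
    proxy_buffering off;
    proxy_cache off;

    # Forward the WebSocket upgrade so Open WebUI's socket connection works through Nginx
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    # Preserve the original host and client address
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # Long-lived streaming/WebSocket connections need a generous read timeout
    proxy_read_timeout 300s;
}
```

With buffering off, Nginx forwards each chunk as soon as the backend emits it, which matches the smooth 1–2 token updates seen with the remote OpenAI endpoint.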
@tjbck commented on GitHub (Aug 16, 2025):
Websocket support is required.