[GH-ISSUE #16650] issue: Streaming output from local Ollama in OpenWebUI is extremely slow (40–50‑token bursts) when WebSocket is disabled #17995

Closed
opened 2026-04-19 23:54:07 -05:00 by GiteaMirror · 3 comments

Originally created by @yuliang615 on GitHub (Aug 15, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/16650

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

0.6.22

Ollama Version (if applicable)

No response

Operating System

Ubuntu

Browser (if applicable)

Chrome

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When the streaming option is enabled, the UI should display the model’s output token‑by‑token (or at least in very small chunks, e.g. 1–2 tokens) as the LLM generates it, regardless of whether the model is accessed via the OpenAI API key or via a local Ollama instance.
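For context, an OpenAI-compatible streaming response arrives as Server-Sent Events carrying one small delta each, roughly like the following (fields trimmed for brevity):

    data: {"choices":[{"delta":{"content":"Quantum"}}]}
    data: {"choices":[{"delta":{"content":" entanglement"}}]}
    ...
    data: [DONE]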

Actual Behavior

  • Using a remote LLM (OpenAI API key) through OpenWebUI → tokens appear smoothly, 1–2 at a time.
  • Using a local Ollama instance → the UI lags, showing bursts of roughly 40–50 tokens at a time.
  • The local Ollama CLI (ollama run … --stream) behaves correctly, streaming 1–2 tokens at a time.
  • The problem appears only while the WebSocket setting in OpenWebUI is disabled, which suggests the HTTP streaming fallback is what is affected.
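
One quick way to confirm that Ollama itself emits small chunks is to query its generate endpoint directly with curl's buffering disabled (the model name below is a placeholder):

    # -N turns off curl's output buffering, so each JSON chunk from
    # Ollama prints the moment it arrives.
    curl -N http://localhost:11434/api/generate \
      -d '{"model": "llama3", "prompt": "Explain quantum entanglement", "stream": true}'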

Steps to Reproduce

1. Install OpenWebUI and Ollama locally (Docker or native).
2. Configure OpenWebUI to use the local Ollama model: in config.json (or via the UI) set model: "ollama" and point it at the local Ollama URL.
3. Disable WebSocket in the OpenWebUI settings (or set websocket: false in the config) – see the example command after this list.
4. Start OpenWebUI.
5. Send a prompt (e.g., “Explain quantum entanglement”) through the web UI.
6. Observe that the output only updates after a large batch of ~40–50 tokens has been generated; the UI feels sluggish.
7. Re‑enable WebSocket (or set websocket: true) and repeat step 5 – the output now streams smoothly, 1–2 tokens at a time.
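
For step 3, the comments further down identify ENABLE_WEBSOCKET_SUPPORT as the Docker-level toggle; a sketch of a launch command using it (image tag, port mapping, and OLLAMA_BASE_URL are the common defaults and may need adjusting for your setup):

    # Sketch: forces the HTTP/SSE fallback by disabling WebSocket support.
    docker run -d -p 3000:8080 \
      -e ENABLE_WEBSOCKET_SUPPORT=false \
      -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
      --name open-webui \
      ghcr.io/open-webui/open-webui:main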

Logs & Screenshots

https://github.com/user-attachments/assets/1df21ac4-bd39-4ee3-b928-1faf47963d53

Additional Information

What we haven’t tested yet

  • Whether the lag disappears when WebSocket is enabled again (i.e. is it strictly a WebSocket issue?).
  • Whether the same problem appears if we use the HTTP‑fallback route but keep WebSocket enabled, or if we switch to a different reverse‑proxy (NGINX, Caddy, etc.).
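
For the reverse-proxy comparison, Caddy is a convenient control case because its reverse_proxy directive streams responses and proxies WebSockets without extra directives; a minimal Caddyfile (hostname is a placeholder) would be:

    # Minimal Caddyfile sketch: Caddy flushes streamed responses and
    # handles the WebSocket upgrade automatically, so if the bursts
    # disappear behind Caddy, Nginx buffering is the likely culprit.
    chat.example.com {
        reverse_proxy 127.0.0.1:3000
    }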

GiteaMirror added the bug label 2026-04-19 23:54:08 -05:00

@yuliang615 commented on GitHub (Aug 15, 2025):

[Image: https://github.com/user-attachments/assets/1a5dd893-818c-4c02-af91-ccc338ac53f6]
Why WebSocket was turned off: https://github.com/open-webui/open-webui/discussions/11071

@yuliang615 commented on GitHub (Aug 15, 2025):

Solved:
By default Nginx buffers proxied HTTP responses – chunks from the backend (Open WebUI on port 3000) accumulate in proxy buffers and are only flushed to the client once a buffer fills, which is exactly what produces the 40–50‑token bursts.
Way to fix it:
Add the following Nginx settings:
    location / {
        proxy_pass http://127.0.0.1:3000;

        # Tell Nginx this is a WebSocket/SSE stream
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Forward useful headers
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Allow long‑lived connections needed for streaming
        proxy_connect_timeout 10s;
        proxy_send_timeout    600s;
        proxy_read_timeout    600s;

        # Optional: disable buffering to reduce latency even further
        # proxy_buffering off;
    }
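
One refinement to the block above (a standard Nginx idiom, not part of the original comment): hard-coding Connection "upgrade" on every request can disrupt ordinary keep-alive traffic, so the usual pattern is a map that only upgrades when the client asks for it:

    # Only send "upgrade" when the client actually requested one.
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    # ...and inside the location block, reference the mapped value:
    #     proxy_set_header Connection $connection_upgrade;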

Adding ENABLE_WEBSOCKET_SUPPORT=false to the Docker startup parameters is only a temporary workaround; the correct fix is to enable WebSocket proxying in Nginx as above.
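
Worth noting for the non-WebSocket path: the commented-out proxy_buffering off line is what actually stops Nginx from batching the SSE stream. A sketch that scopes it to the API routes only (the /api/ prefix is an assumption about where the streaming endpoints live):

    # Sketch: disable response buffering only for the streaming API,
    # keeping normal buffering for static assets.
    location /api/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;
    }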


@tjbck commented on GitHub (Aug 16, 2025):

Websocket support is required.
