[GH-ISSUE #18475] issue: When uploading a large document (full context) and asking a question, there is an issue with token generation streaming synchronization. #18607

GiteaMirror · 2026-04-20T00:49:26-05:00

GiteaMirror commented

2026-04-20 00:49:26 -05:00

Originally created by @Cyp9715 on GitHub (Oct 21, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/18475

Check Existing Issues

- [x] I have searched for any existing and/or related issues.
- [x] I have searched for any existing and/or related discussions.
- [x] I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.6.34

Ollama Version (if applicable)

No response

Operating System

Ubuntu 24.04

Browser (if applicable)

Firefox 144.0

Confirmation

- [x] I have read and followed all instructions in README.md.
- [x] I am using the latest version of both Open WebUI and Ollama.
- [ ] I have included the browser console logs.
- [ ] I have included the Docker container logs.
- [x] I have provided every relevant configuration, setting, and environment variable used in my setup.
- [x] I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
- [x] I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
    - Start with the initial platform/version/OS and dependencies used,
    - Specify exact install/launch/configure commands,
    - List URLs visited, user input (incl. example values/emails/passwords if needed),
    - Describe all options and toggles enabled or changed,
    - Include any files or environmental changes,
    - Identify the expected and actual result at each stage,
    - Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

After the user sends a prompt, if all tokens have been correctly decoded in the background, OpenWebUI should display them immediately.

Actual Behavior

After attaching a large document (200k tokens), enabling full-context mode, and asking a question, OpenWebUI displays the answer's tokens extremely slowly in the browser window. If you open a new tab and refresh the page, you can confirm that all tokens have already been generated and are present within OpenWebUI.

In other words, OpenWebUI renders the response significantly slower than the model generates it. The lag is not a minor delay of 1–2 seconds but at least 20 seconds, which severely impacts the user experience.

Steps to Reproduce

All environments are running within Docker, using the following commands:

docker run --name vllm-qwen3 \
           --runtime nvidia --gpus all \
           -v ~/.cache/huggingface:/root/.cache/huggingface \
           --restart unless-stopped \
           -p 172.17.0.1:{port}:{port} \
           --shm-size=32g \
           vllm/vllm-openai:v0.11.0 \
           --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
           --tensor-parallel-size 2 \
           --max-num-batched-tokens 16384 \
           --gpu-memory-utilization 0.75

docker run -p {port}:8080 --gpus all \
           --restart unless-stopped \
           -v open-webui-pg:/app/backend/data \
           --name open-webui-test \
           -e TZ=Asia/Seoul \
           -e WEBUI_SECRET_KEY={key} \
           -e VECTOR_DB="qdrant" \
           -e QDRANT_URI="http://172.17.0.1:{port}" \
           -e DATABASE_URL='postgresql://postgres:{password}@172.17.0.1:{port}/openwebui' \
           ghcr.io/open-webui/open-webui:main

Logs & Screenshots

Screenshot: https://github.com/user-attachments/assets/2e403e42-7a3f-4156-8e7a-22556eca7990

In the browser window on the right, the user has just entered the question, and the response is still being streamed.
In the browser window on the left, which shows the same page opened separately, the full response has already been generated (also confirmed by OpenWebUI's "response completed" notification).

Additional Information

This issue occurs in Chrome as well as Firefox. Monitoring via nvidia-smi shows that vLLM has completed token generation well before the UI begins to display the tokens. When opening a new tab and reloading the OpenWebUI page, all tokens are already fully rendered and visible, yet in the original tab where the question was submitted, token streaming is painfully slow. This discrepancy points to a client-side rendering or event-stream synchronization bug specific to the active chat tab rather than a backend or model performance issue.

GiteaMirror added the bug label 2026-04-20 00:49:26 -05:00

GiteaMirror closed this issue

2026-04-20 00:49:27 -05:00

GiteaMirror commented

2026-04-20 00:49:27 -05:00

@rgaricano commented on GitHub (Oct 21, 2025):

Have you tried:

- Disabling the "Fade Effect for Streaming Text" setting (go to Settings > Interface and turn off the "Fade Effect for Streaming Text" toggle)
- Increasing the "Stream Delta Chunk Size" value (under Admin Settings > General)

I think that the slowdown occurs after SSE parsing. When splitLargeDeltas is enabled:

- Content chunks larger than 5 characters are artificially split into 1-3 character pieces
- Each mini-chunk has a 5ms delay (await sleep(5)) between yields
- For a 200-character chunk, this adds ~330-1000ms of artificial delay
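
The arithmetic behind that estimate can be sketched as follows. This is a minimal illustration, not the Open WebUI source: `splitIntoPieces` and `estimateSplitDelayMs` are hypothetical names, and the sketch assumes pieces of 1 to 3 characters with one 5 ms sleep between consecutive pieces, as described in the list above.

```typescript
// Hypothetical sketch; not the actual Open WebUI implementation.
// Assumes large deltas are cut into pieces of 1-3 characters.

/** Split `content` into pieces whose lengths are drawn from [minLen, maxLen]. */
function splitIntoPieces(
  content: string,
  minLen = 1,
  maxLen = 3,
  random: () => number = Math.random
): string[] {
  const pieces: string[] = [];
  let i = 0;
  while (i < content.length) {
    // Pick a piece length in [minLen, maxLen]; the last piece may be shorter.
    const len = minLen + Math.floor(random() * (maxLen - minLen + 1));
    pieces.push(content.slice(i, i + len));
    i += len;
  }
  return pieces;
}

/** Lower and upper bound on the delay added to one chunk, in milliseconds. */
function estimateSplitDelayMs(
  chunkLength: number,
  sleepMs = 5,
  minLen = 1,
  maxLen = 3
): { minMs: number; maxMs: number } {
  // Fewest pieces when every piece has maxLen characters,
  // most pieces when every piece has minLen characters.
  const fewestPieces = Math.ceil(chunkLength / maxLen);
  const mostPieces = Math.ceil(chunkLength / minLen);
  // One sleep between consecutive pieces, so (pieces - 1) sleeps per chunk.
  return {
    minMs: Math.max(0, fewestPieces - 1) * sleepMs,
    maxMs: Math.max(0, mostPieces - 1) * sleepMs,
  };
}

console.log(estimateSplitDelayMs(200)); // { minMs: 330, maxMs: 995 }
```

These bounds only count the requested sleep time. Timer delays are minimums, so the delay observed in a browser may be longer than this estimate.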

With large-context responses that deliver hundreds of tokens per SSE event, this chunking is probably what causes the slowness you are seeing.

Note: the delay is hardcoded here:
https://github.com/open-webui/open-webui/blob/46ae3f4f5d7d4d706041bdae4ad2d802e568712b/src/lib/apis/streaming/index.ts#L135
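
To illustrate how per-chunk delays of this size could add up to the lag of 20 seconds or more reported in this issue, the display loop can be modeled as a simple queue. This is a rough model with made-up numbers, not a measurement: `finalDisplayLagMs` is a hypothetical helper, and it assumes that events arrive at a fixed interval and that each event is fully displayed before the next one starts.

```typescript
// Rough queueing model; all numbers are illustrative assumptions, not measurements.

/**
 * Simulate how far the display falls behind the stream.
 * Event i arrives at i * arrivalIntervalMs and takes displayMsPerEvent to
 * display; events are displayed strictly one after another.
 * Returns the time (ms) between the last event arriving and it being fully displayed.
 */
function finalDisplayLagMs(
  eventCount: number,
  arrivalIntervalMs: number,
  displayMsPerEvent: number
): number {
  let displayFreeAt = 0; // time at which the display loop becomes idle
  let lastArrival = 0;
  for (let i = 0; i < eventCount; i++) {
    lastArrival = i * arrivalIntervalMs;
    // Display starts when the event has arrived and the previous one is done.
    const start = Math.max(lastArrival, displayFreeAt);
    displayFreeAt = start + displayMsPerEvent;
  }
  return displayFreeAt - lastArrival;
}

// Assumed: 100 events, one every 250 ms, each taking about 500 ms to display.
console.log(finalDisplayLagMs(100, 250, 500)); // 25250
// Assumed: same stream, but each event takes only 100 ms to display.
console.log(finalDisplayLagMs(100, 250, 100)); // 100
```

In this model the lag stays bounded only while displaying an event takes no longer than the interval between events. Once display is slower than arrival, the backlog grows with every event.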

GiteaMirror commented

2026-04-20 00:49:28 -05:00

@Cyp9715 commented on GitHub (Oct 23, 2025):

Increasing the "Stream Delta Chunk Size" value has resolved the issue. Is this due to a performance issue with the client PC?
Thank you for your help. You may close this issue.

GiteaMirror commented

2026-04-20 00:49:28 -05:00

@hdnh2006 commented on GitHub (Nov 26, 2025):

> Have you tried:
>
> - Disabling the "Fade Effect for Streaming Text" setting (go to Settings > Interface and turn off the "Fade Effect for Streaming Text" toggle)
> - Adjusting Stream Delta Chunk Size, increasing it ("Stream Delta Chunk Size" in adminSettings/General)
>
> I think that the slowdown occurs after SSE parsing. When splitLargeDeltas is enabled:
>
> - Content chunks larger than 5 characters are artificially split into 1-3 character pieces
> - Each mini-chunk has a 5ms delay (await sleep(5)) between yields
> - For a 200-character chunk, this adds ~330-1000ms of artificial delay
>
> With large context responses generating hundreds of tokens per SSE event, probably this chunking creates the perceived slowness you're experiencing.
>
> Note: the delay is hardcoded here:
>
> open-webui/src/lib/apis/streaming/index.ts
>
> Line 135 in 46ae3f4
>
> await sleep(5);

Sorry mate, but I can't find that option in Settings/General. I am facing the same issue.

@Cyp9715 any idea?

Screenshot: https://github.com/user-attachments/assets/7c7df773-d7cd-404e-aae0-485c6fe16abb

GiteaMirror commented

2026-04-20 00:49:29 -05:00

@hdnh2006 commented on GitHub (Nov 26, 2025):

OK, it looks like my problem was similar, but I solved it just by setting UVICORN_WORKERS=1. Previously it was set to 4.

This solved my issue.

GiteaMirror referenced this issue

2026-04-20 05:37:13 -05:00

[PR #18607] [CLOSED] feat: Add calling system prompt #24852
