[GH-ISSUE #15850] issue: missing tokens when streaming on fast inference providers #56358

Closed
opened 2026-05-05 19:13:10 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @kalebwalton on GitHub (Jul 18, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/15850

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.6.16

Ollama Version (if applicable)

No response

Operating System

Windows Sequoia

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

Streaming output provides all streamed content and does not miss any parts

Actual Behavior

Streaming output occasionally misses a stream chunk (a few characters). It is often unnoticeable, and you may assume it's a model issue or an inference-provider issue; however, I have validated that the issue occurs with multiple models from multiple inference providers.

Steps to Reproduce

  1. Work around issue https://github.com/open-webui/open-webui/issues/15848 by monkeypatching `backend/open_webui/utils/middleware.py`, replacing https://github.com/open-webui/open-webui/blob/2470da833679f61619f2275862185259fe7f5168/backend/open_webui/utils/middleware.py#L2042 with `log.debug(f"Error: {e}")` (this enables debug logging to print streaming errors properly; see the sketch after this list)
  2. Run the latest image with `docker run -d --name openwebui -p 3000:8080 -e GLOBAL_LOG_LEVEL=debug -v /path/to/monkeypatched_middleware.py:/app/backend/open_webui/utils/middleware.py -v openwebui-data:/app/backend/data --restart unless-stopped ghcr.io/open-webui/open-webui:latest`
  3. Configure any model, such as Cerebras `qwen-3-235b-a22b` or OpenAI `gpt-4o-mini`
  4. Run a prompt like 'print a bunch of stuff'
  5. Check the logs for errors like those shown below
  6. If you don't see the error, rerun the prompt a few times, or try another prompt that outputs many tokens, and it will show up
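
For reference, a minimal, hypothetical sketch of what the patched handler from step 1 might look like. The function, its name, and the `line` parameter are illustrative assumptions; only the replacement `log.debug(f"Error: {e}")` line comes from this report:

```python
import json
import logging

log = logging.getLogger(__name__)

def parse_sse_line(line: str):
    """Hypothetical stand-in for the per-line JSON parsing in stream_body_handler."""
    try:
        return json.loads(line)
    except Exception as e:
        # The monkeypatch from step 1: log the parse error at debug level so
        # fragmented-chunk failures become visible with GLOBAL_LOG_LEVEL=debug.
        log.debug(f"Error: {e}")
        return None
```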

**NOTE: I believe this happens more frequently on faster streaming models like OpenAI gpt-4o-mini or Cerebras qwen-3-235b-a22b.**

Logs & Screenshots

```
2025-07-18 20:35:40.116 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 139 (char 138) - {}
2025-07-18 20:35:40.137 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 172 (char 171) - {}
2025-07-18 20:35:40.268 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Expecting ':' delimiter: line 1 column 171 (char 170) - {}
2025-07-18 20:35:40.306 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 7 (char 6) - {}
2025-07-18 20:35:40.325 | DEBUG    | open_webui.utils.middleware:stream_body_handler:2058 - Error: Unterminated string starting at: line 1 column 173 (char 172) - {}
```

Additional Information

I have investigated this at length. If you add some debugging after line 1786, you'll find that (1) every so often a valid JSON event is chunked across two `line` iterations, where the first contains part of the data, including the beginning of the JSON string, and the second contains the rest of it; and (2) each `line` does not contain line endings.

I dug around and found https://github.com/open-webui/open-webui/blob/2470da833679f61619f2275862185259fe7f5168/backend/open_webui/routers/openai.py#L865, which seems to use `aiohttp.ClientSession`; I then tried to follow that through and got a bit confused.
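
To illustrate why fragments can appear, here is a minimal sketch of aiohttp-based streaming, under the assumption that the router proxies the upstream response in roughly this way; the function and its names are illustrative, not the actual `routers/openai.py` code:

```python
import aiohttp

async def stream_upstream(url: str, payload: dict):
    """Illustrative aiohttp streaming proxy sketch, not the actual router code."""
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            # iter_any() yields whatever bytes have arrived on the socket.
            # Chunk boundaries follow network reads, not SSE event boundaries,
            # so a single `data: {...}` line can be split across two chunks.
            async for chunk in resp.content.iter_any():
                yield chunk
```

If a downstream consumer treats each chunk as a complete line, any split like this produces the unparseable fragments seen in the logs above.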

I don't know whether the correct solution is to do buffering in Open WebUI's `middleware.py` where it processes the lines (which won't work well, because the line endings are not showing up and you can only key on something like `}`), or whether the solution is something lower-level that prevents SSE JSON lines from ever being fragmented in the first place...
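
To make the buffering option concrete, here is one possible sketch that keys on whether the accumulated text parses as JSON, rather than on line endings. The class name and structure are hypothetical, and it assumes each SSE payload is a single JSON object that may arrive split across fragments:

```python
import json

class SSEJSONReassembler:
    """Buffers partial SSE JSON fragments until they parse (sketch only)."""

    def __init__(self):
        self._pending = ""

    def feed(self, fragment: str):
        """Return a parsed event, or None if more data is still needed."""
        candidate = self._pending + fragment
        try:
            event = json.loads(candidate)
        except json.JSONDecodeError:
            # Incomplete JSON (e.g. "Unterminated string"): keep the fragment
            # and wait for the next chunk instead of dropping the tokens.
            self._pending = candidate
            return None
        self._pending = ""
        return event

# Demo: a delta split across two fragments is recovered, not lost.
r = SSEJSONReassembler()
assert r.feed('{"choices": [{"delta": {"content": "hel') is None
assert r.feed('lo"}}]}') == {"choices": [{"delta": {"content": "hello"}}]}
```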

GiteaMirror added the bug label 2026-05-05 19:13:10 -05:00
Author
Owner

@kalebwalton commented on GitHub (Jul 24, 2025):

I have done additional testing and believe there is a direct correlation between the speed of the inference provider and the number of missing tokens. I believe the fragmentation occurs in Open WebUI's use of a dependent library such as aiohttp. I am not certain whether the fix needs to land in Open WebUI, in aiohttp, or elsewhere, but I think it needs to start with Open WebUI.

I believe this will become more prevalent as inference providers run on faster AI hardware.

Author
Owner

@tjbck commented on GitHub (Aug 4, 2025):

Unable to reproduce, keep us updated!
