[PR #16329] [CLOSED] feat: Batched response streaming #62952

Closed
opened 2026-05-06 07:26:05 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/16329
Author: @Ithanil
Created: 8/6/2025
Status: Closed

Base: dev ← Head: batch_streaming


📝 Commits (6)

  • d4e1c78 batch deltas when streaming responses, for better performance with high token/s (BATCH_SIZE hardcoded)
  • 719d01d reduce minimal delta count for fluid streaming visualization to 3
  • e5b4b9f allow to set streaming batch size per model (and as admin in settings / chat controls)
  • a81294c make sure the highest of all streaming batch size settings is used and allow the option for non-admins
  • dc064ca fix filtering out stream_batch_size from request parameters
  • adbd505 harden extraction of stream_batch_size from form_data

📊 Changes

7 files changed (+88 additions, -7 deletions)

View changed files

📝 backend/open_webui/main.py (+1 -0)
📝 backend/open_webui/utils/middleware.py (+20 -6)
📝 backend/open_webui/utils/payload.py (+1 -0)
📝 src/lib/apis/streaming/index.ts (+1 -1)
📝 src/lib/components/chat/Chat.svelte (+8 -0)
📝 src/lib/components/chat/Settings/Advanced/AdvancedParams.svelte (+55 -0)
📝 src/lib/components/chat/Settings/General.svelte (+2 -0)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • Target branch: Please verify that the pull request targets the dev branch.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Have you updated relevant documentation (Open WebUI Docs) or other documentation sources?
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Have you written and run sufficient tests to validate the changes?
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

Currently, streaming at a high token/s rate via Open WebUI leads to very high CPU usage on the server, on Redis, and on the client. In the worst case, the pubsub messages aren't consumed back from Redis fast enough, leading to a quickly growing output buffer on the Redis server and ultimately a terminated Redis connection. It is possible to render a deployment unusable simply by streaming enough fast responses.

The following is an excerpt of the cProfile output collected while streaming a response:

>>> p.sort_stats(pstats.SortKey.CUMULATIVE).print_stats()
Tue Aug  5 22:24:25 2025   profiling_results.prof

         29276198 function calls (28451782 primitive calls) in 42.148 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   131723    0.307    0.000   21.347    0.000 /usr/local/lib/python3.11/site-packages/socketio/async_server.py:129(emit)
   131723    0.294    0.000   20.924    0.000 /usr/local/lib/python3.11/site-packages/socketio/async_pubsub_manager.py:40(emit)
   204781    0.316    0.000   11.431    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/retry.py:58(call_with_retry)
   125999    0.251    0.000   10.747    0.000 /usr/local/lib/python3.11/site-packages/socketio/async_redis_manager.py:73(_publish)
    68688    0.317    0.000    9.869    0.000 /usr/local/lib/python3.11/site-packages/socketio/async_pubsub_manager.py:138(_handle_emit)
   126004    0.477    0.000    9.838    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/client.py:667(execute_command)
    68688    0.589    0.000    9.485    0.000 /usr/local/lib/python3.11/site-packages/socketio/async_manager.py:13(emit)
    15837    0.496    0.000    8.688    0.001 /usr/local/lib/python3.11/site-packages/socketio/async_pubsub_manager.py:197(_thread)
   204797    0.425    0.000    7.624    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/connection.py:567(read_response)
    78801    0.112    0.000    7.074    0.000 /usr/local/lib/python3.11/site-packages/socketio/async_redis_manager.py:112(_listen)
   204797    0.360    0.000    7.046    0.000 /usr/local/lib/python3.11/site-packages/redis/_parsers/resp2.py:74(read_response)
    78798    0.084    0.000    6.961    0.000 /usr/local/lib/python3.11/site-packages/socketio/async_redis_manager.py:91(_redis_listen_with_retries)
    78801    0.202    0.000    6.877    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/client.py:1094(listen)
    11193    0.015    0.000    6.791    0.001 /app/backend/open_webui/utils/middleware.py:1407(response_handler)
    11184    0.281    0.000    6.701    0.001 /app/backend/open_webui/utils/middleware.py:1801(stream_body_handler)
395774/204797    1.047    0.000    6.540    0.000 /usr/local/lib/python3.11/site-packages/redis/_parsers/resp2.py:87(_read_response)
    78802    0.259    0.000    6.234    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/client.py:969(parse_response)
    62965    0.332    0.000    5.972    0.000 /usr/local/lib/python3.11/site-packages/socketio/packet.py:45(encode)
    78804    0.120    0.000    5.917    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/client.py:956(_execute)
    63030    0.257    0.000    5.659    0.000 /usr/local/lib/python3.11/json/__init__.py:183(dumps)
   125930    0.232    0.000    5.337    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/client.py:647(_send_command_parse_response)
    63030    0.214    0.000    5.308    0.000 /usr/local/lib/python3.11/json/encoder.py:183(encode)
    63030    5.052    0.000    5.052    0.000 /usr/local/lib/python3.11/json/encoder.py:205(iterencode)
    65045    0.194    0.000    3.970    0.000 /usr/local/lib/python3.11/site-packages/redis/_parsers/resp2.py:123(<listcomp>)
   391627    1.126    0.000    3.786    0.000 /usr/local/lib/python3.11/site-packages/redis/_parsers/base.py:273(_readline)
    11448    0.113    0.000    3.625    0.000 /app/backend/open_webui/socket/main.py:633(__event_emitter__)
    63051    0.249    0.000    2.742    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/connection.py:1131(get_connection)
     5731    0.061    0.000    2.742    0.000 /usr/local/lib/python3.11/site-packages/engineio/async_socket.py:198(writer)
    63012    0.175    0.000    2.630    0.000 /usr/local/lib/python3.11/site-packages/redis/asyncio/connection.py:552(send_command)
     5724    0.012    0.000    2.575    0.000 /app/backend/open_webui/socket/utils.py:74(get)
     5729    0.036    0.000    2.566    0.000 /app/backend/open_webui/socket/utils.py:48(__getitem__)

Clearly, most of the time is spent in code related to SocketIO/Redis emission and to streaming in the middleware.
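
For reference, a minimal sketch of how such a profile can be captured and inspected with Python's cProfile and pstats (the instrumentation point is illustrative; the PR does not specify how the profile was collected):

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... let the server handle streaming requests for a while ...
profiler.disable()
profiler.dump_stats("profiling_results.prof")

# Load the dump and print functions sorted by cumulative time, as in the output above.
p = pstats.Stats("profiling_results.prof")
p.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(30)
```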

This PR introduces one way to reduce the problem: the option to batch multiple tokens together before emitting them during streaming. In proportion to the batch size, this reduces the number of events emitted, the number of pubsub messages, and the amount of data processed by the client, improving performance in all components.
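
Conceptually, the middleware change amounts to accumulating deltas and emitting them as one event once a batch is full. A minimal sketch of the idea, assuming a hypothetical `emit` coroutine and an async iterator of delta strings (not the actual code from utils/middleware.py):

```python
from typing import AsyncIterator, Awaitable, Callable


async def emit_batched(
    deltas: AsyncIterator[str],
    emit: Callable[[str], Awaitable[None]],
    batch_size: int = 1,
) -> None:
    """Accumulate streamed deltas and emit one concatenated chunk per batch."""
    buffer: list[str] = []
    async for delta in deltas:
        buffer.append(delta)
        if len(buffer) >= batch_size:
            await emit("".join(buffer))  # one event instead of batch_size events
            buffer.clear()
    if buffer:
        await emit("".join(buffer))  # flush any remainder at the end of the stream
```

With `batch_size=1` (the default) behavior is unchanged; larger values divide the number of emitted events, pubsub messages, and client updates accordingly.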

Because the optimal batch size for visually fluid streaming depends on the token/s generated by the given model, the setting is introduced as an "Advanced parameter", configurable for each model individually as well as per user in the settings or in chat controls. The highest value takes precedence, so the admin retains control of the minimum acceptable batch size. The default batch size remains 1, i.e. no batching.
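
A minimal sketch of that precedence rule, with illustrative dictionaries for the model, user, and chat-control parameters (the actual identifiers in the PR may differ):

```python
def resolve_stream_batch_size(
    model_params: dict, user_settings: dict, chat_params: dict
) -> int:
    """Use the largest configured value so the admin-set model value acts as a floor."""
    candidates = (
        model_params.get("stream_batch_size"),
        user_settings.get("stream_batch_size"),
        chat_params.get("stream_batch_size"),
    )
    values = [int(v) for v in candidates if v is not None]
    return max(values) if values else 1  # default 1 means no batching
```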

Streaming fluidity is maintained for considerable batch sizes, depending on generation speed, by the fluid-streaming mechanism in the frontend code; the minimum number of deltas required for it to apply is reduced from 5 to 3.

As a result, setting a batch size of just 3 reduces CPU usage on the server by about 50%, with minimal loss in fluidity at decent token generation rates.

Added

  • Delta batching for response streaming
  • Configurable batch size per model, user, chat

Changed

  • Tweaked threshold for fluid streaming mechanism in frontend

Screenshots or Videos

https://github.com/user-attachments/assets/842fcac2-8128-4384-ac1b-d38b4ccea084

https://github.com/user-attachments/assets/7a361676-922f-4642-9ab2-b59f8b6ff57d

Screenshot From 2025-08-06 14-52-55: https://github.com/user-attachments/assets/0cc69a50-5644-4cdd-bdba-587f3103e02d

Additional notes

An argument could be made that the term "buffer" should be used instead of "batch", affecting the variable names used. Please let me know if you prefer to call it buffer.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 07:26:05 -05:00

Reference: github-starred/open-webui#62952