[PR #16329] [CLOSED] feat: Batched response streaming #39726
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/16329
Author: @Ithanil
Created: 8/6/2025
Status: ❌ Closed
Base: dev ← Head: batch_streaming
📝 Commits (6)
d4e1c78 batch deltas when streaming responses, for better performance with high token/s (BATCH_SIZE hardcoded)
719d01d reduce minimal delta count for fluid streaming visualization to 3
e5b4b9f allow to set streaming batch size per model (and as admin in settings / chat controls)
a81294c make sure the highest of all streaming batch size settings is used and allow the option for non-admins
dc064ca fix filtering out stream_batch_size from request parameters
adbd505 harden extraction of stream_batch_size from form_data
📊 Changes
7 files changed (+88 additions, -7 deletions)
📝 backend/open_webui/main.py (+1 -0)
📝 backend/open_webui/utils/middleware.py (+20 -6)
📝 backend/open_webui/utils/payload.py (+1 -0)
📝 src/lib/apis/streaming/index.ts (+1 -1)
📝 src/lib/components/chat/Chat.svelte (+8 -0)
📝 src/lib/components/chat/Settings/Advanced/AdvancedParams.svelte (+55 -0)
📝 src/lib/components/chat/Settings/General.svelte (+2 -0)
📄 Description
Pull Request Checklist
Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.
Before submitting, make sure you've checked the following:
The pull request targets the dev branch.
Changelog Entry
Description
Currently, streaming responses at a high token/s rate through Open WebUI causes very high CPU usage on the server, on Redis, and on the client. In the worst case, the pubsub messages are not consumed back from Redis fast enough, so the output buffer on the Redis server grows quickly and the Redis connection is ultimately terminated. Streaming enough fast responses can render a deployment unusable.
Profiling with cProfile while streaming a response shows that most of the time is spent in code related to Socket.IO/Redis and in the streaming path of the middleware.
This PR introduces one way to mitigate the problem: the ability to batch multiple tokens together before they are emitted during streaming. In proportion to the batch size, this reduces the number of events emitted, the number of pubsub messages, and the amount of data processed by the client, improving performance in all components.
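A minimal sketch of the batching idea, not the PR's actual middleware code (names such as `token_stream` and `emit_delta` are placeholders):

```python
# Illustrative sketch only: buffer streamed deltas and emit them in batches,
# so one event (and one Redis pubsub message) covers several tokens.
async def stream_with_batching(token_stream, emit_delta, stream_batch_size: int = 1):
    buffer = []
    async for delta in token_stream:
        buffer.append(delta)
        if len(buffer) >= stream_batch_size:
            await emit_delta("".join(buffer))  # one emission per batch
            buffer.clear()
    if buffer:
        await emit_delta("".join(buffer))  # flush the remainder at end of stream
```

With `stream_batch_size=1` this behaves exactly like unbatched streaming, which is why the default can stay at 1.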
Because the optimal batch size for visually fluid streaming depends on the token/s generated by the given model, the setting is introduced as an "Advanced parameter" that can be configured per model, but also per user in the settings or in the chat controls. The highest configured value takes precedence, so the admin retains control of the minimum acceptable batch size. The default batch size remains 1, i.e. no batching.
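The commit messages reference a `stream_batch_size` parameter; roughly, the "highest value wins" resolution could look like the following sketch (assuming the three sources are available as plain dicts, which is not necessarily how the PR implements it):

```python
# Sketch only: pick the highest configured value so an admin-set model-level
# batch size acts as a lower bound; default to 1 (no batching).
def resolve_stream_batch_size(model_params: dict, user_settings: dict, chat_controls: dict) -> int:
    candidates = (
        model_params.get("stream_batch_size"),
        user_settings.get("stream_batch_size"),
        chat_controls.get("stream_batch_size"),
    )
    values = [int(v) for v in candidates if v is not None]
    return max(values) if values else 1
```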
Fluidity of the streaming is maintained for considerable batch sizes, depending on generation speed, by the fluid streaming mechanism in the frontend code; the minimum threshold for it to apply is reduced from 5 to 3 deltas.
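For illustration only, a rough sketch of that fluid-streaming idea (the real implementation lives in the Svelte/TypeScript frontend; the word-level splitting and the constant name here are assumptions):

```python
FLUID_THRESHOLD = 3  # minimum number of pieces before fluid rendering kicks in (lowered from 5)

def split_for_fluid_rendering(delta: str) -> list:
    # Split a batched delta into word-sized pieces so the UI can reveal them
    # gradually; deltas below the threshold are rendered in one go.
    pieces = delta.split(" ")
    if len(pieces) < FLUID_THRESHOLD:
        return [delta]
    # Re-attach separators so joining the pieces reproduces the original delta.
    return [p + " " for p in pieces[:-1]] + [pieces[-1]]
```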
As a result, a batch size of just 3 already reduces CPU usage on the server by about 50%, with minimal loss of fluidity at decent token generation rates.
Added
Changed
Screenshots or Videos
https://github.com/user-attachments/assets/842fcac2-8128-4384-ac1b-d38b4ccea084
https://github.com/user-attachments/assets/7a361676-922f-4642-9ab2-b59f8b6ff57d
Additional notes
An argument could be made that the term "buffer" should be used instead of "batch", which would also affect the variable names. Please let me know if you would prefer "buffer".
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.