mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-21 09:01:47 -05:00
[GH-ISSUE #23733] issue/perf: Exponential growth of backend, frontend and network bandwidth usage with growing chat length #35585
Originally created by @Classic298 on GitHub (Apr 14, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23733
Socket.IO emits grow O(N²) during LLM streaming: full message is re-serialized on every token
Thanks to @ShirasawaSama !!
TLDR: Open WebUI re-sends and re-renders the whole message on every single token, so CPU, memory, bandwidth, and Redis usage grow quadratically as the response gets longer. It gets even worse with high concurrency and long chats.
Summary
During an LLM streaming response, the backend re-serializes the entire accumulated output (all prior text, reasoning blocks, tool calls, images, sources) into one HTML string and emits it via Socket.IO on every SSE event. As a response grows, each WebSocket frame grows with it, so total bytes on the wire for an N-token response scale as O(N²). The cost is then amplified by Socket.IO's Redis pub/sub fan-out (× Redis nodes × workers) and again by the frontend Markdown parser, which re-parses the whole content string on each update.
This is visible in devtools → Network → WS frames on any long streaming response:
chat:completion frame sizes climb steadily (e.g. 3014 → 3037 → 3073 → 3092 → … bytes) as the response streams in.
Reproduction
dev branch backend and frontend.
events frame carrying type: "chat:completion". Note that the content field carries the entire response so far in every single frame, not just the delta.
Root cause
All primary offenders live in backend/open_webui/utils/middleware.py inside streaming_chat_response_handler / stream_body_handler.
serialize_output(): re-serializes the whole output list on every call
backend/open_webui/utils/middleware.py:404-453
full_output(): always cumulative
backend/open_webui/utils/middleware.py:3603-3604
Tool-call emit: full re-serialize on each tool-call delta
backend/open_webui/utils/middleware.py:3872-3879
Main text-delta emit: full re-serialize on each token
backend/open_webui/utils/middleware.py:4080-4106
delta_chunk_size only batches frequency, not payload size
backend/open_webui/utils/middleware.py:3645-3663
The existing delta_chunk_size / flush_pending_delta_data mechanism reduces how often emits are sent, but each emit still carries serialize_output(full_output()), i.e. the full blob. So increasing delta_chunk_size trades latency for bandwidth without fixing the underlying growth.
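To make the batching point concrete, here is a minimal sketch that counts payload bytes under the two emit strategies. It is not code from middleware.py: the function names, the fixed 4-byte token size, and the 2000-token response are illustrative assumptions, and the per-frame envelope overhead is ignored.

```python
def total_bytes_full_state(token_sizes, chunk_size=1):
    """Total payload bytes when every emit carries the full accumulated content.

    chunk_size mimics delta_chunk_size: one emit per `chunk_size` tokens,
    plus a final flush for any remainder.
    """
    total = 0
    accumulated = 0
    for i, size in enumerate(token_sizes, start=1):
        accumulated += size
        if i % chunk_size == 0 or i == len(token_sizes):
            total += accumulated  # the emit carries everything so far
    return total


def total_bytes_delta(token_sizes):
    """Total payload bytes when each emit carries only the new text."""
    return sum(token_sizes)


# Assumption for illustration: 2000 tokens of 4 bytes each.
tokens = [4] * 2000

full_k1 = total_bytes_full_state(tokens, chunk_size=1)
full_k10 = total_bytes_full_state(tokens, chunk_size=10)
delta = total_bytes_delta(tokens)
```

Under these assumptions full_k1 is 8,004,000 bytes, full_k10 is 804,000 bytes, and delta is 8,000 bytes. Batching divides the total by roughly the chunk size, but doubling the response length still roughly quadruples the full-state total.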
The emit sink
backend/open_webui/socket/main.py:814-828
When WEBSOCKET_MANAGER=redis, every emit goes through socketio.AsyncRedisManager and is published to Redis pub/sub, pickled/unpickled on every subscribing worker.
Frontend amplification
src/lib/components/chat/Chat.svelte:1711-1743. On every chat:completion, the full data.content string overwrites message.content, defeating any delta optimization that might exist upstream.
src/lib/components/chat/Messages/Markdown.svelte:73-94. The Markdown component re-parses the entire message.content string once per requestAnimationFrame, which is 20+ ms on large conversations.
Note: the frontend already has a working delta path at src/lib/components/chat/Chat.svelte:472-473 (chat:message:delta → message.content += data.content). The backend simply doesn't use it for the streaming hot path.
The damage equation
Per incoming SSE token, every one of these four layers pays for the growing blob:
Total bytes across the infrastructure for a single response:
Concrete example matching the orders of magnitude in observed traffic:
At 30 tok/s, the frontend main thread spends 600+ ms/sec re-parsing Markdown for content the user has already seen, which is why long-chat streaming feels janky on slower machines / mobile.
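As a rough closed form for the growth described in this section (a sketch under stated assumptions, not the issue's original equation): assume every token adds a constant b bytes to the serialized content, one emit is sent per K tokens (K = delta_chunk_size), and F is a hypothetical fan-out factor standing in for the Redis nodes × workers multiplier.

```latex
% Full-state emits: the j-th emit carries jKb bytes, for j = 1 .. N/K.
B_{\text{full}}(N) = F \sum_{j=1}^{N/K} jKb
                   = F\,b\,K\,\frac{(N/K)\,(N/K + 1)}{2}
                   \approx \frac{F\,b\,N^{2}}{2K}

% Delta emits: every byte is sent once.
B_{\text{delta}}(N) = F\,b\,N

% Frontend parse budget, assuming one full re-parse per token:
30\ \text{tokens/s} \times 20\ \text{ms/parse} = 600\ \text{ms of parsing per second}
```

The ratio between the two totals is about N / (2K), so the gap widens linearly with response length no matter how large K is.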
Acknowledgments
The root-cause analysis, the observation that Redis pub/sub amplifies the problem catastrophically, and the JSON-Patch-with-separated-blocks design all come from @ShirasawaSama. This issue writes up their findings so they can be tracked in the repository.
@Classic298 commented on GitHub (Apr 14, 2026):
PR #23735 MASSIVELY improves the current situation and will make things much better for ALL deployments, but it's not a full fix. It only gets about 90% of the way there.
PR #23735 only fixes the actual message content. Reasoning and tool calls are not yet covered by that PR, but it is already a massive improvement.
@tjbck commented on GitHub (Apr 17, 2026):
Thanks for the thorough analysis. The O(N²) characterization is correct and this is something I've been aware of. Before jumping to solutions I want to explain why the current implementation works the way it does, because the properties it provides are load-bearing and easy to take for granted until they break.
The current architecture is full-state-on-every-emit by design. Every Socket.IO frame carries the complete rendered content of the assistant message. This makes the frontend entirely stateless with respect to streaming: it receives a string, sets message.content, and hands it to the Markdown renderer. If a WebSocket frame is dropped, the connection flaps, the user switches tabs and comes back, or the browser GC causes a missed event, the very next frame self-corrects because it contains the complete truth. There is no accumulated client-side state that can drift out of sync with the backend.
The backend is also the sole rendering authority. serialize_output() produces the canonical HTML (with <details> blocks for reasoning, tool calls, code interpreter output), and the frontend is intentionally a dumb pipe to the DOM. This is the simplest possible contract between backend and frontend, and it has been extremely reliable across every provider, model configuration, and edge case encountered. The ENABLE_REALTIME_CHAT_SAVE path writes this same serialized content to the database on every token, which is why page refreshes during active streaming always show correct content.
The cost of this reliability is quadratic total bandwidth. The serialized content grows linearly with tokens emitted, and it emits once per token (or per delta_chunk_size tokens), so total bytes scale as O(N²/K). For a 2000-token response with reasoning and tool calls, that's roughly 10 MB of cumulative WebSocket traffic before Redis fan-out. That is real and I want to improve it.
The natural response is to switch to deltas: send only the new token text and have the frontend append it. This drops total bandwidth to O(N), which is obviously attractive. But it introduces problems beyond just lost frames. A user opening the same chat in a second browser tab while streaming is active would have no base content to apply deltas to, and would see nothing (or need a separate full-state request mechanism that does not currently exist). Stream filter functions that plugins can hook into on every event would only see an isolated token fragment instead of full content, breaking any filter that needs context across tokens such as content moderation, pattern redaction, or formatting transforms. When a user cancels mid-stream, the backend serializes the current output and saves it to the database, but the frontend's accumulated += string was never verified against serialize_output() and could silently diverge due to split unicode sequences, partially detected tags, or filter modifications. On mobile and unstable networks where Socket.IO reconnects are frequent, every micro-disconnect becomes a potential silent corruption event. And any third-party system consuming or logging Socket.IO events for monitoring, audit, or replay would need the entire ordered frame sequence to reconstruct state, rather than being able to inspect any individual frame independently.
There is also a structural limitation. During plain text streaming, the serialized output is indeed previous + new_token, so a trailing text delta works. But during reasoning content streaming, tool call argument streaming, and whenever tag detection restructures the output list mid-token, the HTML changes inside <details> blocks rather than at the end of the string. A trailing text append cannot express these updates. The backend already branches on these cases internally (inside_tag_block, reasoning_content, tag detection via len(output) changes), so it knows which situation it is in, but leaning on that distinction adds meaningful complexity to a streaming path that is already one of the hardest parts of the codebase to follow.
The lowest-risk immediate improvement is increasing the default delta_chunk_size. With the current default of 1 it emits on every single token. Raising it to something like 5 or 10 cuts total bandwidth proportionally with zero code changes and zero change to the reliability model. It is still O(N²) but with a meaningfully smaller constant, and for the majority of real-world responses this may bring cumulative traffic into acceptable range without touching the architecture.
Beyond that, I'm open to suggestions. The hard constraint is not regressing the properties above: stateless frontend on reconnect and tab-open, backend as sole rendering authority, correct content on page refresh, filter functions receiving full context, and each Socket.IO frame being independently meaningful. If someone sees an approach that meaningfully changes the growth curve without introducing fragile accumulated client-side state, I would genuinely like to hear it.
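The structural limitation described above can be illustrated with a small sketch: an update is a trailing append only when the new serialized string extends the previous one, and anything else needs a full frame. The function names and frame shapes here are hypothetical and not part of Open WebUI. Note that the append frames in this sketch are not independently meaningful, so on its own it does not satisfy the hard constraints listed above.

```python
def make_update(previous: str, current: str) -> dict:
    """Return an append delta when `current` extends `previous`, else a full frame."""
    if current.startswith(previous):
        return {"type": "append", "content": current[len(previous):]}
    # The change happened somewhere inside the string (e.g. inside a
    # <details> block), so a trailing append cannot express it.
    return {"type": "full", "content": current}


def apply_update(local: str, update: dict) -> str:
    """Client side: append deltas extend local state, full frames replace it."""
    if update["type"] == "append":
        return local + update["content"]
    return update["content"]


# Plain text streaming: the new content is previous + new_token.
prev = "<details>reasoning</details>\nHello"
plain = make_update(prev, prev + " world")

# Reasoning streaming: the text changes inside the <details> block.
restructured = make_update(
    "<details>thinking</details>",
    "<details>thinking harder</details>",
)
```

Here plain is an append frame carrying only " world", while restructured falls back to a full frame because the change is not at the end of the string.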
@ShirasawaSama commented on GitHub (Apr 17, 2026):
Yes, so one approach is to use JSON Patch and Redis to log each incremental update. That way, the client can retrieve previous data from Redis using the current version index.
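A minimal sketch of that idea, assuming an append-only patch log indexed by version. Everything here is hypothetical: PatchLog is an in-memory stand-in for whatever Redis structure would hold the log, and the two patch operations are a simplified, non-standard subset rather than real JSON Patch operations.

```python
import copy


class PatchLog:
    """In-memory stand-in for a Redis-backed, append-only patch log (hypothetical)."""

    def __init__(self):
        self._patches = []  # index i holds the patch that produces version i + 1

    def append(self, patch: dict) -> int:
        self._patches.append(patch)
        return len(self._patches)  # the new version number

    def since(self, version: int) -> list:
        """Patches a client at `version` still needs, in order."""
        return self._patches[version:]


def apply_patch(blocks: list, patch: dict) -> list:
    """Apply one patch: 'add' a new block, or 'append' text to an existing block."""
    blocks = copy.deepcopy(blocks)
    if patch["op"] == "add":
        blocks.append(patch["value"])
    elif patch["op"] == "append":
        blocks[patch["index"]]["content"] += patch["value"]
    else:
        raise ValueError(f"unsupported op: {patch['op']}")
    return blocks


def catch_up(blocks: list, version: int, log: PatchLog):
    """Replay every patch the client has not seen yet."""
    for patch in log.since(version):
        blocks = apply_patch(blocks, patch)
        version += 1
    return blocks, version


log = PatchLog()
log.append({"op": "add", "value": {"type": "reasoning", "content": ""}})
log.append({"op": "append", "index": 0, "value": "Let me think"})
log.append({"op": "add", "value": {"type": "message", "content": ""}})
latest = log.append({"op": "append", "index": 1, "value": "Hello"})

# A client that disconnected at version 2 and a freshly opened tab (version 0)
# both converge on the same state by replaying from their own version index.
stale_blocks = [{"type": "reasoning", "content": "Let me think"}]
resumed, resumed_version = catch_up(stale_blocks, 2, log)
fresh, fresh_version = catch_up([], 0, log)
```

Because blocks are patched separately, an update inside the reasoning block does not require resending the message block, and a reconnecting client only fetches the patches after its last known version.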
@rgaricano commented on GitHub (Apr 17, 2026):
Another approach that could solve this issue is to handle the chat as a collaborative Yjs document.
Instead of calling serialize_output(full_output()) on every token, the backend could:
The frontend would need to:
(This is similar to how the current collaboration provider works in Collaboration.ts)
Benefits of Yjs Approach:
Challenges of this implementation:
Document Structure Design
The current output array contains structured data with different types (message, reasoning, tool calls). This would need to be mapped to a Yjs document structure.
Backend Serialization
The serialize_output() function would need to work with Yjs documents instead of the output array, potentially requiring a new serialization path.
Filter Function Adaptation
Current filter functions receive the full form data. They would need to be adapted to work with Yjs document state.
Real-time Chat Save
The ENABLE_REALTIME_CHAT_SAVE feature saves serialized content to the database. This would need to work with Yjs document state.
A more pragmatic solution might be a hybrid approach:
This would preserve all existing reliability properties while dramatically reducing bandwidth usage.
Sample of the hybrid Yjs Implementation for Chat Streaming:
Backend Implementation
1. Yjs Document Initialization
2. Modified Stream Handler
3. On-demand Serialization for Filters
4. Database Save Integration
Frontend Implementation
1. Chat Message Yjs Handler
2. Integration with ResponseMessage
Backend Full State Support
@tkalevra commented on GitHub (Apr 24, 2026):
I used AI to write a diff to update the block. I was tempted to go further given the trajectory, but 1. I don't code and 2. I'm not trying to step on toes here. I appreciate the dedication and hard work.
I wrote this simply because of my personal use case: the system under 0.9.1 was not usable in its current state.
Limitations / Partial Fix:
data is not None guard to the non-delta flush branch to prevent emitting null payloads on the text path
ghcr.io/open-webui/open-webui:v0.9.1)
To revert:
docker cp /mnt/data/docker/open-webui-middleware-0.9.1.py.pre-delta-patch open-webui:/app/backend/open_webui/utils/middleware.py && docker restart open-webui
image: ghcr.io/open-webui/open-webui:v0.9.1
Credit: analysis and original PR by @Classic298 and @ShirasawaSama (#23735)
docker cp open-webui:/app/backend/open_webui/utils/middleware.py /mnt/data/docker/open-webui-middleware-0.9.1-backup.py
python3 owui_delta_patch.py --verify-only /mnt/data/docker/open-webui-middleware-0.9.1.py
OK: Target block found exactly once
python3 owui_delta_patch.py /mnt/data/docker/open-webui-middleware-0.9.1.py
docker cp /mnt/data/docker/open-webui-middleware-0.9.1.py open-webui:/app/backend/open_webui/utils/middleware.py
docker restart open-webui
@rgaricano commented on GitHub (Apr 25, 2026):
@ShirasawaSama @tkalevra @Classic298 @tjbck
PR: https://github.com/open-webui/open-webui/pull/24126 uses a Ydoc for message stream updates (reasoning block), as I mentioned before.
Draft for testing.