[GH-ISSUE #23733] issue/perf: Exponential growth of backend, frontend and network bandwidth usage with growing chat length #35585

GiteaMirror · 2026-04-25T09:46:07-05:00

GiteaMirror commented

2026-04-25 09:46:07 -05:00

Originally created by @Classic298 on GitHub (Apr 14, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23733

Socket.IO emits grow O(N²) during LLM streaming: full message is re-serialized on every token

Thanks to @ShirasawaSama !!

Simple TLDR:

Open WebUI re-sends and re-renders the whole message so far on every single token, so CPU, memory, bandwidth, and Redis usage grow quadratically as the conversation grows. It gets even worse with high concurrency and long chats.

Summary

During an LLM streaming response, the backend re-serializes the entire accumulated output (all prior text, reasoning blocks, tool calls, images, sources) into one HTML string and emits it via Socket.IO on every SSE event. As a response grows, each WebSocket frame grows with it, so total bytes on the wire for an N-token response scale as O(N²). The cost is then amplified by Socket.IO's Redis pub/sub fan-out (× Redis nodes × workers) and again by the frontend Markdown parser, which re-parses the whole content string on each update.

This is visible in devtools → Network → WS frames on any long streaming response: chat:completion frame sizes climb steadily (e.g. 3014 → 3037 → 3073 → 3092 → … bytes) as the response streams in.

Reproduction

1. Run the current dev branch backend and frontend.
2. Open any chat and send a prompt that produces a long response (e.g. "write a 2000-word essay about X").
3. Open browser devtools → Network → filter WS → click the Socket.IO frame → watch the "Messages" tab.
4. Observe each events frame carrying type: "chat:completion". Note that the content field contains the entire response so far in every frame, not just the delta.
5. For extra impact: enable a model with reasoning or tool calling. Every text token re-sends the reasoning block and every prior tool call.

Root cause

All primary offenders live in backend/open_webui/utils/middleware.py inside streaming_chat_response_handler / stream_body_handler.

`serialize_output()` — re-serializes the whole output list on every call

backend/open_webui/utils/middleware.py:404-453

def serialize_output(output: list) -> str:
    """
    Convert OR-aligned output items to HTML for display.
    For LLM consumption, use convert_output_to_messages() instead.
    """
    content = ''
    # ... loops EVERY item in the output list (text, function_call,
    # function_call_output, reasoning, ...) and concatenates them into one
    # HTML string, every time it's called.
    for idx, item in enumerate(output):
        ...

`full_output()` — always cumulative

backend/open_webui/utils/middleware.py:3603-3604

def full_output():
    return prior_output + output if prior_output else output

Tool-call emit — full re-serialize on each tool-call delta

backend/open_webui/utils/middleware.py:3872-3879

await event_emitter(
    {
        'type': 'chat:completion',
        'data': {
            'content': serialize_output(full_output() + pending_fc_items),
        },
    }
)

Main text-delta emit — full re-serialize on each token

backend/open_webui/utils/middleware.py:4080-4106

if ENABLE_REALTIME_CHAT_SAVE:
    await Chats.upsert_message_to_chat_by_id_and_message_id(
        metadata['chat_id'],
        metadata['message_id'],
        {
            'content': serialize_output(full_output()),
            'output': full_output(),
        },
    )
else:
    data = {
        'content': serialize_output(full_output()),
    }
 
if delta:
    delta_count += 1
    last_delta_data = data
    if delta_count >= delta_chunk_size:
        await flush_pending_delta_data(delta_chunk_size)

`delta_chunk_size` only batches frequency, not payload size

backend/open_webui/utils/middleware.py:3645-3663

The existing delta_chunk_size / flush_pending_delta_data mechanism
reduces how often emits are sent, but each emit still carries
serialize_output(full_output()) — i.e. the full blob. So increasing
delta_chunk_size trades latency for bandwidth without fixing the
underlying growth.
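
To make that trade-off concrete, here is a minimal back-of-the-envelope sketch. It assumes every token adds a fixed number of bytes to the serialized content (`bytes_per_token` is a made-up figure) and ignores framing overhead. Both helper functions are hypothetical and not part of Open WebUI.

```python
def total_emitted_bytes(n_tokens: int, chunk_size: int, bytes_per_token: int = 25) -> int:
    """Total payload bytes when each emit carries the full content so far.

    An emit happens after every `chunk_size` tokens, plus one final flush
    for any leftover tokens.
    """
    total = 0
    for emitted_tokens in range(chunk_size, n_tokens + 1, chunk_size):
        total += emitted_tokens * bytes_per_token
    if n_tokens % chunk_size:
        total += n_tokens * bytes_per_token  # final flush of the remainder
    return total


def total_delta_bytes(n_tokens: int, bytes_per_token: int = 25) -> int:
    """Total payload bytes if each emit carried only the new tokens."""
    return n_tokens * bytes_per_token


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        per_token = total_emitted_bytes(n, chunk_size=1)
        batched = total_emitted_bytes(n, chunk_size=10)
        print(n, per_token, batched, total_delta_bytes(n))
```

Under these assumptions, batching by 10 cuts the total roughly tenfold, but doubling the response length still roughly quadruples it.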

The emit sink

backend/open_webui/socket/main.py:814-828

async def get_event_emitter(request_info, update_db=True):
    async def __event_emitter__(event_data):
        ...
        await sio.emit(
            'events',
            {
                'chat_id': chat_id,
                'message_id': message_id,
                'data': event_data,
            },
            room=f'user:{user_id}',
        )

When WEBSOCKET_MANAGER=redis, every emit goes through socketio.AsyncRedisManager and is published to Redis pub/sub, pickled/unpickled on every subscribing worker.

Frontend amplification

src/lib/components/chat/Chat.svelte:1711-1743: on every chat:completion, the full data.content string overwrites message.content, defeating any delta optimization that might exist upstream.
src/lib/components/chat/Messages/Markdown.svelte:73-94: the Markdown component re-parses the entire message.content string once per requestAnimationFrame, which is 20+ ms on large conversations.

Note: the frontend already has a working delta path at src/lib/components/chat/Chat.svelte:472-473 (chat:message:delta → message.content += data.content). The backend simply doesn't use it for the streaming hot path.

The damage equation

Per incoming SSE token, every one of these four layers pays for the growing blob:

LLM token arrives
  │
  ├─ [BACKEND CPU]    serialize_output(full_output())        → O(size_so_far)
  ├─ [REDIS BUS]      AsyncRedisManager publish + subscribe  → × nodes × workers
  ├─ [WS WIRE]        full string to every connected client  → O(size_so_far)
  └─ [FRONTEND CPU]   Markdown re-parse of full content      → 20+ ms per token

Total bytes across the infrastructure for a single response:

total_bytes  ≈  Σsᵢ  ×  redis_cluster_nodes  ×  owui_worker_count  ×  concurrent_streams
             ≈  O(N²) amplified by the fan-out factor

Concrete example matching the orders of magnitude in observed traffic:

1 response, ~2000 tokens, per-emit size growing from a few KB to ~50 KB
Σsᵢ ≈ 50 MB of WS payload from a single worker
6-node Redis cluster × 4 workers × 100 concurrent chats
≈ 120 GB of infrastructure traffic to deliver ~10 MB of actual new tokens — roughly 10,000× amplification.
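
The figures in this example can be re-derived with a few lines of arithmetic. This is only a restatement of the numbers quoted above, assuming decimal units (1 MB = 10^6 bytes); the variable names are illustrative.

```python
MB = 1_000_000
GB = 1_000 * MB

ws_payload_per_response = 50 * MB  # sum of all emit sizes for one ~2000-token response
redis_nodes = 6
workers = 4
concurrent_chats = 100
useful_bytes = 10 * MB             # actual new tokens delivered across all chats

total = ws_payload_per_response * redis_nodes * workers * concurrent_chats
amplification = total / useful_bytes

# Frontend cost quoted in the issue: 30 tokens/s at 20 ms of Markdown parsing each.
parse_ms_per_second = 30 * 20

print(f"{total / GB:.0f} GB total, {amplification:,.0f}x amplification")
print(f"{parse_ms_per_second} ms of parsing per second of streaming")
```

The exact ratio under these inputs is 12,000×, which is the "roughly 10,000×" order of magnitude quoted above.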

At 30 tok/s, the frontend main thread spends 600+ ms/sec re-parsing Markdown for content the user has already seen, which is why long-chat streaming feels janky on slower machines / mobile.

Acknowledgments

The root-cause analysis, the observation that Redis pub/sub amplifies the problem catastrophically, and the JSON-Patch-with-separated-blocks design all come from @ShirasawaSama. This issue writes up their findings so they can be tracked in the repository.


GiteaMirror commented

2026-04-25 09:46:08 -05:00

@Classic298 commented on GitHub (Apr 14, 2026):

PR #23735 MASSIVELY improves the current state and will make things much better for ALL deployments, but it's not a full fix. It only gets about 90% of the way there.

PR #23735 only fixes the problem for the actual message content. Reasoning and tool calls are not yet fixed by that PR, but it is still a massive improvement already.


GiteaMirror commented

2026-04-25 09:46:09 -05:00

@tjbck commented on GitHub (Apr 17, 2026):

Thanks for the thorough analysis. The O(N²) characterization is correct and this is something I've been aware of. Before jumping to solutions I want to explain why the current implementation works the way it does, because the properties it provides are load-bearing and easy to take for granted until they break.

The current architecture is full-state-on-every-emit by design. Every Socket.IO frame carries the complete rendered content of the assistant message. This makes the frontend entirely stateless with respect to streaming: it receives a string, sets message.content, and hands it to the Markdown renderer. If a WebSocket frame is dropped, the connection flaps, the user switches tabs and comes back, or the browser GC causes a missed event, the very next frame self-corrects because it contains the complete truth. There is no accumulated client-side state that can drift out of sync with the backend.

The backend is also the sole rendering authority. serialize_output() produces the canonical HTML (with <details> blocks for reasoning, tool calls, code interpreter output), and the frontend is intentionally a dumb pipe to the DOM. This is the simplest possible contract between backend and frontend, and it has been extremely reliable across every provider, model configuration, and edge case encountered. The ENABLE_REALTIME_CHAT_SAVE path writes this same serialized content to the database on every token, which is why page refreshes during active streaming always show correct content.

The cost of this reliability is quadratic total bandwidth. The serialized content grows linearly with tokens emitted, and it emits once per token (or per delta_chunk_size tokens), so total bytes scale as O(N²/K). For a 2000-token response with reasoning and tool calls, that's roughly 10 MB of cumulative WebSocket traffic before Redis fan-out. That is real and I want to improve it.

The natural response is to switch to deltas: send only the new token text and have the frontend append it. This drops total bandwidth to O(N), which is obviously attractive. But it introduces problems beyond just lost frames. A user opening the same chat in a second browser tab while streaming is active would have no base content to apply deltas to, and would see nothing (or need a separate full-state request mechanism that does not currently exist). Stream filter functions that plugins can hook into on every event would only see an isolated token fragment instead of full content, breaking any filter that needs context across tokens such as content moderation, pattern redaction, or formatting transforms. When a user cancels mid-stream, the backend serializes the current output and saves it to the database, but the frontend's accumulated += string was never verified against serialize_output() and could silently diverge due to split unicode sequences, partially detected tags, or filter modifications. On mobile and unstable networks where Socket.IO reconnects are frequent, every micro-disconnect becomes a potential silent corruption event. And any third-party system consuming or logging Socket.IO events for monitoring, audit, or replay would need the entire ordered frame sequence to reconstruct state, rather than being able to inspect any individual frame independently.

There is also a structural limitation. During plain text streaming, the serialized output is indeed previous + new_token, so a trailing text delta works. But during reasoning content streaming, tool call argument streaming, and whenever tag detection restructures the output list mid-token, the HTML changes inside <details> blocks rather than at the end of the string. A trailing text append cannot express these updates. The backend already branches on these cases internally (inside_tag_block, reasoning_content, tag detection via len(output) changes), so it knows which situation it is in, but leaning on that distinction adds meaningful complexity to a streaming path that is already one of the hardest parts of the codebase to follow.
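
A small sketch of that limitation, assuming simplified HTML shapes (the strings below are illustrative, not the exact output of serialize_output()): a trailing delta exists only when the new frame starts with the previous one. `as_trailing_delta` is a hypothetical helper.

```python
from typing import Optional


def as_trailing_delta(previous: str, current: str) -> Optional[str]:
    """Return the appended suffix if `current` extends `previous`, else None."""
    if current.startswith(previous):
        return current[len(previous):]
    return None


if __name__ == "__main__":
    # Plain text streaming: the new frame is the old one plus a token.
    print(as_trailing_delta('Hello', 'Hello world'))

    # Reasoning streaming: the new token lands before the closing tag,
    # so the old frame is no longer a prefix of the new one.
    before = '<details type="reasoning">step 1</details>'
    after = '<details type="reasoning">step 1 step 2</details>'
    print(as_trailing_delta(before, after))
```

The second case returns None, so any delta scheme would need an operation richer than "append to the end" to cover it.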

The lowest-risk immediate improvement is increasing the default delta_chunk_size. With the current default of 1 it emits on every single token. Raising it to something like 5 or 10 cuts total bandwidth proportionally with zero code changes and zero change to the reliability model. It is still O(N²) but with a meaningfully smaller constant, and for the majority of real-world responses this may bring cumulative traffic into acceptable range without touching the architecture.

Beyond that, I'm open to suggestions. The hard constraint is not regressing the properties above: stateless frontend on reconnect and tab-open, backend as sole rendering authority, correct content on page refresh, filter functions receiving full context, and each Socket.IO frame being independently meaningful. If someone sees an approach that meaningfully changes the growth curve without introducing fragile accumulated client-side state, I would genuinely like to hear it.


GiteaMirror commented

2026-04-25 09:46:09 -05:00

@ShirasawaSama commented on GitHub (Apr 17, 2026):

Yes, so one approach is to use JSON Patch and Redis to log each incremental update. That way, the client can retrieve previous data from Redis using the current version index.
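
A minimal sketch of that idea, under loud assumptions: a plain Python list stands in for Redis, and a made-up "append" operation stands in for real JSON Patch operations. `PatchLog` and `apply_patches` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PatchLog:
    """Append-only log of incremental updates for one streaming message.

    A plain list stands in for the Redis structure here.
    """
    patches: list = field(default_factory=list)

    def append(self, patch: dict) -> int:
        """Store a patch and return the version it produces (1-based)."""
        self.patches.append(patch)
        return len(self.patches)

    def since(self, version: int) -> list:
        """Return every patch after `version` (0 means from the start)."""
        return self.patches[version:]


def apply_patches(content: str, patches: list) -> str:
    """Apply text-append patches in order; other ops are out of scope here."""
    for patch in patches:
        if patch["op"] != "append":
            raise ValueError(f"unsupported op: {patch['op']}")
        content += patch["value"]
    return content


if __name__ == "__main__":
    log = PatchLog()
    for token in ["Hel", "lo", ", ", "world"]:
        log.append({"op": "append", "value": token})

    # A client that saw versions 1-2 reconnects and asks for the rest.
    client_content, client_version = "Hello", 2
    client_content = apply_patches(client_content, log.since(client_version))
    print(client_content)
```

A client joining mid-stream would request everything since version 0, while a reconnecting client only fetches the patches it missed.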


GiteaMirror commented

2026-04-25 09:46:10 -05:00

@rgaricano commented on GitHub (Apr 17, 2026):

Another approach that could potentially solve this issue is to handle the chat as a collaborative Yjs document.

Instead of calling serialize_output(full_output()) on every token, the backend could:

Initialize a Yjs document for each chat message when streaming begins.
Apply incremental updates to the document as tokens arrive.
Emit Yjs updates instead of full serialized content.

The frontend would need to:

Initialize Yjs document when streaming starts.
Apply Yjs updates as they arrive via Socket.IO.
Render from Yjs document instead of replacing content.
(This is similar to how the current collaboration provider works in Collaboration.ts)

Benefits of Yjs Approach:

O(N) Bandwidth: Yjs uses efficient delta compression.
Stateless Frontend: Documents can be reconstructed from any state.
Automatic Reconciliation: Handles dropped frames and reconnections.
Multi-tab Support: Natural synchronization across browser tabs.
Filter Function Integration: Filters can operate on Yjs document state.

Challenges of this implementation:

1. Document Structure Design: The current output array contains structured data with different types (message, reasoning, tool calls). This would need to be mapped to a Yjs document structure.
2. Backend Serialization: The serialize_output() function would need to work with Yjs documents instead of the output array, potentially requiring a new serialization path.
3. Filter Function Adaptation: Current filter functions receive the full form data. They would need to be adapted to work with Yjs document state.
4. Real-time Chat Save: The ENABLE_REALTIME_CHAT_SAVE feature saves serialized content to the database. This would need to work with Yjs document state.

A more pragmatic solution might be a hybrid approach:

Use Yjs for streaming to get O(N) bandwidth
Maintain current serialization for database storage and filter functions
Serialize on-demand when needed (database saves, filter execution)

This would preserve all existing reliability properties while dramatically reducing bandwidth usage.

Sample of the hybrid Yjs Implementation for Chat Streaming:

Backend Implementation

1. Yjs Document Initialization

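# backend/open_webui/utils/middleware.py
# NOTE: `yjs` / `Y` below are placeholder names for a Python Yjs binding.
# Existing bindings (e.g. y-py, pycrdt) expose different APIs, so treat
# the calls in this sample as pseudocode.
import yjs
from yjs import Y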

async def streaming_chat_response_handler(response, ctx):
    # ... existing code ...

    # Initialize Yjs document for this message
    yjs_doc = Y.Doc()
    yjs_output = yjs_doc.get_array('output')

    # Copy existing output to Yjs document if continuing a message
    if existing_output:
        yjs_output.push(existing_output)

    # ... rest of initialization ...

2. Modified Stream Handler

async def stream_body_handler(response, form_data):
    nonlocal content, usage, output, prior_output, last_response_id, yjs_doc, yjs_output

    # ... existing token processing logic ...

    # Instead of modifying output array directly, update Yjs document
    if reasoning_content:
        if not yjs_output.length or yjs_output[-1].get('type') != 'reasoning':
            reasoning_item = {
                'type': 'reasoning',
                'id': output_id('r'),
                'status': 'in_progress',
                'start_tag': '<think>',
                'end_tag': '</think>',
                'attributes': {'type': 'reasoning_content'},
                'content': [],
                'summary': None,
                'started_at': time.time(),
            }
            yjs_output.push([reasoning_item])
        else:
            # Update existing reasoning item
            reasoning_item = dict(yjs_output[-1])
            parts = reasoning_item.get('content', [])
            if parts and parts[-1].get('type') == 'output_text':
                parts[-1]['text'] += reasoning_content
            else:
                parts.append({'type': 'output_text', 'text': reasoning_content})
            yjs_output.delete(yjs_output.length - 1, 1)
            yjs_output.insert(yjs_output.length, [reasoning_item])

    # Similar pattern for regular message content
    if value:
        # ... update yjs_output instead of output array ...
        pass

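    # Emit Yjs update instead of full serialization.
    # NOTE: encoding the whole document state on every token still grows
    # with the message; O(N) bandwidth needs only the update since the
    # last emit (for example, a diff against the previous state vector).
    yjs_update = Y.encode_state_as_update(yjs_doc)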
    await event_emitter({
        'type': 'chat:completion',
        'data': {
            'yjs_update': yjs_update.hex(),
            'message_id': metadata['message_id']
        }
    })

3. On-demand Serialization for Filters

# When filter functions need to process content
def get_full_output_from_yjs():
    """Convert Yjs document to output array format"""
    return list(yjs_output)

# Modified filter processing
data, _ = await process_filter_functions(
    request=request,
    filter_functions=filter_functions,
    filter_type='stream',
    form_data={
        **data,
        'full_output': get_full_output_from_yjs()  # Provide full context
    },
    extra_params={'__body__': form_data, **extra_params},
)

4. Database Save Integration

if ENABLE_REALTIME_CHAT_SAVE:
    # Serialize Yjs document for database storage
    full_output = get_full_output_from_yjs()
    await Chats.upsert_message_to_chat_by_id_and_message_id(
        metadata['chat_id'],
        metadata['message_id'],
        {
            'content': serialize_output(full_output),
            'output': full_output,
        },
    )

Frontend Implementation

1. Chat Message Yjs Handler

// src/lib/components/chat/ChatMessageYjs.svelte
import { Doc, applyUpdate } from 'yjs'
import { Socket } from 'socket.io-client'

export class ChatMessageYjsHandler {
    private doc: Doc
    private socket: Socket
    private messageId: string

    constructor(messageId: string, socket: Socket) {
        this.messageId = messageId
        this.socket = socket
        this.doc = new Doc()

        this.setupEventListeners()
    }

    private setupEventListeners() {
        this.socket.on('chat:completion', (data) => {
            if (data.message_id === this.messageId && data.yjs_update) {
                // Decode the hex-encoded Yjs update into bytes
                const update = new Uint8Array(
                    data.yjs_update.match(/.{1,2}/g)?.map(byte =>
                        parseInt(byte, 16)
                    ) || []
                )
                // applyUpdate is a module-level function in yjs, not a static on Doc
                applyUpdate(this.doc, update)

                // Trigger re-render
                this.renderContent()
            }
        })
    }

    private renderContent() {
        const output = this.doc.getArray('output').toArray()
        const html = serializeOutput(output) // Use existing serialization
        // Update DOM
    }

    // For new tabs joining mid-stream
    public requestFullState() {
        this.socket.emit('chat:message:full_state', {
            message_id: this.messageId
        })
    }
}

2. Integration with ResponseMessage

// src/lib/components/chat/ResponseMessage.svelte
import { ChatMessageYjsHandler } from './ChatMessageYjs.svelte'

let yjsHandler: ChatMessageYjsHandler

onMount(() => {
    if (message.streaming) {
        yjsHandler = new ChatMessageYjsHandler(message.id, socket)
    }
})

// Handle tab reconnection
onMount(() => {
    if (message.streaming && !message.content) {
        // Request full state if joining mid-stream
        yjsHandler?.requestFullState()
    }
})

Backend Full State Support

# backend/open_webui/socket/main.py
@sio.on('chat:message:full_state')
async def handle_full_state_request(sid, data):
    message_id = data.get('message_id')

    # Retrieve current Yjs document state
    # This would need to be stored temporarily during streaming
    yjs_state = get_yjs_document_state(message_id)

    if yjs_state:
        await sio.emit('chat:completion', {
            'message_id': message_id,
            'yjs_update': yjs_state,
            'is_full_state': True
        }, room=sid)
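The handler above relies on get_yjs_document_state(), which the sample does not define. A minimal sketch of what it could look like, assuming a process-local dictionary keyed by message ID; the store and the set/clear helpers are hypothetical and only illustrate the "stored temporarily during streaming" comment:

```python
from typing import Dict, Optional

# Hypothetical process-local registry of in-flight message states.
# Not part of Open WebUI; a multi-worker deployment would need a shared store.
_streaming_states: Dict[str, bytes] = {}


def set_yjs_document_state(message_id: str, state: bytes) -> None:
    """Record the latest encoded document state while the message streams."""
    _streaming_states[message_id] = state


def get_yjs_document_state(message_id: str) -> Optional[str]:
    """Return the state hex-encoded (matching 'yjs_update'), or None if unknown."""
    state = _streaming_states.get(message_id)
    return state.hex() if state is not None else None


def clear_yjs_document_state(message_id: str) -> None:
    """Drop the state once streaming finishes so memory does not grow unbounded."""
    _streaming_states.pop(message_id, None)


set_yjs_document_state("msg-1", b"\x01\xff")
print(get_yjs_document_state("msg-1"))   # 01ff
clear_yjs_document_state("msg-1")
print(get_yjs_document_state("msg-1"))   # None
```

Because the dictionary lives in one process, a request handled by a different worker would not see it. That matters here, since the issue describes deployments with several workers behind Redis.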


GiteaMirror commented

2026-04-25 09:46:10 -05:00

@tkalevra commented on GitHub (Apr 24, 2026):

I used AI to write a patch that updates the block. I was tempted to take it further given the trajectory, but (1) I don't code and (2) I'm not trying to step on toes here. I appreciate the dedication and hard work.

I wrote this simply because of my personal use case: under 0.9.1 the system was not usable in its current state.

Limitations / Partial Fix:

This only addresses the plain text token path (ENABLE_REALTIME_CHAT_SAVE=False hot path)
Tool call argument streaming and reasoning block streaming still use full serialization — CPU will still spike during heavy tool use
Adds a data is not None guard to the non-delta flush branch to prevent emitting null payloads on the text path
Will not survive a container recreation / image pull — reapply after upgrading
Tested on v0.9.1 standard Docker image only (ghcr.io/open-webui/open-webui:v0.9.1)
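To put the plain-text path in perspective, here is a rough sketch of the payload arithmetic behind it. It counts only the text bytes per event and ignores JSON framing, HTML serialization and Socket.IO overhead; the token list is made up for illustration.

```python
from typing import List


def cumulative_bytes(tokens: List[str]) -> int:
    """Total bytes sent when every event carries the whole text so far."""
    total = 0
    so_far = 0
    for tok in tokens:
        so_far += len(tok.encode("utf-8"))
        total += so_far
    return total


def delta_bytes(tokens: List[str]) -> int:
    """Total bytes sent when every event carries only the new token."""
    return sum(len(tok.encode("utf-8")) for tok in tokens)


# 2000 tokens of 5 bytes each, roughly a long essay-style response.
tokens = ["word "] * 2000

print(cumulative_bytes(tokens))  # 10005000
print(delta_bytes(tokens))       # 10000
```

For N equal-sized tokens the cumulative total is proportional to N(N+1)/2, while the delta total is proportional to N. In this example the gap is about a factor of 1000.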

To revert:
docker cp /mnt/data/docker/open-webui-middleware-0.9.1.py.pre-delta-patch open-webui:/app/backend/open_webui/utils/middleware.py && docker restart open-webui

This only works for standard Docker containers, e.g. image: ghcr.io/open-webui/open-webui:v0.9.1

Credit: analysis and original PR by @Classic298 and @ShirasawaSama (#23735)

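Copy the offending file out of your live container, then make a working copy for the patch to modify (the commands further down operate on open-webui-middleware-0.9.1.py):
docker cp open-webui:/app/backend/open_webui/utils/middleware.py /mnt/data/docker/open-webui-middleware-0.9.1-backup.py
cp /mnt/data/docker/open-webui-middleware-0.9.1-backup.py /mnt/data/docker/open-webui-middleware-0.9.1.py
Save the patch code as owui_delta_patch.py: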

"""
Patch for OpenWebUI 0.9.1 middleware.py
Fixes O(N^2) CPU/bandwidth growth during streaming (issue #23733)

Targets ONLY the per-text-token emit in stream_body_handler's
ENABLE_REALTIME_CHAT_SAVE=False path. All other serialize_output
calls (tool calls, reasoning blocks, final emissions) are untouched.

Usage:
    python3 owui_delta_patch.py <path_to_middleware.py>
    python3 owui_delta_patch.py --verify-only <path_to_middleware.py>
"""

import sys
import shutil
from pathlib import Path

# The exact block to replace — indentation must match the source file exactly.
# This is the else-branch of ENABLE_REALTIME_CHAT_SAVE inside stream_body_handler.
OLD = '''\
                                        if ENABLE_REALTIME_CHAT_SAVE:
                                            # Save message in the database
                                            await Chats.upsert_message_to_chat_by_id_and_message_id(
                                                metadata['chat_id'],
                                                metadata['message_id'],
                                                {
                                                    'content': serialize_output(full_output()),
                                                    'output': full_output(),
                                                },
                                            )
                                        else:
                                            data = {
                                                'content': serialize_output(full_output()),
                                            }

                                if delta:
                                    delta_count += 1
                                    last_delta_data = data
                                    if delta_count >= delta_chunk_size:
                                        await flush_pending_delta_data(delta_chunk_size)
                                else:
                                    await event_emitter(
                                        {
                                            'type': 'chat:completion',
                                            'data': data,
                                        }
                                    )'''

NEW = '''\
                                        if ENABLE_REALTIME_CHAT_SAVE:
                                            # Save message in the database
                                            await Chats.upsert_message_to_chat_by_id_and_message_id(
                                                metadata['chat_id'],
                                                metadata['message_id'],
                                                {
                                                    'content': serialize_output(full_output()),
                                                    'output': full_output(),
                                                },
                                            )
                                        else:
                                            # Emit only the new token delta instead of
                                            # re-serializing the entire accumulated output
                                            # on every SSE event (fixes O(N^2) CPU/bandwidth
                                            # growth). The frontend chat:message:delta path
                                            # appends value to message.content directly.
                                            # See: https://github.com/open-webui/open-webui/issues/23733
                                            await event_emitter(
                                                {
                                                    'type': 'chat:message:delta',
                                                    'data': {
                                                        'content': value,
                                                    },
                                                }
                                            )
                                            data = None

                                if delta:
                                    delta_count += 1
                                    last_delta_data = data
                                    if delta_count >= delta_chunk_size:
                                        await flush_pending_delta_data(delta_chunk_size)
                                else:
                                    if data is not None:
                                        await event_emitter(
                                            {
                                                'type': 'chat:completion',
                                                'data': data,
                                            }
                                        )'''

def main():
    verify_only = '--verify-only' in sys.argv
    args = [a for a in sys.argv[1:] if not a.startswith('--')]

    if not args:
        print("Usage: python3 owui_delta_patch.py [--verify-only] <path_to_middleware.py>")
        sys.exit(1)

    path = Path(args[0])
    if not path.exists():
        print(f"ERROR: File not found: {path}")
        sys.exit(1)

    content = path.read_text(encoding='utf-8')

    count = content.count(OLD)
    if count == 0:
        print("ERROR: Target block not found. File may already be patched or has changed.")
        print("       Verify the indentation matches exactly.")
        sys.exit(1)
    if count > 1:
        print(f"ERROR: Target block found {count} times — ambiguous, aborting.")
        sys.exit(1)

    print(f"OK: Target block found exactly once at character offset {content.index(OLD)}")

    if verify_only:
        print("--verify-only: no changes written.")
        sys.exit(0)

    # Backup
    backup = path.with_suffix('.py.pre-delta-patch')
    shutil.copy2(path, backup)
    print(f"Backup written: {backup}")

    patched = content.replace(OLD, NEW, 1)

    # Sanity: confirm replacement happened and OLD is gone
    assert patched.count(OLD) == 0, "OLD block still present after replace — aborting"
    assert 'chat:message:delta' in patched, "NEW block not found after replace — aborting"

    path.write_text(patched, encoding='utf-8')
    print(f"Patched: {path}")
    print("Done. Restart the open-webui container to apply.")

if __name__ == '__main__':
    main()

Test it first:
python3 owui_delta_patch.py --verify-only /mnt/data/docker/open-webui-middleware-0.9.1.py
Confirm that the output begins with:
OK: Target block found exactly once
Apply the patch:
python3 owui_delta_patch.py /mnt/data/docker/open-webui-middleware-0.9.1.py
Copy the patched file back in:
docker cp /mnt/data/docker/open-webui-middleware-0.9.1.py open-webui:/app/backend/open_webui/utils/middleware.py
Restart the container:
docker restart open-webui
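As a rough model of why the patched backend still produces the same message, the sketch below treats the two event types the way the patch comment describes them: chat:message:delta appends to the content, and chat:completion replaces it. The apply_event function is hypothetical and is not the actual frontend code.

```python
from typing import Dict, List


def apply_event(content: str, event: Dict) -> str:
    """Fold one socket event into the message content held by the client."""
    kind, data = event["type"], event["data"]
    if kind == "chat:message:delta":
        # Delta path: append only the new token.
        return content + data["content"]
    if kind == "chat:completion" and "content" in data:
        # Full path: the payload is the whole message so far.
        return data["content"]
    return content


def replay(events: List[Dict]) -> str:
    content = ""
    for event in events:
        content = apply_event(content, event)
    return content


tokens = ["Hello", ", ", "world", "!"]

# Unpatched behaviour: every event carries the whole text so far.
full_events = [
    {"type": "chat:completion", "data": {"content": "".join(tokens[: i + 1])}}
    for i in range(len(tokens))
]
# Patched behaviour: every event carries only the new token.
delta_events = [
    {"type": "chat:message:delta", "data": {"content": tok}} for tok in tokens
]

print(replay(full_events) == replay(delta_events))  # True
```

In this model the delta stream only matches if the client receives every event in order. A client that misses one stays out of sync until a full chat:completion arrives.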


GiteaMirror commented

2026-04-25 09:46:11 -05:00

@rgaricano commented on GitHub (Apr 25, 2026):

@ShirasawaSama @tkalevra @Classic298 @tjbck
PR: https://github.com/open-webui/open-webui/pull/24126 uses a Y.Doc for message stream updates (reasoning block), as I mentioned before.
It is a draft for testing.



Reference: github-starred/open-webui#35585
