[GH-ISSUE #23418] feat: Performance Bottleneck - 3-Minute Latency with 400K+ Token Context in Roleplay Scenarios #35506

Closed
opened 2026-04-25 09:43:03 -05:00 by GiteaMirror · 2 comments

Originally created by @a86582751 on GitHub (Apr 5, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23418

Performance Bottleneck: 3-Minute Latency with 400K+ Token Context in Roleplay Scenarios

Bug Description

When using Open WebUI for long-form roleplay conversations (400K+ tokens, ~36MB of chat data), response latency balloons to 2-3 minutes even though the model itself remains responsive.

Key Observation: The delay occurs before the request reaches the LLM backend.

Environment

  • Open WebUI Version: v0.8.12 (Docker)
  • Database: SQLite (181MB webui.db)
  • Deployment: Docker with Nginx reverse proxy
  • Backend: CLI Proxy API (CPA) via host.docker.internal:8317
  • Model: Claude/Gemini via external API

Steps to Reproduce

  1. Create a new chat and engage in long-form roleplay
  2. Accumulate ~400K tokens (~1248 messages, 36MB JSON)
  3. Send a simple message like "继续" ("continue")
  4. Observe the timeline:
    • Browser shows request pending: ~3 minutes
    • CPA receives request: ~51s (model processing)
    • Response completes: ~16s (generation time)

Root Cause Analysis

Current Architecture (Problematic)

```sql
-- Open WebUI stores the entire chat as a single JSON blob
CREATE TABLE chat (
    id TEXT PRIMARY KEY,
    chat JSON  -- 36MB of nested JSON
);
```

Processing Flow per Message:

  1. `SELECT chat FROM chat WHERE id = ?` → Read 36MB (41ms)
  2. `json.loads()` → Parse to Python dict (5-10s)
  3. Pydantic validation → (10-20s)
  4. Append new message to 1248-item array (1s)
  5. `json.dumps()` → Serialize (5-10s)
  6. `UPDATE chat SET chat = ?` → Write 37MB (10s)
  7. Send to LLM API (16s)

Total: 2-3 minutes, with 80%+ of the time spent on JSON serialization (a rough timing sketch follows below)
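
To sanity-check where the time goes, a rough sketch like the one below can be run against an exported chat. The `chat.json` filename is hypothetical (a dump of the 36MB blob); the Pydantic and DB steps are omitted since they depend on Open WebUI internals.

```python
import json
import time

# Minimal timing sketch: measure the parse/serialize cost of a large chat blob.
# Assumes "chat.json" holds an exported copy of the ~36MB chat JSON.
with open("chat.json", "rb") as f:
    raw = f.read()

t0 = time.perf_counter()
data = json.loads(raw)    # step 2: bytes -> Python dict
t1 = time.perf_counter()
blob = json.dumps(data)   # step 5: dict -> JSON string
t2 = time.perf_counter()

print(f"loads: {t1 - t0:.2f}s  dumps: {t2 - t1:.2f}s  size: {len(raw) / 1e6:.1f}MB")
```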

Comparison with Efficient Architectures

ChatGPT/Claude approach:

  • Messages stored as individual rows
  • Pagination: Load only recent 20-50 messages
  • Incremental updates: O(1) complexity
  • Result: <1s perceived latency

Cherry Studio approach:

  • IndexedDB with Dexie.js
  • Block-based message storage
  • Lazy loading with virtual scroll
  • Result: Smooth performance even with long context

Benchmark Data

| Metric | Open WebUI | ChatGPT/Claude |
|--------|------------|----------------|
| Storage format | Single 36MB JSON | Row-based messages |
| Update complexity | O(n) - rewrite all | O(1) - append only |
| Query pattern | Full load | Pagination |
| 400K token latency | 2-3 minutes | <3 seconds |

Proposed Solutions

Option 1: Message Table Normalization (Recommended)

Split the monolithic JSON into normalized tables:

```sql
-- New schema
CREATE TABLE topics (
    id TEXT PRIMARY KEY,
    title TEXT,
    model TEXT,
    created_at TIMESTAMP
);

CREATE TABLE messages (
    id TEXT PRIMARY KEY,
    topic_id TEXT,
    role TEXT,
    content TEXT,
    created_at TIMESTAMP
);

-- SQLite does not support inline INDEX clauses inside CREATE TABLE;
-- the index needs its own statement
CREATE INDEX idx_topic_time ON messages (topic_id, created_at);
```

Benefits:

  • Incremental updates: Only insert new row
  • Pagination support: `LIMIT 50 OFFSET ?`
  • Backward compatible: Keep `chat` table as fallback (a migration sketch follows below)
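
A one-shot migration could copy messages out of the existing blob into the new table. This is only a sketch: the real layout of the chat JSON differs by version, so the top-level `messages` list and the role/content/timestamp field names below are assumptions, not Open WebUI's actual schema.

```python
import json
import sqlite3
import uuid

# Hypothetical migration from the chat JSON blob to the proposed messages table.
# The blob layout (a top-level "messages" list of dicts) is an assumption.
db = sqlite3.connect("webui.db")
rows = db.execute("SELECT id, chat FROM chat").fetchall()
for chat_id, blob in rows:
    for m in json.loads(blob).get("messages", []):
        db.execute(
            "INSERT OR IGNORE INTO messages (id, topic_id, role, content, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                m.get("id") or str(uuid.uuid4()),  # mint an id if the message has none
                chat_id,
                m.get("role"),
                m.get("content"),
                m.get("timestamp"),
            ),
        )
db.commit()
```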

Option 2: Lazy Loading API

Add a new endpoint that returns paginated messages:

```python
@app.get("/api/v1/chats/{id}/messages")
def get_messages(id: str, offset: int = 0, limit: int = 20):
    # Load only a window of recent messages instead of the full chat blob
    q = db.query(Message).filter(Message.topic_id == id).order_by(Message.created_at.desc())
    return q.offset(offset).limit(limit).all()
```
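
The frontend would then request GET /api/v1/chats/{id}/messages?offset=0&limit=20 when a chat is opened and fetch older pages on scroll; the route and query parameters mirror the sketch above rather than any existing Open WebUI endpoint.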

Option 3: Context Compression (Related to PR #22681)

  • Implement sliding window for model context
  • Keep full history in DB but only send recent N messages
  • Async background summarization for older context (a sliding-window sketch follows below)
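
A minimal sketch of the sliding-window idea, assuming each message is a dict with a content string and using a crude character count in place of a real tokenizer:

```python
def build_context(messages, max_chars=32000):
    # Walk backwards from the newest message and keep as many as fit;
    # older messages stay in the DB but are not sent to the model.
    window, used = [], 0
    for msg in reversed(messages):
        cost = len(msg["content"])
        if used + cost > max_chars:
            break
        window.append(msg)
        used += cost
    return list(reversed(window))
```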

Related Issues

  • PR #22681: Bypass RAG for full context files (addresses a related performance issue)
  • Similar performance reports: #20520, #20327, #19594

Additional Context

This issue particularly affects:

  • Long-form creative writing (roleplay, fiction)
  • Multi-session projects (coding, research)
  • Users requiring cross-device sync (Open WebUI's key advantage over local clients)

The current architecture forces users to choose between:

  1. Open WebUI: Cross-device sync but 3-minute latency
  2. Cherry Studio: Fast performance but locked to single device

Workarounds Currently Used

  1. Manual branching: Periodically export and start new chat branches
  2. External tools: Custom scripts to split and archive chat data
  3. Hybrid approach: Use Cherry Studio locally, Open WebUI only for archiving

Request

Consider prioritizing storage layer refactoring for v2.0 to support:

  • Row-based message storage
  • Pagination API
  • Lazy loading in frontend
  • Migration path for existing data

This would make Open WebUI truly competitive for professional/long-form use cases.


Labels: performance, enhancement, database, v2.0


@pr-validator-bot commented on GitHub (Apr 5, 2026):

⚠️ Missing Issue Title Prefix

@a86582751, your issue title is missing a prefix (e.g., bug:, feat:, docs:).

Please update your issue title to include one of the following prefixes:

  • bug: Bug report or error you've encountered
  • feat: Feature request or enhancement suggestion
  • docs: Documentation issue or improvement request
  • question: Question about usage or functionality
  • help: Request for help or support

Example: bug: Login fails when using special characters in password


@Classic298 commented on GitHub (Apr 8, 2026):

> Processing Flow per Message:
> SELECT chat FROM chat WHERE id = ? → Read 36MB (41ms)
> json.loads() → Parse to Python dict (5-10s)
> Pydantic validation → (10-20s)
> Append new message to 1248-item array (1s)
> json.dumps() → Serialize (5-10s)
> UPDATE chat SET chat = ? → Write 37MB (10s)
> Send to LLM API (16s)

Did you time / measure this?
I have much longer and larger chats and I don't come even close to these values.

A 3-minute latency before the chat is even sent to the provider is definitely not an Open WebUI fault; parsing a 36MB JSON file doesn't take 3 minutes.

The related issues you referenced are not related.

If you load a long chat initially, only 20 messages are fetched. So why are you proposing to add what is already in place?

Slop

Reference: github-starred/open-webui#35506