[GH-ISSUE #23418] feat: Performance Bottleneck - 3-Minute Latency with 400K+ Token Context in Roleplay Scenarios #35506

Closed
opened 2026-04-25 09:43:03 -05:00 by GiteaMirror · 2 comments

Originally created by @a86582751 on GitHub (Apr 5, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23418

Performance Bottleneck: 3-Minute Latency with 400K+ Token Context in Roleplay Scenarios

Bug Description

When using Open WebUI for long-form roleplay conversations (400K+ tokens, ~36MB of chat data), response latency balloons to 2-3 minutes even though the model itself remains responsive.

Key Observation: The delay occurs before the request reaches the LLM backend.

Environment

  • Open WebUI Version: v0.8.12 (Docker)
  • Database: SQLite (181MB webui.db)
  • Deployment: Docker with Nginx reverse proxy
  • Backend: CLI Proxy API (CPA) via host.docker.internal:8317
  • Model: Claude/Gemini via external API

Steps to Reproduce

  1. Create a new chat and engage in long-form roleplay
  2. Accumulate ~400K tokens (~1248 messages, 36MB JSON)
  3. Send a simple message like "继续" ("continue")
  4. Observe the timeline:
    • Browser shows request pending: ~3 minutes
    • CPA receives request: ~51s (model processing)
    • Response completes: ~16s (generation time)

Root Cause Analysis

Current Architecture (Problematic)

```sql
-- Open WebUI stores the entire chat as a single JSON blob
CREATE TABLE chat (
    id TEXT PRIMARY KEY,
    chat JSON  -- 36MB of nested JSON
);
```

Processing Flow per Message:

  1. `SELECT chat FROM chat WHERE id = ?` → Read 36MB (41ms)
  2. `json.loads()` → Parse to Python dict (5-10s)
  3. Pydantic validation → (10-20s)
  4. Append new message to 1248-item array (1s)
  5. `json.dumps()` → Serialize (5-10s)
  6. `UPDATE chat SET chat = ?` → Write 37MB (10s)
  7. Send to LLM API (16s)

Total: 2-3 minutes, with 80%+ of the time spent on JSON serialization (a rough timing sketch follows below)
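
To sanity-check where the time goes, a rough sketch like the one below can be run against an exported chat. The `chat.json` filename is hypothetical (a dump of the 36MB blob); the Pydantic and DB steps are omitted since they depend on Open WebUI internals.

```python
import json
import time

# Minimal timing sketch: measure the parse/serialize cost of a large chat blob.
# Assumes "chat.json" holds an exported copy of the ~36MB chat JSON.
with open("chat.json", "rb") as f:
    raw = f.read()

t0 = time.perf_counter()
data = json.loads(raw)    # step 2: bytes -> Python dict
t1 = time.perf_counter()
blob = json.dumps(data)   # step 5: dict -> JSON string
t2 = time.perf_counter()

print(f"loads: {t1 - t0:.2f}s  dumps: {t2 - t1:.2f}s  size: {len(raw) / 1e6:.1f}MB")
```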

Comparison with Efficient Architectures

ChatGPT/Claude approach:

  • Messages stored as individual rows
  • Pagination: Load only recent 20-50 messages
  • Incremental updates: O(1) complexity
  • Result: <1s perceived latency

Cherry Studio approach:

  • IndexedDB with Dexie.js
  • Block-based message storage
  • Lazy loading with virtual scroll
  • Result: Smooth performance even with long context

Benchmark Data

| Metric | Open WebUI | ChatGPT/Claude |
|--------|------------|----------------|
| Storage format | Single 36MB JSON | Row-based messages |
| Update complexity | O(n) - rewrite all | O(1) - append only |
| Query pattern | Full load | Pagination |
| 400K token latency | 2-3 minutes | <3 seconds |

Proposed Solutions

Option 1: Message Table Normalization (Recommended)

Split the monolithic JSON into normalized tables:

```sql
-- New schema
CREATE TABLE topics (
    id TEXT PRIMARY KEY,
    title TEXT,
    model TEXT,
    created_at TIMESTAMP
);

CREATE TABLE messages (
    id TEXT PRIMARY KEY,
    topic_id TEXT,
    role TEXT,
    content TEXT,
    created_at TIMESTAMP
);

-- SQLite does not support inline INDEX clauses inside CREATE TABLE;
-- the index needs its own statement
CREATE INDEX idx_topic_time ON messages (topic_id, created_at);
```

Benefits:

  • Incremental updates: Only insert new row
  • Pagination support: `LIMIT 50 OFFSET ?`
  • Backward compatible: Keep `chat` table as fallback (a migration sketch follows below)
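
A one-shot migration could copy messages out of the existing blob into the new table. This is only a sketch: the real layout of the chat JSON differs by version, so the top-level `messages` list and the role/content/timestamp field names below are assumptions, not Open WebUI's actual schema.

```python
import json
import sqlite3
import uuid

# Hypothetical migration from the chat JSON blob to the proposed messages table.
# The blob layout (a top-level "messages" list of dicts) is an assumption.
db = sqlite3.connect("webui.db")
rows = db.execute("SELECT id, chat FROM chat").fetchall()
for chat_id, blob in rows:
    for m in json.loads(blob).get("messages", []):
        db.execute(
            "INSERT OR IGNORE INTO messages (id, topic_id, role, content, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                m.get("id") or str(uuid.uuid4()),  # mint an id if the message has none
                chat_id,
                m.get("role"),
                m.get("content"),
                m.get("timestamp"),
            ),
        )
db.commit()
```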

Option 2: Lazy Loading API

Add a new endpoint that returns paginated messages:

```python
@app.get("/api/v1/chats/{id}/messages")
def get_messages(id: str, offset: int = 0, limit: int = 20):
    # Load only a window of recent messages instead of the full chat blob
    q = db.query(Message).filter(Message.topic_id == id).order_by(Message.created_at.desc())
    return q.offset(offset).limit(limit).all()
```
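
The frontend would then request GET /api/v1/chats/{id}/messages?offset=0&limit=20 when a chat is opened and fetch older pages on scroll; the route and query parameters mirror the sketch above rather than any existing Open WebUI endpoint.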

Option 3: Context Compression (Related to PR #22681)

  • Implement sliding window for model context
  • Keep full history in DB but only send recent N messages
  • Async background summarization for older context (a sliding-window sketch follows below)
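
A minimal sketch of the sliding-window idea, assuming each message is a dict with a content string and using a crude character count in place of a real tokenizer:

```python
def build_context(messages, max_chars=32000):
    # Walk backwards from the newest message and keep as many as fit;
    # older messages stay in the DB but are not sent to the model.
    window, used = [], 0
    for msg in reversed(messages):
        cost = len(msg["content"])
        if used + cost > max_chars:
            break
        window.append(msg)
        used += cost
    return list(reversed(window))
```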

Related Issues

  • PR #22681: Bypass RAG for full context files (addresses a related performance issue)
  • Similar performance reports: #20520, #20327, #19594

Additional Context

This issue particularly affects:

  • Long-form creative writing (roleplay, fiction)
  • Multi-session projects (coding, research)
  • Users requiring cross-device sync (Open WebUI's key advantage over local clients)

The current architecture forces users to choose between:

  1. Open WebUI: Cross-device sync but 3-minute latency
  2. Cherry Studio: Fast performance but locked to single device

Workarounds Currently Used

  1. Manual branching: Periodically export and start new chat branches
  2. External tools: Custom scripts to split and archive chat data
  3. Hybrid approach: Use Cherry Studio locally, Open WebUI only for archiving

Request

Consider prioritizing storage layer refactoring for v2.0 to support:

  • Row-based message storage
  • Pagination API
  • Lazy loading in frontend
  • Migration path for existing data

This would make Open WebUI truly competitive for professional/long-form use cases.


Labels: performance, enhancement, database, v2.0


@pr-validator-bot commented on GitHub (Apr 5, 2026):

⚠️ Missing Issue Title Prefix

@a86582751, your issue title is missing a prefix (e.g., bug:, feat:, docs:).

Please update your issue title to include one of the following prefixes:

  • bug: Bug report or error you've encountered
  • feat: Feature request or enhancement suggestion
  • docs: Documentation issue or improvement request
  • question: Question about usage or functionality
  • help: Request for help or support

Example: bug: Login fails when using special characters in password


@Classic298 commented on GitHub (Apr 8, 2026):

> Processing Flow per Message:
> SELECT chat FROM chat WHERE id = ? → Read 36MB (41ms)
> json.loads() → Parse to Python dict (5-10s)
> Pydantic validation → (10-20s)
> Append new message to 1248-item array (1s)
> json.dumps() → Serialize (5-10s)
> UPDATE chat SET chat = ? → Write 37MB (10s)
> Send to LLM API (16s)

Did you time / measure this?
I have much longer and larger chats and I don't come even close to these values.

A 3-minute latency before the chat is even sent to the provider is definitely not an Open WebUI fault; parsing a 36MB JSON file doesn't take 3 minutes.

The related issues you referenced are not related.

If you load a long chat initially, only 20 messages are fetched. So why are you proposing to add what is already in place?

Slop

Reference: github-starred/open-webui#35506