issue: Severe slowdown when collections are associated with a model (vs. passing knowledge per request). Model editor also takes ~5 minutes to open with large collections #6338

Open
opened 2025-11-11 16:51:53 -06:00 by GiteaMirror · 2 comments
Owner

Originally created by @galvanoid on GitHub (Sep 7, 2025).

Originally assigned to: @tjbck on GitHub.

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

0.6.26

Ollama Version (if applicable)

No response

Operating System

Ubuntu 24.04

Browser (if applicable)

Chrome, Edge, Firefox, Chromium

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

Expected:

Comparable latency in both modes: associated (knowledge collections attached to the model) vs. per-request (knowledge collections sent via the request).

The model editor should open within seconds, regardless of associated collections’ size.

Actual Behavior

Associated mode: high latency; model editor takes minutes to open (e.g., ~5 min).

Per-request mode: fast and stable.

Steps to Reproduce

Case A — Slow: collections associated to the model

In OWUI, associate the large collection(s) (50k–100k documents) with the model (in the admin settings).

Send a basic chat/RAG query.

Observe very high latency (many seconds/minutes).

Open the model editor for that model: the page takes ~5 minutes to become interactive when large collections are attached. If an LLM request is streaming during that time, it stops.

In this case (knowledge associated with a model), the response takes up to 10 minutes or more to begin.

Case B — Fast: pass collections per request

Ensure the model has no associated collections.

Call the OWUI chat API with the same model but include the collections in the request body (e.g. knowledges: [...]); a sketch follows these steps.

Observe fast responses and a snappy model editor.

In this case, the response begins after about 1 minute.

Same Qdrant, same collection, same embedder, same network. The only change is association vs. per-request collections.
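
For reference, a minimal sketch of the Case B request, assuming Open WebUI's OpenAI-compatible chat endpoint and using the knowledges: [...] field exactly as in the example above (the actual request schema may differ); the URL, API key, model name, and collection ID are placeholders:

```python
# Hypothetical sketch of the fast path (Case B): the model has no associated
# collections; the knowledge collections ride along in the request body.
import requests

OWUI_URL = "http://localhost:3000"  # assumption: local OWUI instance
API_KEY = "sk-..."                  # assumption: an OWUI API key

resp = requests.post(
    f"{OWUI_URL}/api/chat/completions",  # assumed OpenAI-compatible endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "my-model",  # placeholder: same model as in Case A
        "messages": [{"role": "user", "content": "basic RAG query"}],
        # Per-request collections; field name taken from this report's example.
        "knowledges": [{"id": "my-collection-id"}],
    },
    timeout=600,
)
print(resp.status_code)
```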

Logs & Screenshots

Qdrant logs during:

model editor open (slow)

a query with associated collections (slow)

the same query with collections in request (fast)

docker logs -f <qdrant-container>

(Look for /points/scroll vs /search, limit, with_vector, with_payload sizes, etc.)

OWUI logs with debug level (LOG_LEVEL=debug or similar) to see which calls are made when opening the editor and when building the query in the associated path.

Additional Information

What I Tested / Ruled Out

Qdrant is healthy: status: green; direct /search calls are fast when using sensible params (with_vector=false, minimal with_payload.include, small limit) — see the sketch after this list.

Network is not the issue: OWUI and Qdrant are co-located or on the same LAN; negligible RTT.

Embedder is fine: snowflake-arctic-embed2 (1024D) matches collection; no on-the-fly re-embedding.

Payload indices: Tried adding/removing payload indexes (source, file_id, start_index). The associated vs. per-request gap remains.

RAG optimizations (e.g., filtering to start_index=0 for summaries, “summary-mirror” collections, pre/post-retrieval hooks) speed up Qdrant, but the slowdown only appears when collections are associated to the model.

Extra symptom: opening the model editor with large associated collections is extremely slow ⇒ likely doing heavy enumeration/scroll or a large prefetch to build UI metadata.
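
As a baseline, this is the kind of lean direct /search call (referenced in the first item above) that stays fast against the same collection — a minimal qdrant-client sketch, with the collection name and query vector as placeholders:

```python
# Minimal sketch of a "sensible params" search: no vectors returned,
# minimal payload include list, small limit. All names are placeholders.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="my_collection",
    query_vector=[0.0] * 1024,           # 1024-D, matching snowflake-arctic-embed2
    limit=10,                            # small limit
    with_vectors=False,                  # with_vector=false
    with_payload=["source", "file_id"],  # minimal with_payload.include
)
for hit in hits:
    print(hit.id, hit.score)
```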

Additional Observation: Qdrant activity starts immediately with per-request collections, but is delayed when collections are associated with the model

Using btop to watch CPU:

Per-request collections (fast path): right after I send the chat request, the Qdrant process ramps up CPU immediately (all 30 cores light up within ~1 s).

Associated collections (slow path): after I send the same request, there’s a noticeable idle gap before Qdrant shows any CPU activity. Only after several seconds does Qdrant begin to work.

Hypothesis for Maintainers

In the “associated collections” path, OWUI might:

Prefetch many points/chunks to build file lists, previews, or counts (possibly via /points/scroll), with a liberal with_payload, with_vector=true, or a high limit.

Issue N+1 queries per collection to compute aggregates (e.g., deduplication of file_id) instead of using lighter endpoints (e.g., /points/count) or lazy evaluation; a sketch contrasting the two follows this section.

Use search_params.exact=true or defaults that trigger full scans.

In contrast, the “per-request” path seems to use a leaner retrieval: typically 1–2 /search calls with with_vector=false, minimal payload.include, and small limit, hence fast.
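
To make the hypothesized difference concrete, here is a sketch under assumptions (placeholder names, qdrant-client) contrasting a full scroll-based enumeration — the kind of prefetch the associated path may be doing — with a single /points/count call:

```python
# Sketch contrasting the suspected heavy path with a lighter alternative.
# Scrolling every point to deduplicate file_id transfers O(points) payloads;
# a count request transfers almost nothing.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "my_collection"  # placeholder

# Heavy: enumerate all points just to build file lists / counts.
file_ids, offset = set(), None
while True:
    points, offset = client.scroll(
        collection_name=COLLECTION,
        limit=1000,
        offset=offset,
        with_payload=["file_id"],
        with_vectors=False,
    )
    file_ids.update(p.payload.get("file_id") for p in points)
    if offset is None:
        break

# Light: one approximate count, no payload transfer.
total = client.count(collection_name=COLLECTION, exact=False).count
print(f"{len(file_ids)} files, {total} points")
```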

GiteaMirror added the bug label 2025-11-11 16:51:53 -06:00
Author
Owner

@silentoplayz commented on GitHub (Oct 21, 2025):

Related - https://github.com/open-webui/open-webui/issues/17998

Author
Owner

@deliciousbob commented on GitHub (Oct 24, 2025):

Hi @galvanoid, we have the same issue with 70K files; it already starts to get worse at 10K files.
There is always a noticeable delay before the request is sent to the API endpoints / vector DB.

The delay seems to correlate with the loading time of the knowledge list in the chat (via + or #) and in Workspace/Knowledge.
Our wait time is approx. 15–30 s before a list loads or the request is sent to the endpoints.

One of the collaborators has already created a workaround to disable file listing in the chat (see https://github.com/open-webui/open-webui/pull/18292); this reduces the listing of the knowledge collections, but I assume this has to be extended to more points in the code (e.g., after sending your prompt).


Reference: github-starred/open-webui#6338