[GH-ISSUE #20327] issue: Unable to use any Open WebUI version newer than 0.6.25 due to hybrid search performance #57818

Closed
opened 2026-05-05 21:41:36 -05:00 by GiteaMirror · 9 comments
Owner

Originally created by @galvanoid on GitHub (Jan 2, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/20327

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

Latest

Ollama Version (if applicable)

No response

Operating System

Ubuntu 24.04

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

Upgrading Open WebUI to a newer version should not significantly change the latency characteristics of an existing retrieval and reranking workflow when configuration, data, and infrastructure remain the same.

In particular, when BM25 is disabled and retrieval effectively falls back to vector search (with reranking), the time between submitting a user query and the first reranker invocation should be comparable to what was observed in version 0.6.25.

Reranking passes should execute with similar performance to previous versions, and increases in total response time should be incremental and proportional to collection size, rather than introducing long idle delays or order-of-magnitude slowdowns.

Under these conditions, newer versions are expected to remain usable for real-world RAG workloads that were already supported in 0.6.25.

Actual Behavior

In versions newer than 0.6.25, the same retrieval and reranking workflow exhibits a significant increase in latency, even when BM25 is explicitly disabled and all other parameters remain unchanged.

After submitting a user query, there is a long delay before any retrieval or reranking activity begins. This delay is visible in the Open WebUI logs as a gap between the initial request and the first reranker invocation.

Once reranking starts, each reranker pass takes noticeably longer than in version 0.6.25. The combined effect is a substantial increase in total response time.

With collections of around 10k files, response generation may take several minutes to begin. With larger collections, response times can extend to tens of minutes, and in some cases the application becomes unresponsive before completing the request.

As a result, workflows that are usable and predictable in version 0.6.25 become impractical in later versions under otherwise identical conditions.

Examples:

In version 0.6.25, hybrid search can be enabled with BM25 disabled, and reranking can be configured with either a single pass or multiple passes through the retrieval generation interface. Under these conditions, the system behaves as expected and remains usable.

As a concrete example, using a model associated with a collection of approximately 10,000 files, total response time is under 30 seconds when three reranking passes are enabled, and around 10 seconds when using a single reranking pass.

In versions released after 0.6.25, the same setup produces very different results. Even with the BM25 slider explicitly set to 0, overall latency increases significantly. When observing the Open WebUI logs, there is a long delay before the first reranker call occurs, followed by reranker invocations that are noticeably slower than in version 0.6.25.

Using the same query, the same collection, and the same model (qwen3-30b-3b), any version newer than 0.6.25 takes approximately three minutes before it even begins generating a response.

It is worth noting that these measurements are based on relatively small collections. In my environment, other models are associated with collections totaling around 160,000 files. With collections of that size, it becomes practically impossible to use versions newer than 0.6.25, as response times can reach 15 to 20 minutes, and in some cases Open WebUI becomes unresponsive before completing the request.

Steps to Reproduce

Deploy Open WebUI version 0.6.25 and configure retrieval with a vector database containing a collection of approximately 10,000 files.

Associate the collection with a model such as qwen3-30b-3b and enable hybrid search, explicitly setting the BM25 slider to 0.

Enable reranking and configure either one reranking pass or multiple reranking passes through the retrieval generation interface.

Submit a query that triggers retrieval and reranking, and observe the time between submitting the query and the start of response generation, as well as the timing of reranker invocations in the logs.

Repeat the same steps using any Open WebUI version newer than 0.6.25, keeping the same model, collection, reranker configuration, hardware, and infrastructure.

Compare the delay before the first reranker call, the duration of individual reranker passes, and the total time until response generation begins.
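The "time until response generation begins" in the last step can be measured client-side from the streamed response rather than by eyeballing the UI. A minimal sketch of such a helper (the function name and the idea of iterating streamed chunks are illustrative; the actual endpoint and auth details are deployment-specific):

```python
import time
from typing import Iterable, Optional, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[Optional[float], float]:
    """Measure (seconds to first non-empty chunk, total seconds) for a stream.

    `chunks` can be any iterable of streamed response chunks, e.g. the body
    of a streaming chat-completions request read in small blocks.
    """
    start = time.monotonic()
    t_first = None
    for chunk in chunks:
        if t_first is None and chunk:
            t_first = time.monotonic() - start
    return t_first, time.monotonic() - start
```

Feeding it the streamed body of the same query against 0.6.25 and against a newer version yields directly comparable numbers for the delay before generation starts.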

Logs & Screenshots

Screenshot from Open WebUI v0.6.25 showing the hybrid search configuration used in the tests, with BM25 disabled and reranking enabled. Under this configuration, retrieval and reranking execute with low and predictable latency.

Image

Additional Information

Newer versions introduce enriched BM25-based retrieval, which noticeably improves relevance and source filtering, but cannot currently be used in this workflow due to the performance problem described above.

GiteaMirror added the bug label 2026-05-05 21:41:36 -05:00

@owui-terminator[bot] commented on GitHub (Jan 2, 2026):

🔍 Similar Issues Found

I found some existing issues that might be related to this one. Please check if any of these are duplicates or contain helpful solutions:

  1. #20019 issue:
    by j63440490 • Dec 17, 2025 • bug

  2. #19777 issue:
    by Yaute7 • Dec 05, 2025 • bug

  3. #20092 issue:
    by VideoRyan • Dec 22, 2025 • bug

  4. #19864 issue:
    by Haervwe • Dec 10, 2025 • bug

  5. #14529 issue: Open WebUI does not work on versions after version 0.6.7
    by OpenSoftware-World • May 30, 2025 • bug

Show 5 more related issues:
  6. #18145 issue: 0.6.33 regression
    by Ark-Levy • Oct 08, 2025 • bug

  7. #19563 issue:
    by naruto7g • Nov 28, 2025 • bug

  8. #16540 issue:
    by Sawrz • Aug 12, 2025 • bug

  9. #16959 issue:
    by Te-eMster • Aug 27, 2025 • bug

  10. #19417 issue: v0.6.37 SQL Error
    by AKHYP • Nov 24, 2025 • bug


💡 Tips:

  • If this is a duplicate, please consider closing this issue and adding any additional details to the existing one
  • If you found a solution in any of these issues, please share it here to help others

This comment was generated automatically by a bot. Please react with a 👍 if this comment was helpful, or a 👎 if it was not.


@rgaricano commented on GitHub (Jan 2, 2026):

@galvanoid
In 0.6.26 the most significant change was the refactoring of the hybrid search system to use async/await patterns.

  • The query_collection_with_hybrid_search function was converted to use async/await and asyncio.gather for parallel processing of queries across collections.
  • The system now fetches collection data sequentially before processing queries, which could introduce delays with large collections.
  • The reranking system was updated to support user context and external rerankers.

The main bottlenecks appear to be:

  • Sequential Collection Fetching: The code fetches collection data one by one before any parallel processing begins. With large collections (10k-160k files), this could cause significant delays.
  • Async Overhead: The conversion to async may have introduced overhead, especially for the reranking compressor's acompress_documents method.
  • BM25 Processing: Even when BM25 weight is set to 0, the system still initializes BM25 retrievers and processes texts.

FIXES (For reference):

1. Optimize Collection Fetching with Parallel Loading
(the main bottleneck is sequential collection fetching in query_collection_with_hybrid_search)
Replace 6f1486ffd0/backend/open_webui/retrieval/utils.py (L472-L482)
with

async def fetch_collection_data(collection_names: list[str]) -> dict:
    async def fetch_single(collection_name: str):
        try:
            # VECTOR_DB_CLIENT.get is a blocking call; offload it to a
            # thread so asyncio.gather can actually overlap the fetches
            # instead of running them serially on the event loop.
            result = await asyncio.to_thread(
                VECTOR_DB_CLIENT.get, collection_name=collection_name
            )
            return collection_name, result
        except Exception as e:
            log.exception(f"Failed to fetch collection {collection_name}: {e}")
            return collection_name, None

    tasks = [fetch_single(name) for name in collection_names]
    results = await asyncio.gather(*tasks)
    return dict(results)
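Note that asyncio.gather only overlaps work when each task actually yields to the event loop; a blocking client call inside an async def still runs serially. A self-contained simulation of the difference (hypothetical 50 ms per-collection latency, not OWUI code):

```python
import asyncio
import time


def blocking_get(name: str) -> dict:
    """Stand-in for a blocking vector-DB fetch (~50 ms of I/O)."""
    time.sleep(0.05)
    return {"collection": name}


async def fetch_sequential(names):
    # Blocking calls inside async code: they run one after another.
    return [blocking_get(n) for n in names]


async def fetch_parallel(names):
    # Offload each blocking call to a thread so gather overlaps them.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_get, n) for n in names)
    )


def timed(coro):
    """Run a coroutine and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = asyncio.run(coro)
    return result, time.monotonic() - start


if __name__ == "__main__":
    names = [f"col-{i}" for i in range(10)]
    _, t_seq = timed(fetch_sequential(names))  # roughly 10 x 50 ms
    _, t_par = timed(fetch_parallel(names))    # fetches overlap in threads
    print(f"sequential: {t_seq:.2f}s, parallel: {t_par:.2f}s")
```

With 10 simulated collections the sequential path takes roughly the sum of the per-collection latencies, while the threaded path takes a fraction of that, which is the behavior the fix above aims for.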

2. Add Early BM25 Bypass
(even with BM25 weight set to 0, the system still initializes BM25 retrievers)
In 6f1486ffd0/backend/open_webui/retrieval/utils.py (L239)
add

if hybrid_bm25_weight <= 0 and not enable_enriched_texts:
    # Skip BM25 entirely when weight is 0 and enrichment is disabled
    vector_search_retriever = VectorSearchRetriever(
        collection_name=collection_name,
        embedding_function=embedding_function,
        top_k=k,
    )

    compressor = RerankCompressor(
        embedding_function=embedding_function,
        top_n=k_reranker,
        reranking_function=reranking_function,
        r_score=r,
    )

    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=vector_search_retriever
    )

    result = await compression_retriever.ainvoke(query)
    # ...then return early with `result`, bypassing the BM25/ensemble path below

Other workarounds for problematic collections:

  • Configure Full Context Mode for Large Collections
  • Disable Hybrid Search Temporarily
  • Optimize Reranking Configuration

@galvanoid commented on GitHub (Jan 2, 2026):

@rgaricano

Thanks a lot for the detailed analysis and for taking the time to explain what changed internally after 0.6.25.
This matches very closely what I’m observing in practice, especially the long idle gap before the first reranker call and the fact that performance degrades even when BM25 is explicitly set to 0. The explanation about sequential collection fetching and BM25 still being initialized despite having zero weight makes a lot of sense in light of the timings I’m seeing.
In particular, the “early BM25 bypass” you describe aligns exactly with the behavior I was implicitly relying on in 0.6.25. When BM25 is set to 0 and text enrichment is disabled, having it act as a true disable would restore the expected vector + reranking workflow and avoid the overhead entirely.
Hopefully this can make its way into future versions, as it would allow benefiting from the newer features while keeping the performance characteristics that made 0.6.25 usable in real-world RAG workflows.
Thanks again for the insights and the concrete suggestions.


@silentoplayz commented on GitHub (Jan 3, 2026):

Hi @galvanoid! 👋

I've created a PR to address the performance regression you reported: #20342

What's Fixed

The PR implements two optimizations to restore v0.6.25 performance levels:

  1. Parallel collection fetching - Uses asyncio.gather to fetch multiple collections concurrently instead of sequentially, eliminating the N-1 sequential wait bottleneck
  2. Early BM25 bypass - Skips BM25 initialization entirely when weight is 0 and enrichment is disabled, avoiding unnecessary processing

Testing Request

Since you have the perfect test environment with your 10k and 160k file collections, could you help verify this fix?

Expected improvements:

  • ~10k files: Response time should return to ~10-30 seconds (from ~3 minutes)
  • ~160k files: Should eliminate the 15-20 minute timeouts

The changes are conservative and backward-compatible:

  • Only activates when BM25 weight = 0 AND enrichment disabled
  • Falls back to existing behavior otherwise

Please let me know if this resolves the latency issues you experienced! 🚀


@galvanoid commented on GitHub (Jan 4, 2026):

> Hi @galvanoid! 👋
>
> I've created a PR to address the performance regression you reported: #20342
>
> What's Fixed
>
> The PR implements two optimizations to restore v0.6.25 performance levels:
>
>   1. Parallel collection fetching - Uses asyncio.gather to fetch multiple collections concurrently instead of sequentially, eliminating the N-1 sequential wait bottleneck
>   2. Early BM25 bypass - Skips BM25 initialization entirely when weight is 0 and enrichment is disabled, avoiding unnecessary processing
>
> Testing Request
>
> Since you have the perfect test environment with your 10k and 160k file collections, could you help verify this fix?
>
> Expected improvements:
>
>   • ~10k files: Response time should return to ~10-30 seconds (from ~3 minutes)
>   • ~160k files: Should eliminate the 15-20 minute timeouts
>
> The changes are conservative and backward-compatible:
>
>   • Only activates when BM25 weight = 0 AND enrichment disabled
>   • Falls back to existing behavior otherwise
>
> Please let me know if this resolves the latency issues you experienced! 🚀

Thanks a lot for the PR!

I ran benchmarks comparing v0.6.25 against the current PR, focusing only on the reranking phase, using an external reranker instrumented specifically for timing analysis.

Methodology

Instead of relying on internal timing or UI-level measurements, I used Open WebUI’s built-in support for external rerankers to attach a custom reranker service implemented with FastAPI.
This external reranker acts as a capture layer, allowing precise measurement of:

  • Time from CID assignment to first batch arrival
  • Time between individual batches
  • Total time from CID assignment to the final batch delivery

This approach ensures:

  • No changes to OWUI core logic
  • Identical batch sizes and batch counts
  • Accurate, server-side timestamps for all reranking events

The same setup was used for both versions (v0.6.25 and PR).

Two collection sizes were tested:

  • ~10k documents (7 batches)
  • ~160k documents (17 batches)

Each scenario was executed twice to account for run-to-run variability.
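The per-run metrics reported below (time to first batch, total time, mean and median batch interval) can be reduced from the captured timestamps with a few lines. A hypothetical sketch of that aggregation step (the FastAPI capture service itself is not shown; the function name is illustrative):

```python
import statistics
from typing import List


def rerank_run_metrics(cid_assigned_at: float, batch_times: List[float]) -> dict:
    """Reduce server-side timestamps for one run into the reported metrics.

    `cid_assigned_at` is the moment the CID was assigned; `batch_times`
    are the arrival timestamps of each reranker batch, in seconds.
    """
    batch_times = sorted(batch_times)
    # Gaps between consecutive batch arrivals.
    deltas = [b - a for a, b in zip(batch_times, batch_times[1:])]
    return {
        "batches": len(batch_times),
        "t_first_batch_s": batch_times[0] - cid_assigned_at,
        "t_total_s": batch_times[-1] - cid_assigned_at,
        "batch_mean_dt_s": statistics.mean(deltas) if deltas else 0.0,
        "batch_median_dt_s": statistics.median(deltas) if deltas else 0.0,
    }
```

Running this over each captured run, for both versions, produces rows directly comparable to the per-run table below.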

Results

~10k document collection

v0.6.25

  • Time to first batch: ~3.0–3.4 s
  • Total rerank time: ~6.6–7.4 s
  • Mean batch interval: ~0.6–0.65 s

PR

  • Time to first batch: ~5.7–6.1 s
  • Total rerank time: ~28.2–28.8 s
  • Mean batch interval: ~3.7–3.9 s

--> For this collection size, v0.6.25 is consistently ~4× faster than the PR.

~160k document collection

v0.6.25

  • Time to first batch: ~2.9–4.3 s
  • Total rerank time: ~12.7–28.5 s
  • Mean batch interval: ~0.6–1.4 s

PR

  • Time to first batch: ~28–31 s
  • Total rerank time: ~227–231 s
  • Mean batch interval: ~12.4 s (median ~6 s)

--> For large collections, v0.6.25 is between ~8× and ~18× faster, with an absolute delta of ~200 seconds.

Key observations

Batch counts are identical across versions, ruling out:

  • Collection size effects
  • Reranker model performance differences

The regression is dominated by:

  • Much later first batch delivery
  • Significantly larger inter-batch gaps

This points to overhead in batch orchestration / hybrid search iteration / scheduling, not in the reranking model itself.

Conclusion

Although the PR improves over some intermediate versions, it remains substantially slower than v0.6.25, especially for large collections where reranking latency reaches multiple minutes.
From a user perspective, v0.6.25 currently offers significantly better latency and responsiveness for reranking-heavy workloads.

PS: The reason the 10k collection produces 7 batches and the 160k collection produces 17 batches is that only a single reranker pass was active instead of three passes.

In the PR version, even when three reranker passes were enabled in the UI (Retrieval / Generation Query settings), the system was effectively executing only one pass. For consistency and fairness, I therefore disabled multi-pass reranking in both versions, ensuring that PR and v0.6.25 generated exactly the same number of batches.

It is also worth noting that v0.6.25 remains faster even when three reranker passes are enabled, which further reinforces the performance gap observed in these benchmarks.


Found 8 rerank capture files:

  • rerank-0625-10k-2.txt
  • rerank-0625-160k-2.txt
  • rerank-pr-10k-2.txt
  • rerank-0625-160k-1.txt
  • rerank-0625-10k-1.txt
  • rerank-pr-160k-1.txt
  • rerank-pr-10k-1.txt
  • rerank-pr-160k-2.txt

=== RERANK – PER RUN ===

version  collection  run  cid                               batches  t_first_batch_s  t_total_s   batch_mean_dt_s  batch_median_dt_s
0625     10k         1    2bdc9326e94c4eccae28d096a79270ec  7        2.961792         6.644569    0.613211         0.609952
pr       10k         1    0204ee6615564f168264594946c4a08d  7        6.082243         28.839900   3.792290         3.889942
0625     10k         2    029c033fb4514d0e89b45f1f824e7866  7        3.419669         7.363968    0.656794         0.645497
pr       10k         2    e032ce542db94fc8afc06983162cd0f5  7        5.729482         28.260245   3.754476         4.236242
0625     160k        1    ae59f006262740868c5a7358b1dee3dd  17       4.302075         28.451026   1.474028         1.351415
pr       160k        1    e747982c45c34ef88b4b5a79f1a077b9  17       28.176346        227.034976  12.393156        6.091793
0625     160k        2    b873c3a76ced4c168ac221d021b8f108  17       2.913588         12.716752   0.612483         0.606880
pr       160k        2    711af870efa240aa98f6ba29433baee0  17       31.317244        230.785459  12.466538        6.468858

=== RERANK – PR vs 0.6.25 ===

| collection | run | batch_mean_dt_s__0625 | batch_mean_dt_s__pr | batches__0625 | batches__pr | t_first_batch_s__0625 | t_first_batch_s__pr | t_total_s__0625 | t_total_s__pr | delta_total_s_pr_minus_0625 | speedup_0625_over_pr |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10k | 1 | 0.613211 | 3.792290 | 7 | 7 | 2.961792 | 6.082243 | 6.644569 | 28.839900 | 22.195331 | 4.340372 |
| 10k | 2 | 0.656794 | 3.754476 | 7 | 7 | 3.419669 | 5.729482 | 7.363968 | 28.260245 | 20.896277 | 3.837638 |
| 160k | 1 | 1.474028 | 12.393156 | 17 | 17 | 4.302075 | 28.176346 | 28.451026 | 227.034976 | 198.583950 | 7.979852 |
| 160k | 2 | 0.612483 | 12.466538 | 17 | 17 | 2.913588 | 31.317244 | 12.716752 | 230.785459 | 218.068707 | 18.148145 |

Saved:

  • rerank_per_run.csv
  • rerank_pr_vs_0625.csv
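The two derived columns in the comparison table are straightforward to recompute from paired per-run measurements; a minimal stdlib-only sketch (field names mirror the CSV headers above):

```python
def compare(run_0625: dict, run_pr: dict) -> dict:
    """Derive the delta/speedup columns from a matched pair of per-run rows.

    Each argument is a dict containing at least t_total_s, as in
    rerank_per_run.csv.
    """
    return {
        "delta_total_s_pr_minus_0625": run_pr["t_total_s"] - run_0625["t_total_s"],
        "speedup_0625_over_pr": run_pr["t_total_s"] / run_0625["t_total_s"],
    }
```

For example, for the 10k/run-1 pair this reproduces the ~22 s delta and ~4.34× speedup shown above.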
![Image](https://github.com/user-attachments/assets/66e50948-e4f9-4461-8e71-6b22e288d2b7)
![Image](https://github.com/user-attachments/assets/97dc30db-9c06-459b-9f61-c11a7113f7b4)

@silentoplayz commented on GitHub (Jan 4, 2026):

Hi again @galvanoid! Thanks for the detailed benchmarks. That was incredibly helpful in hopefully pinpointing the bottleneck.

I've pushed a significant update to the PR branch that seeks to directly address your findings:

  1. Optimization: Conditional Fetching: The system now completely skips fetching collection data if BM25 and enrichment are disabled. This should hopefully eliminate the high startup delay you observed for the 160k collection in vector-only mode.
  2. Optimization: Sync Fetch for Single Collection: For cases where data is needed (BM25 enabled) but there's only one collection, we now fetch synchronously to avoid the asyncio overhead you identified.
  3. Fix: Corrected a logic issue where skipping the fetch could inadvertently skip the query.

These changes are live on the PR branch. Could you give it another spin with your benchmark setup?
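A hypothetical sketch of the fetch strategy described in points 1–2 above; the function names and signatures here are illustrative, not Open WebUI's actual code:

```python
import asyncio


async def gather_collection_docs(collection_names, fetch_one,
                                 bm25_weight=0.0, enrichment_enabled=False):
    """Illustrative fetch strategy for hybrid search collection data.

    - BM25 off and enrichment off: skip fetching entirely (vector-only path).
    - Single collection: await the coroutine directly, avoiding
      asyncio.gather scheduling overhead.
    - Multiple collections: fetch concurrently with asyncio.gather.
    """
    if bm25_weight == 0 and not enrichment_enabled:
        return []  # vector search alone does not need the raw documents
    if len(collection_names) == 1:
        return [await fetch_one(collection_names[0])]
    return list(await asyncio.gather(*(fetch_one(n) for n in collection_names)))
```

Note that a skip path like the first branch is only safe if nothing downstream depends on the fetched data as a side effect; whether that holds here is exactly what the follow-up testing below probes.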


@galvanoid commented on GitHub (Jan 4, 2026):

> Hi again @galvanoid! Thanks for the detailed benchmarks. That was incredibly helpful in hopefully pinpointing the bottleneck.
>
> I've pushed a significant update to the PR branch that seeks to directly address your findings:
>
> 1. Optimization: Conditional Fetching: The system now completely skips fetching collection data if BM25 and enrichment are disabled. This should hopefully eliminate the high startup delay you observed for the 160k collection in vector-only mode.
> 2. Optimization: Sync Fetch for Single Collection: For cases where data is needed (BM25 enabled) but there's only one collection, we now fetch synchronously to avoid the asyncio overhead you identified.
> 3. Fix: Corrected a logic issue where skipping the fetch could inadvertently skip the query.
>
> These changes are live on the PR branch. Could you give it another spin with your benchmark setup?

I can reproduce a regression on the updated PR branch.
In vector-only mode (BM25 OFF + enrichment OFF), retrieval starts but immediately logs query_doc_with_hybrid_search:no_docs multiple times.
Importantly, there are no Qdrant calls at all in the OWUI logs (points/query / points/scroll never appear), so the pipeline is ending before it reaches the vector DB.
My Qdrant config is correct (QDRANT_URI=http://127.0.0.1:6333, multitenancy enabled) and Qdrant is healthy/accessible from inside the OWUI container.
This suggests the new “conditional fetch skip” path may be skipping the step that populates doc/chunk mappings, causing query_doc_with_hybrid_search to run with empty docs and return no_docs without querying Qdrant.


@silentoplayz commented on GitHub (Jan 5, 2026):

Thanks for catching that @galvanoid! I pushed a fix (commit 895276a08).

What happened: The "conditional skip" optimization was too aggressive. It skipped fetching even though query_doc_with_hybrid_search needs the collection data to validate document existence before proceeding.

The fix: We now always fetch the collection data (as required), but I kept the sync fetch optimization for single collections.

Could you try again? Vector-only mode should work correctly now. 🙏


@galvanoid commented on GitHub (Jan 5, 2026):

Hi! Quick update after pulling/rebuilding again.

Now the behavior is even more “silent” in my setup:

With Hybrid Search ON, sending a query produces no retrieval activity in the backend logs at all (no “Starting hybrid search”, no no_docs, no Qdrant calls, no reranker calls).

With Hybrid Search OFF, the same query immediately produces the expected retrieval/Qdrant activity in logs.

So it looks like, in the current PR state, the “hybrid” code path is not being entered (or it’s short-circuiting before any of the retrieval logging happens).

Reference: github-starred/open-webui#57818