[GH-ISSUE #20327] Unable to use any Open WebUI version newer than 0.6.25 due to hybrid search performance
Originally created by @galvanoid on GitHub (Jan 2, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/20327
Check Existing Issues
Installation Method
Docker
Open WebUI Version
Latest
Ollama Version (if applicable)
No response
Operating System
Ubuntu 24.04
Browser (if applicable)
No response
Confirmation
Expected Behavior
Upgrading Open WebUI to a newer version should not significantly change the latency characteristics of an existing retrieval and reranking workflow when configuration, data, and infrastructure remain the same.
In particular, when BM25 is disabled and only lexical or hybrid retrieval is used, the time between submitting a user query and the first reranker invocation should be comparable to what was observed in version 0.6.25.
Reranking passes should execute with similar performance to previous versions, and increases in total response time should be incremental and proportional to collection size, rather than introducing long idle delays or order-of-magnitude slowdowns.
Under these conditions, newer versions are expected to remain usable for real-world RAG workloads that were already supported in 0.6.25.
Actual Behavior
In versions newer than 0.6.25, the same retrieval and reranking workflow exhibits a significant increase in latency, even when BM25 is explicitly disabled and all other parameters remain unchanged.
After submitting a user query, there is a long delay before any retrieval or reranking activity begins. This delay is visible in the Open WebUI logs as a gap between the initial request and the first reranker invocation.
Once reranking starts, each reranker pass takes noticeably longer than in version 0.6.25. The combined effect is a substantial increase in total response time.
With collections of around 10k files, response generation may take several minutes to begin. With larger collections, response times can extend to tens of minutes, and in some cases the application becomes unresponsive before completing the request.
As a result, workflows that are usable and predictable in version 0.6.25 become impractical in later versions under otherwise identical conditions.
Examples:
In version 0.6.25, hybrid search can be enabled with BM25 disabled, and reranking can be configured with either a single pass or multiple passes through the retrieval generation interface. Under these conditions, the system behaves as expected and remains usable.
As a concrete example, using a model associated with a collection of approximately 10,000 files, total response time is under 30 seconds when three reranking passes are enabled, and around 10 seconds when using a single reranking pass.
In versions released after 0.6.25, the same setup produces very different results. Even with the BM25 slider explicitly set to 0, overall latency increases significantly. When observing the Open WebUI logs, there is a long delay before the first reranker call occurs, followed by reranker invocations that are noticeably slower than in version 0.6.25.
Using the same query, the same collection, and the same model (qwen3-30b-3b), any version newer than 0.6.25 takes approximately three minutes before it even begins generating a response.
It is worth noting that these measurements are based on relatively small collections. In my environment, other models are associated with collections totaling around 160,000 files. With collections of that size, it becomes practically impossible to use versions newer than 0.6.25, as response times can reach 15 to 20 minutes, and in some cases Open WebUI becomes unresponsive before completing the request.
Steps to Reproduce
Deploy Open WebUI version 0.6.25 and configure retrieval with a vector database containing a collection of approximately 10,000 files.
Associate the collection with a model such as qwen3-30b-3b and enable hybrid search, explicitly setting the BM25 slider to 0.
Enable reranking and configure either one reranking pass or multiple reranking passes through the retrieval generation interface.
Submit a query that triggers retrieval and reranking, and observe the time between submitting the query and the start of response generation, as well as the timing of reranker invocations in the logs.
Repeat the same steps using any Open WebUI version newer than 0.6.25, keeping the same model, collection, reranker configuration, hardware, and infrastructure.
Compare the delay before the first reranker call, the duration of individual reranker passes, and the total time until response generation begins.
Logs & Screenshots
Screenshot from Open WebUI v0.6.25 showing the hybrid search configuration used in the tests, with BM25 disabled and reranking enabled. Under this configuration, retrieval and reranking execute with low and predictable latency.
Additional Information
Newer versions introduce enriched BM25-based retrieval, which noticeably improves relevance and source filtering, but cannot currently be used in this workflow due to the performance problem described above.
@owui-terminator[bot] commented on GitHub (Jan 2, 2026):
🔍 Similar Issues Found
I found some existing issues that might be related to this one. Please check if any of these are duplicates or contain helpful solutions:
#20019 • by j63440490 • Dec 17, 2025 • bug
#19777 • by Yaute7 • Dec 05, 2025 • bug
#20092 • by VideoRyan • Dec 22, 2025 • bug
#19864 • by Haervwe • Dec 10, 2025 • bug
#14529 Open WebUI does not work on versions after version 0.6.7 • by OpenSoftware-World • May 30, 2025 • bug
#18145 0.6.33 regression • by Ark-Levy • Oct 08, 2025 • bug
#19563 • by naruto7g • Nov 28, 2025 • bug
#16540 • by Sawrz • Aug 12, 2025 • bug
#16959 • by Te-eMster • Aug 27, 2025 • bug
#19417 v0.6.37 SQL Error • by AKHYP • Nov 24, 2025 • bug
This comment was generated automatically by a bot. Please react with a 👍 if this comment was helpful, or a 👎 if it was not.
@rgaricano commented on GitHub (Jan 2, 2026):
@galvanoid
In 0.6.26 the most significant change was the refactoring of the hybrid search system to use async/await patterns.
The `query_collection_with_hybrid_search` function was converted to use `async`/`await` and `asyncio.gather` for parallel processing of queries across collections. The main bottlenecks appear to be sequential collection fetching and BM25 retriever initialization even when the BM25 weight is set to 0.
FIXES (For reference):
1. Optimize Collection Fetching with Parallel Loading
(the main bottleneck is sequential collection fetching in `query_collection_with_hybrid_search`)
Replace 6f1486ffd0/backend/open_webui/retrieval/utils.py (L472-L482) with:
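A minimal sketch of the parallel-fetch idea, assuming a hypothetical `fetch_collection` coroutine standing in for the real vector-DB fetch (illustrative only, not the actual utils.py code):

```python
import asyncio

async def fetch_collection(name: str) -> dict:
    # Hypothetical stand-in for the real per-collection fetch;
    # the sleep simulates vector-DB I/O latency.
    await asyncio.sleep(0.1)
    return {"name": name, "docs": []}

async def fetch_all(names: list[str]) -> list[dict]:
    # Sequential form (the reported bottleneck): total time ~ 0.1 s * N.
    #   return [await fetch_collection(n) for n in names]
    # Parallel form: all fetches overlap, so total time stays ~ 0.1 s.
    return await asyncio.gather(*(fetch_collection(n) for n in names))

if __name__ == "__main__":
    print(asyncio.run(fetch_all(["kb-a", "kb-b", "kb-c"])))
```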
2. Add Early BM25 Bypass
(even with BM25 weight set to 0, the system still initializes BM25 retrievers)
In 6f1486ffd0/backend/open_webui/retrieval/utils.py (L239), add:
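A minimal sketch of the bypass check, with illustrative names rather than the actual utils.py internals:

```python
def build_retrievers(docs: list[str], bm25_weight: float) -> list[str]:
    # Illustrative sketch; names are hypothetical, not OWUI's real API.
    retrievers = ["vector"]  # vector retrieval always runs
    if bm25_weight > 0:
        # Building a BM25 retriever tokenizes and indexes every document,
        # which is expensive for 10k-160k file collections.
        retrievers.append("bm25")
    # With weight 0 the BM25 retriever is never constructed, so a
    # "disabled" slider is a true disable instead of initialize-then-ignore.
    return retrievers
```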
Other workarounds for problematic collections:
@galvanoid commented on GitHub (Jan 2, 2026):
@rgaricano
Thanks a lot for the detailed analysis and for taking the time to explain what changed internally after 0.6.25.
This matches very closely what I’m observing in practice, especially the long idle gap before the first reranker call and the fact that performance degrades even when BM25 is explicitly set to 0. The explanation about sequential collection fetching and BM25 still being initialized despite having zero weight makes a lot of sense in light of the timings I’m seeing.
In particular, the “early BM25 bypass” you describe aligns exactly with the behavior I was implicitly relying on in 0.6.25. When BM25 is set to 0 and text enrichment is disabled, having it act as a true disable would restore the expected vector + reranking workflow and avoid the overhead entirely.
Hopefully this can make its way into future versions, as it would allow benefiting from the newer features while keeping the performance characteristics that made 0.6.25 usable in real-world RAG workflows.
Thanks again for the insights and the concrete suggestions.
@silentoplayz commented on GitHub (Jan 3, 2026):
Hi @galvanoid! 👋
I've created a PR to address the performance regression you reported: #20342
What's Fixed
The PR implements two optimizations to restore v0.6.25 performance levels:
Uses `asyncio.gather` to fetch multiple collections concurrently instead of sequentially, eliminating the N-1 sequential wait bottleneck.
Testing Request
Since you have the perfect test environment with your 10k and 160k file collections, could you help verify this fix?
Expected improvements:
The changes are conservative and backward-compatible.
Please let me know if this resolves the latency issues you experienced! 🚀
@galvanoid commented on GitHub (Jan 4, 2026):
Thanks a lot for the PR!
I ran benchmarks comparing v0.6.25 against the current PR, focusing only on the reranking phase, using an external reranker instrumented specifically for timing analysis.
Methodology
Instead of relying on internal timing or UI-level measurements, I used Open WebUI's built-in support for external rerankers to attach a custom reranker service implemented with FastAPI (a minimal sketch of such a service follows these notes).
This external reranker acts as a capture layer, allowing precise measurement of:
Time from CID assignment to first batch arrival
Time between individual batches
Total time from CID assignment to the final batch delivery
This approach ensures:
No changes to OWUI core logic
Identical batch sizes and batch counts
Accurate, server-side timestamps for all reranking events
The same setup was used for both versions (v0.6.25 and PR).
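A minimal sketch of such a capture reranker, assuming a Jina-style `/rerank` contract (query and documents in, scored indices out); the exact payload Open WebUI sends may differ, and this is illustrative rather than the actual service used for these benchmarks:

```python
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
batch_arrivals: list[float] = []  # server-side timestamps, one per batch

class RerankRequest(BaseModel):
    query: str
    documents: list[str]
    model: str | None = None  # assumed optional field

@app.post("/rerank")
def rerank(req: RerankRequest) -> dict:
    batch_arrivals.append(time.monotonic())  # record when each batch arrives
    # Neutral scoring that preserves input order, so the measurements
    # reflect orchestration timing rather than model quality.
    n = max(len(req.documents), 1)
    results = [
        {"index": i, "relevance_score": 1.0 - i / n}
        for i in range(len(req.documents))
    ]
    return {"results": results}
```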
Two collection sizes were tested:
~10k documents (7 batches)
~160k documents (17 batches)
Each scenario was executed twice to account for run-to-run variability.
Results
~10k document collection
v0.6.25
Time to first batch: ~3.0–3.4 s
Total rerank time: ~6.6–7.4 s
Mean batch interval: ~0.6–0.65 s
PR
Time to first batch: ~5.7–6.1 s
Total rerank time: ~28.2–28.8 s
Mean batch interval: ~3.7–3.9 s
--> For this collection size, v0.6.25 is consistently ~4× faster than the PR.
~160k document collection
v0.6.25
Time to first batch: ~2.9–4.3 s
Total rerank time: ~12.7–28.5 s
Mean batch interval: ~0.6–1.4 s
PR
Time to first batch: ~28–31 s
Total rerank time: ~227–231 s
Mean batch interval: ~12.4 s (median ~6 s)
--> For large collections, v0.6.25 is between ~8× and ~18× faster, with an absolute delta of ~200 seconds.
Key observations
Batch counts are identical across versions, ruling out:
Collection size effects
Reranker model performance differences
The regression is dominated by:
Much later first batch delivery
Significantly larger inter-batch gaps
This points to overhead in batch orchestration / hybrid search iteration / scheduling, not in the reranking model itself.
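For reference, the interval statistics reported in the tables below can be derived from captured timestamps along these lines (a sketch assuming the hypothetical `batch_arrivals` list from the capture service above, plus a `t_cid` marker taken at CID assignment):

```python
import statistics

def batch_stats(batch_arrivals: list[float], t_cid: float) -> dict:
    # batch_arrivals: monotonic timestamps of batch deliveries for one run.
    # t_cid: timestamp at CID assignment (hypothetical reference point).
    deltas = [b - a for a, b in zip(batch_arrivals, batch_arrivals[1:])]
    return {
        "t_first_batch_s": batch_arrivals[0] - t_cid,
        "t_total_s": batch_arrivals[-1] - t_cid,
        "batch_mean_dt_s": statistics.mean(deltas),
        "batch_median_dt_s": statistics.median(deltas),
    }
```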
Conclusion
Although the PR improves over some intermediate versions, it remains substantially slower than v0.6.25, especially for large collections where reranking latency reaches multiple minutes.
From a user perspective, v0.6.25 currently offers significantly better latency and responsiveness for reranking-heavy workloads.
PS: The reason why the 10k collection produces 7 batches and the 160k collection produces 17 batches is that only a single reranker pass was active instead of three passes.
In the PR version, even when three reranker passes were enabled in the UI (Retrieval / Generation Query settings), the system was effectively executing only one pass. For consistency and fairness, I therefore disabled multi-pass reranking in both versions, ensuring that PR and v0.6.25 generated exactly the same number of batches.
It is also worth noting that v0.6.25 remains faster even when three reranker passes are enabled, which further reinforces the performance gap observed in these benchmarks.
Found 8 rerank files
=== RERANK – PER RUN ===
| version | collection | run | cid | batches | t_first_batch_s | t_total_s | batch_mean_dt_s | batch_median_dt_s |
|---|---|---|---|---|---|---|---|---|
| 0625 | 10k | 1 | 2bdc9326e94c4eccae28d096a79270ec | 7 | 2.961792 | 6.644569 | 0.613211 | 0.609952 |
| pr | 10k | 1 | 0204ee6615564f168264594946c4a08d | 7 | 6.082243 | 28.839900 | 3.792290 | 3.889942 |
| 0625 | 10k | 2 | 029c033fb4514d0e89b45f1f824e7866 | 7 | 3.419669 | 7.363968 | 0.656794 | 0.645497 |
| pr | 10k | 2 | e032ce542db94fc8afc06983162cd0f5 | 7 | 5.729482 | 28.260245 | 3.754476 | 4.236242 |
| 0625 | 160k | 1 | ae59f006262740868c5a7358b1dee3dd | 17 | 4.302075 | 28.451026 | 1.474028 | 1.351415 |
| pr | 160k | 1 | e747982c45c34ef88b4b5a79f1a077b9 | 17 | 28.176346 | 227.034976 | 12.393156 | 6.091793 |
| 0625 | 160k | 2 | b873c3a76ced4c168ac221d021b8f108 | 17 | 2.913588 | 12.716752 | 0.612483 | 0.606880 |
| pr | 160k | 2 | 711af870efa240aa98f6ba29433baee0 | 17 | 31.317244 | 230.785459 | 12.466538 | 6.468858 |
=== RERANK – PR vs 0.6.25 ===
| collection | run | batch_mean_dt_s (0625) | batch_mean_dt_s (pr) | batches (0625) | batches (pr) | t_first_batch_s (0625) | t_first_batch_s (pr) | t_total_s (0625) | t_total_s (pr) | delta_total_s (pr - 0625) | speedup (0625 over pr) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10k | 1 | 0.613211 | 3.792290 | 7 | 7 | 2.961792 | 6.082243 | 6.644569 | 28.839900 | 22.195331 | 4.340372 |
| 10k | 2 | 0.656794 | 3.754476 | 7 | 7 | 3.419669 | 5.729482 | 7.363968 | 28.260245 | 20.896277 | 3.837638 |
| 160k | 1 | 1.474028 | 12.393156 | 17 | 17 | 4.302075 | 28.176346 | 28.451026 | 227.034976 | 198.583950 | 7.979852 |
| 160k | 2 | 0.612483 | 12.466538 | 17 | 17 | 2.913588 | 31.317244 | 12.716752 | 230.785459 | 218.068707 | 18.148145 |
Saved:
@silentoplayz commented on GitHub (Jan 4, 2026):
Hi again @galvanoid! Thanks for the detailed benchmarks; they were incredibly helpful in pinpointing the bottleneck.
I've pushed a significant update to the PR branch that seeks to directly address your findings:
The update removes the `asyncio` overhead you identified. These changes are live on the PR branch. Could you give it another spin with your benchmark setup?
@galvanoid commented on GitHub (Jan 4, 2026):
I can reproduce a regression on the updated PR branch.
In vector-only mode (BM25 OFF + enrichment OFF), retrieval starts but immediately logs `query_doc_with_hybrid_search:no_docs` multiple times.
Importantly, there are no Qdrant calls at all in the OWUI logs (`points/query` / `points/scroll` never appear), so the pipeline is ending before it reaches the vector DB.
My Qdrant config is correct (QDRANT_URI=http://127.0.0.1:6333, multitenancy enabled) and Qdrant is healthy/accessible from inside the OWUI container.
This suggests the new “conditional fetch skip” path may be skipping the step that populates doc/chunk mappings, causing query_doc_with_hybrid_search to run with empty docs and return no_docs without querying Qdrant.
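As an illustration of the suspected failure mode (purely hypothetical code, not OWUI's actual implementation): if the fetch step is skipped, the doc mapping stays empty and the hybrid path bails out before any vector-DB call:

```python
def query_doc_with_hybrid_search(query: str, docs: list) -> str:
    # Hypothetical sketch: with empty docs the function returns early,
    # which would explain "no_docs" log lines with no Qdrant traffic.
    if not docs:
        return "no_docs"  # logged before any points/query call is made
    return f"hybrid search over {len(docs)} docs for {query!r}"

doc_cache: dict[str, list] = {}  # the skipped fetch never populated this
print(query_doc_with_hybrid_search("test", doc_cache.get("my-collection", [])))
```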
@silentoplayz commented on GitHub (Jan 5, 2026):
Thanks for catching that @galvanoid! I pushed a fix (commit 895276a08).
What happened: the "conditional skip" optimization was too aggressive. It skipped fetching even though `query_doc_with_hybrid_search` needs the collection data to validate document existence before proceeding.
The fix: we now always fetch the collection data (as required), but I kept the sync fetch optimization for single collections.
Could you try again? Vector-only mode should work correctly now. 🙏
@galvanoid commented on GitHub (Jan 5, 2026):
Hi! Quick update after pulling/rebuilding again.
Now the behavior is even more “silent” in my setup:
With Hybrid Search ON, sending a query produces no retrieval activity in the backend logs at all (no "Starting hybrid search", no `no_docs`, no Qdrant calls, no reranker calls).
With Hybrid Search OFF, the same query immediately produces the expected retrieval/Qdrant activity in logs.
So it looks like, in the current PR state, the “hybrid” code path is not being entered (or it’s short-circuiting before any of the retrieval logging happens).