[PR #21488] enh: add backward merge for undersized chunks in merge_docs_to_target_size #49149

opened 2026-04-30 01:27:56 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21488
Author: @Classic298
Created: 2/16/2026
Status: 🔄 Open

Base: dev ← Head: claude/improve-issue-efficiency-hB7AF


📝 Commits (6)

  • d05301d fix: add backward merge for undersized chunks in merge_docs_to_target_size
  • 03dc985 test: add test document generator for Bug 3 backward merge validation
  • 28ffa67 Delete test_merge_bug3.md
  • f8364b8 Delete generate_test_doc.py
  • 777fa6c chore: convert merge_docs_to_target_size to async
  • fc722a6 fix: revert merge_docs_to_target_size to sync

📊 Changes

1 file changed (+48 additions, -10 deletions)


📝 backend/open_webui/routers/retrieval.py (+48 -10)

📄 Description

When a tiny chunk (e.g. an isolated heading line) sits between two large chunks, the forward-only merge strategy cannot absorb it: the preceding chunk is already above min_chunk_size_target, and the following chunk is too large for the combined size to fit within max_chunk_size. This leaves the tiny fragment as a standalone chunk.

Add a backward merge pass: before emitting an undersized chunk that failed to merge forward, attempt to append it to the previously emitted chunk (respecting source/file boundaries and max size). This also handles the case where the last chunk in the sequence is undersized.
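
The behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in `backend/open_webui/routers/retrieval.py`: the `Chunk` dataclass, the `merge_chunks` and `_emit` names, the `sep` separator, and measuring size in characters are all assumptions made for the example. Only the parameter names `min_chunk_size_target` and `max_chunk_size` come from the description.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical stand-in for a document chunk: its text and the file it came from.
    text: str
    source: str


def _emit(merged, chunk, min_chunk_size_target, max_chunk_size, sep):
    """Append `chunk` to `merged`, trying a backward merge first if it is undersized."""
    if merged and len(chunk.text) < min_chunk_size_target:
        prev = merged[-1]
        fits = len(prev.text) + len(sep) + len(chunk.text) <= max_chunk_size
        # Backward merge: only within the same source, and only if the result fits.
        if prev.source == chunk.source and fits:
            merged[-1] = Chunk(prev.text + sep + chunk.text, prev.source)
            return
    merged.append(chunk)


def merge_chunks(chunks, min_chunk_size_target, max_chunk_size, sep="\n\n"):
    """Forward-merge undersized chunks, falling back to a backward merge on emit."""
    merged = []
    current = None
    for chunk in chunks:
        if current is None:
            current = Chunk(chunk.text, chunk.source)
            continue
        combined = len(current.text) + len(sep) + len(chunk.text)
        can_forward = (
            len(current.text) < min_chunk_size_target
            and current.source == chunk.source
            and combined <= max_chunk_size
        )
        if can_forward:
            # Forward merge: grow the undersized chunk with its successor.
            current = Chunk(current.text + sep + chunk.text, current.source)
            continue
        _emit(merged, current, min_chunk_size_target, max_chunk_size, sep)
        current = Chunk(chunk.text, chunk.source)
    if current is not None:
        # The final chunk goes through the same path, so an undersized tail is covered.
        _emit(merged, current, min_chunk_size_target, max_chunk_size, sep)
    return merged


# A short heading between two large chunks from the same file.
docs = [
    Chunk("a" * 900, "doc.md"),
    Chunk("## Title", "doc.md"),
    Chunk("b" * 995, "doc.md"),
]
result = merge_chunks(docs, min_chunk_size_target=200, max_chunk_size=1000)
print([len(c.text) for c in result])  # → [910, 995]
```

In this example the heading cannot merge forward (8 + 2 + 995 exceeds 1000), so it is appended to the previously emitted chunk instead of being left as a standalone fragment.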

Addresses #21486

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-30 01:27:56 -05:00

Reference: github-starred/open-webui#49149