[PR #21524] [CLOSED] fix: prevent double chunking after markdown header splitting #26121

Closed
opened 2026-04-20 06:20:41 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21524
Author: @Baireinhold
Created: 2/17/2026
Status: Closed

Base: dev ← Head: fix/markdown-double-chunking


📝 Commits (1)

  • d7dc58d fix: prevent double chunking after markdown header splitting

📊 Changes

1 file changed (+7 additions, -3 deletions)


📝 backend/open_webui/routers/retrieval.py (+7 -3)

📄 Description

Pull Request Checklist

  • Target branch: dev
  • Description: Provided below
  • Changelog: Provided below
  • Documentation: No user-facing behavior change, no docs needed
  • Dependencies: None
  • Testing: Manually tested with markdown files — confirmed chunks follow header boundaries instead of being re-split at arbitrary character limits
  • Agentic AI Code: This fix was identified through manual code review and tested by the author
  • Code review: Self-reviewed
  • Design & Architecture: Minimal flag-based guard, no new settings
  • Git Hygiene: Single atomic commit
  • Title Prefix: fix

Changelog Entry

Description

  • Prevent character/token splitter from re-splitting chunks already produced by MarkdownHeaderTextSplitter

Fixed

  • When ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER is enabled, markdown-split chunks previously fell through unconditionally into the TEXT_SPLITTER branch, which re-split them via RecursiveCharacterTextSplitter or TokenTextSplitter and destroyed the semantic boundaries established by header splitting. A markdown_split_done flag now skips the secondary splitter when markdown splitting has already been applied.
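The guard described above can be illustrated with a minimal, standard-library-only sketch. This is not the code from backend/open_webui/routers/retrieval.py: `split_by_headers`, `split_by_chars`, and `chunk_document` are hypothetical stand-ins for the real splitters and the surrounding function, and only the `markdown_split_done` flag name comes from this PR.

```python
import re

# Hypothetical sample document used only for illustration.
DOC = (
    "# Intro\n"
    "Some introductory text that is fairly long.\n"
    "## Usage\n"
    "Run the tool with default options.\n"
)


def split_by_headers(text: str) -> list[str]:
    """Toy stand-in for a markdown header splitter: one chunk per header section."""
    # Split just before every line that starts with 1-6 '#' followed by a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


def split_by_chars(chunks: list[str], chunk_size: int) -> list[str]:
    """Toy stand-in for a character splitter: fixed-width slices, no overlap."""
    out: list[str] = []
    for chunk in chunks:
        out.extend(chunk[i:i + chunk_size] for i in range(0, len(chunk), chunk_size))
    return out


def chunk_document(
    text: str,
    enable_markdown_header_splitter: bool,
    chunk_size: int = 40,
) -> list[str]:
    """Apply the header splitter OR the character splitter, never both."""
    chunks = [text]
    markdown_split_done = False  # stays False when the header splitter is disabled

    if enable_markdown_header_splitter:
        chunks = split_by_headers(text)
        markdown_split_done = True

    # The guard: skip the secondary splitter if header splitting already ran.
    if not markdown_split_done:
        chunks = split_by_chars(chunks, chunk_size)

    return chunks
```

Calling `chunk_document(DOC, True)` keeps one chunk per header section, while `chunk_document(DOC, False)` falls back to fixed-width slicing as before. One consequence of a guard like this, at least in the sketch, is that a header section longer than `chunk_size` is left whole rather than being split further.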

Additional Information

  • Relates to #21486 (Bug 2)
  • 7 lines added, 3 lines removed in backend/open_webui/routers/retrieval.py
  • No behavioral change when ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER is disabled (flag stays False, splitters run as before)

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 06:20:41 -05:00
Reference: github-starred/open-webui#26121