[PR #21521] [CLOSED] fix: prevent double chunking after markdown header splitting #49166

Closed
opened 2026-04-30 01:29:04 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21521
Author: @Baireinhold
Created: 2/17/2026
Status: Closed

Base: main ← Head: fix/markdown-double-chunking


📝 Commits (1)

  • b8155bc (https://github.com/open-webui/open-webui/commit/b8155bc017f6f179d4c9cfd75bb7ed7b8d444ca1) fix: prevent double chunking after markdown header splitting

📊 Changes

1 file changed (+7 additions, -3 deletions)


📝 backend/open_webui/routers/retrieval.py (+7 -3)

📄 Description

Summary

Prevents the character/token splitter from re-splitting chunks that MarkdownHeaderTextSplitter has already produced.

Relates to #21486 (Bug 2).

Problem

When ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER is enabled, the markdown splitter produces semantically bounded chunks at header boundaries. However, execution falls through unconditionally into the TEXT_SPLITTER branch, which re-splits these chunks using RecursiveCharacterTextSplitter or TokenTextSplitter.

This destroys the semantic boundaries established by the header splitter and produces unpredictable fragment sizes.

Change

Add a markdown_split_done flag (initialized to False) that is set to True after markdown splitting completes. The character/token splitter branches check this flag and skip processing when markdown splitting was already applied.
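
The guard described above can be sketched as follows. This is a minimal illustration, not the actual code in retrieval.py: `split_on_markdown_headers`, `split_by_size`, and `chunk_document` are hypothetical stand-ins written for this example, and only the `markdown_split_done` flag name comes from the PR description.

```python
import re
from typing import List


def split_on_markdown_headers(text: str) -> List[str]:
    """Hypothetical stand-in for the markdown header splitter: one chunk per section."""
    # Split just before every line that starts with 1-6 '#' characters and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


def split_by_size(chunks: List[str], chunk_size: int) -> List[str]:
    """Hypothetical stand-in for the character/token splitter: fixed-size pieces."""
    out: List[str] = []
    for chunk in chunks:
        out.extend(chunk[i:i + chunk_size] for i in range(0, len(chunk), chunk_size))
    return out


def chunk_document(
    text: str,
    enable_markdown_header_splitter: bool,
    text_splitter: str,
    chunk_size: int,
) -> List[str]:
    chunks = [text]
    markdown_split_done = False  # flag name taken from the PR description

    if enable_markdown_header_splitter:
        chunks = split_on_markdown_headers(text)
        markdown_split_done = True

    # The size-based branch is skipped when header splitting already ran.
    if text_splitter in ("character", "token") and not markdown_split_done:
        chunks = split_by_size(chunks, chunk_size)

    return chunks
```

With header splitting enabled, a two-section document yields two chunks regardless of `chunk_size`; with it disabled, the size-based splitter runs exactly as before.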

Impact

  • Markdown-split chunks retain their header-based semantic boundaries
  • No behavioral change when ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER is disabled (flag stays False, splitters run as before)
  • No new config keys or dependencies

Testing

  1. Enable markdown header splitting, set TEXT_SPLITTER=character, CHUNK_SIZE=1500
  2. Upload a markdown file with sections of varying length
  3. Inspect chunks

Before: chunks are re-split at arbitrary 1500-char boundaries, ignoring header structure
After: chunks follow header boundaries; only the markdown splitter determines chunk boundaries
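
For step 3, one way to make "inspect chunks" concrete is a small check that flags chunks not starting at a header. This is only a sketch: `chunks_off_header_boundary` is a hypothetical helper, and it assumes header lines are kept in the chunk text (if the splitter moves headers into metadata, the check would need to look there instead). A document with text before its first header will legitimately report chunk 0.

```python
import re
from typing import List

# A markdown ATX header: 1-6 '#' characters followed by a space.
HEADER_RE = re.compile(r"#{1,6} ")


def chunks_off_header_boundary(chunks: List[str]) -> List[int]:
    """Return the indexes of chunks whose text does not begin with a header line."""
    return [i for i, chunk in enumerate(chunks) if not HEADER_RE.match(chunk.lstrip())]


# Illustrative data only, not output captured from Open WebUI.
before = ["# Intro\naaaa", "aaaa", "## Details\nbb"]  # middle chunk was cut mid-section
after = ["# Intro\naaaaaaaa", "## Details\nbb"]       # every chunk starts at a header

print(chunks_off_header_boundary(before))
print(chunks_off_header_boundary(after))
```

An empty result means every chunk begins at a header, which is the "After" behavior described above.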


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-30 01:29:04 -05:00
Reference: github-starred/open-webui#49166