mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[PR #21521] [CLOSED] fix: prevent double chunking after markdown header splitting #26118
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/21521
Author: @Baireinhold
Created: 2/17/2026
Status: ❌ Closed
Base: main ← Head: fix/markdown-double-chunking
📝 Commits (1)
b8155bc fix: prevent double chunking after markdown header splitting
📊 Changes
1 file changed (+7 additions, -3 deletions)
📝 backend/open_webui/routers/retrieval.py (+7 -3)
📄 Description
Summary
Prevents the character/token splitter from re-splitting chunks already produced by `MarkdownHeaderTextSplitter`.
Relates to #21486 (Bug 2).
Problem
When `ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER` is enabled, the markdown splitter produces semantically bounded chunks at header boundaries. However, execution falls through unconditionally into the `TEXT_SPLITTER` branch, which re-splits these chunks using `RecursiveCharacterTextSplitter` or `TokenTextSplitter`.
This destroys the semantic boundaries established by the header splitter and produces unpredictable fragment sizes.
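The effect can be illustrated with a toy example. This is only a sketch: `split_by_size` is a hypothetical fixed-size stand-in for `RecursiveCharacterTextSplitter`/`TokenTextSplitter`, and `header_chunks` is made-up data standing in for the header splitter's output, not anything taken from `retrieval.py`.

```python
# Hypothetical stand-in for a size-based splitter. The real code path uses
# RecursiveCharacterTextSplitter or TokenTextSplitter, which are not
# reproduced here.
def split_by_size(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Made-up output of a header splitter: one chunk per markdown section.
header_chunks = [
    "Install\n" + "a" * 40,  # 48 characters
    "Usage\n" + "b" * 25,    # 31 characters
]

# Unconditional fall-through: every header chunk is split again by size,
# so section boundaries no longer determine the final chunks.
resplit = [
    piece
    for chunk in header_chunks
    for piece in split_by_size(chunk, chunk_size=20)
]

print(len(header_chunks), len(resplit))  # prints: 2 5
```

The two section-aligned chunks become five fragments whose boundaries depend only on the size limit.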
Change
Add a `markdown_split_done` flag (initialized to `False`) that is set to `True` after markdown splitting completes. The character/token splitter branches check this flag and skip processing when markdown splitting was already applied.
Impact
No behavior change when `ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER` is disabled (flag stays `False`, splitters run as before)
Testing
`TEXT_SPLITTER=character`, `CHUNK_SIZE=1500`
Before: chunks are re-split at arbitrary 1500-char boundaries, ignoring header structure
After: chunks follow header boundaries; only the markdown splitter determines chunk boundaries
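The flag-based control flow described under Change might look roughly like the sketch below. Only the name `markdown_split_done` comes from the description; `chunk_documents`, its parameters, and the toy splitters `by_header` and `by_size` are hypothetical stand-ins, and the actual diff in `retrieval.py` may be structured differently.

```python
from typing import Callable

Splitter = Callable[[list[str]], list[str]]


def chunk_documents(
    docs: list[str],
    enable_markdown_header_splitter: bool,  # hypothetical parameter
    markdown_splitter: Splitter,            # hypothetical parameter
    text_splitter: Splitter,                # hypothetical parameter
) -> list[str]:
    markdown_split_done = False  # flag name taken from the PR description

    if enable_markdown_header_splitter:
        docs = markdown_splitter(docs)
        markdown_split_done = True

    # The character/token splitter runs only if markdown splitting did not.
    if not markdown_split_done:
        docs = text_splitter(docs)

    return docs


def by_header(docs: list[str]) -> list[str]:
    """Toy header splitter: start a new chunk at each line beginning with '#'."""
    chunks: list[str] = []
    for doc in docs:
        current: list[str] = []
        for line in doc.splitlines():
            if line.startswith("#") and current:
                chunks.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current))
    return chunks


def by_size(docs: list[str]) -> list[str]:
    """Toy size-based splitter: fixed 8-character windows."""
    return [d[i:i + 8] for d in docs for i in range(0, len(d), 8)]


doc = "# A\nalpha text\n# B\nbeta"
with_markdown = chunk_documents([doc], True, by_header, by_size)
without_markdown = chunk_documents([doc], False, by_header, by_size)
```

With the flag set, `with_markdown` keeps the two header-aligned chunks; with the header splitter disabled, `without_markdown` is produced by the size-based splitter alone, as before.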
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.