[GH-ISSUE #21486] issue: [Bug] MarkdownHeaderTextSplitter loses header metadata and causes double-chunking #58164
Originally created by @Baireinhold on GitHub (Feb 16, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/21486
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.8.1
Ollama Version (if applicable)
No response
Operating System
Windows 11
Browser (if applicable)
No response
Confirmation

Expected Behavior
When `ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER` is enabled, the markdown header splitter should preserve the header hierarchy extracted by `MarkdownHeaderTextSplitter` in each chunk's metadata.

Actual Behavior
Three related bugs in `backend/open_webui/routers/retrieval.py`, lines 1476-1529:

Bug 1 — Header metadata discarded (line 1498)
`metadata={**doc.metadata}` only copies the parent document's metadata, discarding `split_chunk.metadata`, which contains the header hierarchy info (e.g. `{"Header 1": "Chapter 3", "Header 2": "Section 3.1"}`).

Bug 2 — Double chunking (lines 1506→1510-1529)
After markdown splitting completes at line 1506, execution falls through unconditionally into the `TEXT_SPLITTER` branch (line 1510+). This re-splits the already well-formed markdown chunks using `RecursiveCharacterTextSplitter` or `TokenTextSplitter`, producing unpredictable fragment sizes and destroying the semantic boundaries established by the header splitter.

Bug 3 — Insufficient fragment merging (lines 1507-1508)
`merge_docs_to_target_size` only merges forward. Isolated heading lines (e.g. a line containing only `## Chapter 3\n`) remain as standalone tiny chunks (under 50 characters), degrading retrieval quality.

Steps to Reproduce
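Bug 1 can be demonstrated in isolation with a few lines of plain Python. This is a hypothetical minimal repro of the metadata merge described above; the variable names are illustrative, not the actual identifiers in retrieval.py:

```python
# Parent document metadata vs. the header metadata produced by the
# markdown splitter for one chunk (values taken from the example above).
doc_metadata = {"source": "paper.md"}
split_chunk_metadata = {"Header 1": "Chapter 3", "Header 2": "Section 3.1"}

buggy = {**doc_metadata}                          # current code: header keys discarded
fixed = {**doc_metadata, **split_chunk_metadata}  # proposed fix: merge both dicts

print("Header 1" in buggy)   # False
print("Header 1" in fixed)   # True
```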
Bug 2 evidence — retrieval.py lines 1506→1510:
Bug 3 evidence: With a 50-page academic document (mixed H1-H4 headers), markdown splitting produces ~23% of chunks under 50 characters. These are mostly isolated heading lines that `merge_docs_to_target_size` fails to merge because it only merges forward.

Additional Information
I have a working fix for all three bugs that I can submit as a PR if maintainers are interested:

- Merge `split_chunk.metadata` into the output Document metadata
- Add a `markdown_split_done` metadata flag and guard the TEXT_SPLITTER branch with `if not any(doc.metadata.get("markdown_split_done") for doc in docs):`

The fix is backward-compatible — no new config keys are needed, and the existing `ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER` and `CHUNK_MIN_SIZE_TARGET` settings are preserved.

@Classic298 commented on GitHub (Feb 16, 2026):
How does bug 3 happen?
What is your min chunk size merging value?
@Baireinhold commented on GitHub (Feb 16, 2026):
The merge condition at line 1398 only checks whether the current chunk is below `CHUNK_MIN_SIZE_TARGET`:

```python
can_merge = (
    can_merge_chunks(current_chunk, next_chunk)
    and measure_chunk_size(current_content) < min_chunk_size_target  # ← only checks current
    and measure_chunk_size(proposed_content) <= max_chunk_size
)
```
This means a tiny fragment following a large chunk is never absorbed. Concrete example with CHUNK_SIZE=2048, CHUNK_MIN_SIZE_TARGET=1024:

Chunk A: "## 1.1 Background\n\nLong background..." (1500 chars)
Chunk B: "## 1.2 Motivation\n\nShort line." (40 chars)
Chunk C: "# Chapter 2: Methods\n\nVery long content..." (1900 chars)
Forward merge trace:
current=A(1500), next=B(40) → 1500 < 1024 = false → emit A, current=B
current=B(40), next=C(1900) → 40 < 1024 = true, and 1942 <= 2048 = true → merge B+C
In this case B gets absorbed. But change C to 2020 chars:
current=B(40), next=C(2020) → 40 < 1024 = true, but 2062 <= 2048 = false → emit B as a 40-char fragment
B can't merge forward (C is too large) and there is no backward pass to merge B into A. With structured academic documents (mixed H1-H4 headers), this pattern is common — roughly 23% of chunks end up under 50 characters in my testing.
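The trace above can be reproduced with a simplified stand-in for the forward-only merge. This is not the actual `merge_docs_to_target_size` (which carries metadata and extra checks); the `"\n\n"` joiner and size-only chunks are assumptions for illustration:

```python
def forward_merge(chunks, min_target, max_size):
    """Greedy forward-only merge: absorb the next chunk only while the
    current chunk is still below min_target and the result fits max_size."""
    merged = []
    current = chunks[0]
    for nxt in chunks[1:]:
        proposed = current + "\n\n" + nxt
        if len(current) < min_target and len(proposed) <= max_size:
            current = proposed          # absorb next chunk into current
        else:
            merged.append(current)      # emit current, start from next chunk
            current = nxt
    merged.append(current)
    return merged

# Stand-ins for chunks A, B, C, matching only the sizes from the trace.
a, b, c = "x" * 1500, "y" * 40, "z" * 2020
out = forward_merge([a, b, c], min_target=1024, max_size=2048)
print([len(ch) for ch in out])   # [1500, 40, 2020] — B survives as a 40-char orphan
```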
My CHUNK_MIN_SIZE_TARGET value:
The default is 0 (disabled, lines 2968-2971 in config.py). When it is 0, `merge_docs_to_target_size` returns immediately at line 1373, so no merging happens at all.
Even with a positive value like 1024, the forward-only strategy still misses fragments whose preceding chunk is already >= CHUNK_MIN_SIZE_TARGET.
Note: Bug 3 is the least critical of the three. Bug 2 (no guard at line 1510, causing RecursiveCharacterTextSplitter to re-split all markdown chunks) makes the merge effort moot anyway. Fixing Bug 1 and Bug 2 would have the highest impact; Bug 3 could be addressed as a follow-up enhancement.
@Classic298 commented on GitHub (Feb 16, 2026):
Ok. I understand.
So in simpler terms: when the follow-up chunk can't merge forward, merge it backward into the current chunk instead.
Say current chunk A is 1500 chars, chunk B is 50, and chunk C is 2020 chars. Then we do the following check while we are still processing chunk A:

while processing chunk A:
    if chunk B is below CHUNK_MIN_SIZE_TARGET
    and chunk B + C would exceed CHUNK_SIZE
    and chunk A + B <= CHUNK_SIZE:
        merge chunk A with B
        continue with chunk C, since chunk B is already handled
@Classic298 commented on GitHub (Feb 16, 2026):
if you decide to submit PRs, please do so atomically, i.e. one PR per bug.
@Classic298 commented on GitHub (Feb 16, 2026):
Or better: look-back merging instead of look-ahead; this should be easier to implement.
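A look-back pass is indeed only a few lines. This is a simplified sketch of the idea, not the actual open-webui implementation; the `"\n\n"` joiner is an assumption:

```python
def backward_merge(chunks, min_target, max_size):
    """Look-back pass: fold a tiny chunk into the previously emitted chunk
    whenever the combined size still fits max_size."""
    merged = []
    for ch in chunks:
        if (merged
                and len(ch) < min_target
                and len(merged[-1]) + 2 + len(ch) <= max_size):
            merged[-1] = merged[-1] + "\n\n" + ch   # absorb into predecessor
        else:
            merged.append(ch)
    return merged

# The A/B/C example from above: the 40-char orphan is folded back into A.
chunks = ["x" * 1500, "y" * 40, "z" * 2020]
print([len(ch) for ch in backward_merge(chunks, 1024, 2048)])  # [1542, 2020]
```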
@Classic298 commented on GitHub (Feb 16, 2026):
@Baireinhold feel free to submit all three of your PRs, I did mine just for exploration
But you are totally right: "bug" 3 (though more of an enhancement) is worth less while bugs 1 and 2 still exist.
@Classic298 commented on GitHub (Feb 21, 2026):
Hey @Baireinhold, thanks for the detailed write-up. Let me share my analysis of the three points:
Bug 2 (double chunking) — This is intended behavior.
The markdown header text splitter is designed as a semantic pre-processing pass, not a replacement for size-based chunking. It splits at header boundaries, but it does not enforce any maximum chunk size. A single section under one header could be tens of thousands of characters long.
The character/token splitter that runs afterward is the size enforcement step. Without it, you'd get chunks that exceed embedding model token limits and potentially blow past vector DB field constraints. For chunks that are already under CHUNK_SIZE, the second pass is a no-op — it leaves them untouched. It only re-splits chunks that are too large, which is necessary.
Skipping the second pass entirely (as your PR #21524 proposed) would remove all size guarantees.
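The size-enforcement role described here can be sketched as a guard around the second pass. This is a simplified stand-in, not the retrieval.py code; `resplit` below is a naive fixed-width slicer standing in for `RecursiveCharacterTextSplitter`:

```python
def enforce_size(chunks, chunk_size, resplit):
    """Second pass as described above: chunks already within chunk_size
    pass through untouched; only oversized chunks are re-split."""
    out = []
    for ch in chunks:
        if len(ch) <= chunk_size:
            out.append(ch)            # no-op for well-sized markdown chunks
        else:
            out.extend(resplit(ch))   # size enforcement for oversized sections
    return out

# Naive stand-in re-splitter (the real splitter respects separators).
resplit = lambda s: [s[i:i + 2048] for i in range(0, len(s), 2048)]

chunks = ["a" * 500, "b" * 5000]   # one small section, one huge section
print([len(ch) for ch in enforce_size(chunks, 2048, resplit)])  # [500, 2048, 2048, 904]
```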
Bug 1 (header metadata) — Not a bug.
The code intentionally sets strip_headers=False, which means the header text is already preserved inside the chunk content itself. Each chunk already starts with its header lines (e.g. "# Chapter 1\n## 1.1 Background\n...").
What your PR #21523 adds is duplicating that same information as key-value pairs in the metadata dictionary (e.g. "Header 1": "Chapter 1"). Nothing in the system reads, queries, or displays those metadata keys. The information is already in the embedded text. This would be dead weight in the vector DB.
"Bug" 3 (forward-only merge) — Legitimate enhancement.
This is the one valid point from the issue. The forward-only merge strategy can leave tiny orphan fragments when a small chunk sits between two large ones. I've already submitted PR #21488 to add backward merging to address this. This is an enhancement rather than a bug.
Either way, thanks for making me aware of the possible enhancement here. I hope PR #21488 gets merged soon.