[GH-ISSUE #21486] issue: [Bug] MarkdownHeaderTextSplitter loses header metadata and causes double-chunking #35027

Closed
opened 2026-04-25 09:14:15 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @Baireinhold on GitHub (Feb 16, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/21486

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.8.1

Ollama Version (if applicable)

No response

Operating System

Windows 11

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER is enabled, the markdown header splitter should:

  1. Preserve header hierarchy metadata (H1/H2/H3 etc.) from MarkdownHeaderTextSplitter in each chunk's metadata
  2. Not re-split already well-formed markdown chunks through the character/token splitter
  3. Properly merge tiny fragments (e.g. isolated heading lines) into adjacent chunks

Actual Behavior

Three related bugs in backend/open_webui/routers/retrieval.py lines 1476-1529:

Bug 1 — Header metadata discarded (line 1498)
metadata={**doc.metadata} only copies the parent document's metadata, discarding split_chunk.metadata which contains the header hierarchy info (e.g. {"Header 1": "Chapter 3", "Header 2": "Section 3.1"}).

Bug 2 — Double chunking (lines 1506→1510-1529)
After markdown splitting completes at line 1506, execution falls through unconditionally into the TEXT_SPLITTER branch (line 1510+). This re-splits the already well-formed markdown chunks using RecursiveCharacterTextSplitter or TokenTextSplitter, producing unpredictable fragment sizes and destroying the semantic boundaries established by the header splitter.

Bug 3 — Insufficient fragment merging (lines 1507-1508)
merge_docs_to_target_size only merges forward. Isolated heading lines (e.g. a line containing only ## Chapter 3\n) remain as standalone tiny chunks (< 50 characters), degrading retrieval quality.

Steps to Reproduce

  1. Fresh install Open WebUI v0.8.1 via pip on Windows 11, Python 3.11
  2. Enable markdown header splitting: Admin Panel → Settings → Documents → toggle "Markdown Header Text Splitter" ON
  3. Set TEXT_SPLITTER to "character" (default), CHUNK_SIZE=1500, CHUNK_OVERLAP=100
  4. Create a knowledge base, upload a markdown file with nested headers:
```md
# Chapter 1: Introduction
Some introductory text here spanning multiple sentences.

## 1.1 Background
Background content that is reasonably long.

### 1.1.1 Historical Context
Detailed historical context paragraph.

## 1.2 Motivation
Short motivation.

# Chapter 2: Methods
## 2.1 Approach
Approach details here.
```

5. After processing, inspect the chunks via ChromaDB or the API.

**Expected:** Each chunk carries metadata like `{"Header 1": "Chapter 1: Introduction", "Header 2": "1.1 Background"}`, chunks are not re-split by the character splitter, no tiny fragments exist.

**Actual:**

- All chunks have identical metadata (only the parent doc metadata, no header info)
- Chunks are re-split by RecursiveCharacterTextSplitter after markdown splitting, producing fragments that break mid-sentence
- Isolated heading lines like "## 1.2 Motivation" appear as standalone ~20-character chunks

### Logs & Screenshots

**Bug 1 evidence** — retrieval.py line 1498:
```python
# Current code:
Document(
    page_content=split_chunk.page_content,
    metadata={**doc.metadata},  # ← BUG: split_chunk.metadata is discarded
)

# Fix:
Document(
    page_content=split_chunk.page_content,
    metadata={**doc.metadata, **split_chunk.metadata},  # merge header info
)
```

**Bug 2 evidence** — retrieval.py lines 1506→1510:

```python
docs = split_docs  # line 1506: markdown splitting done
# ... merge_docs ...  # lines 1507-1508
if request.app.state.config.TEXT_SPLITTER in ["", "character"]:  # line 1510
    # ← BUG: no guard! falls through unconditionally
    docs = text_splitter.split_documents(docs)  # line 1516: re-splits everything
```

**Bug 3 evidence:** With a 50-page academic document (mixed H1-H4 headers), markdown splitting produces roughly 23% of chunks under 50 characters. These are mostly isolated heading lines that `merge_docs_to_target_size` fails to merge because it only merges forward.
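The Bug 1 fix relies on Python's dict-unpacking merge, where keys from the later dict win on collision. A minimal standalone sketch (with hypothetical metadata values) illustrating the difference:

```python
# Minimal illustration of the Bug 1 fix: dict unpacking merges metadata,
# with keys from the second dict taking precedence on collision.
doc_metadata = {"source": "manual.md", "file_id": "abc123"}            # parent doc metadata (hypothetical)
split_metadata = {"Header 1": "Chapter 3", "Header 2": "Section 3.1"}  # from MarkdownHeaderTextSplitter

buggy = {**doc_metadata}                    # header info lost
fixed = {**doc_metadata, **split_metadata}  # header info preserved

print("Header 1" in buggy)  # False
print(fixed["Header 1"])    # Chapter 3
```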

Additional Information

I have a working fix for all three bugs that I can submit as a PR if maintainers are interested:

  1. Merge split_chunk.metadata into output Document metadata
  2. Add a markdown_split_done metadata flag and guard the TEXT_SPLITTER branch with if not any(doc.metadata.get("markdown_split_done") for doc in docs):
  3. Implement bidirectional tiny chunk merging with a configurable threshold

The fix is backward-compatible — no new config keys needed, existing ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER and CHUNK_MIN_SIZE_TARGET settings are preserved.
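A minimal sketch of how fix 2 could work, using the proposed `markdown_split_done` flag. This is the reporter's idea only, not actual Open WebUI code; `Document` is a minimal stand-in for langchain's class:

```python
# Sketch of the proposed Bug 2 guard: chunks produced by the markdown
# header splitter carry a markdown_split_done flag, and the size-based
# splitter is skipped if any chunk has it.
class Document:
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}

def mark_markdown_chunks(chunks):
    # Tag each chunk emitted by the markdown header splitting pass.
    for chunk in chunks:
        chunk.metadata["markdown_split_done"] = True
    return chunks

def maybe_text_split(docs, splitter):
    # Guard: run the character/token splitter only when no chunk was
    # already produced by the markdown header splitter.
    if not any(d.metadata.get("markdown_split_done") for d in docs):
        return splitter(docs)
    return docs
```

Note that skipping the second pass entirely removes size guarantees, so a production version would likely still need to re-split chunks that exceed CHUNK_SIZE.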

GiteaMirror added the bug label 2026-04-25 09:14:15 -05:00
Author
Owner

@Classic298 commented on GitHub (Feb 16, 2026):

How does bug 3 happen?

What is your min chunk size merging value?

Author
Owner

@Baireinhold commented on GitHub (Feb 16, 2026):

> How does Bug 3 happen?
>
> What is your minimum chunk size merging value?

How Bug 3 happens:

The merge condition at line 1398 only checks whether the current chunk is below CHUNK_MIN_SIZE_TARGET:

```python
can_merge = (
    can_merge_chunks(current_chunk, next_chunk)
    and measure_chunk_size(current_content) < min_chunk_size_target  # ← only checks current
    and measure_chunk_size(proposed_content) <= max_chunk_size
)
```
This means a tiny fragment following a large chunk is never absorbed. Concrete example with CHUNK_SIZE=2048, CHUNK_MIN_SIZE_TARGET=1024:

Chunk A: "## 1.1 Background\n\nLong background..." (1500 chars)
Chunk B: "## 1.2 Motivation\n\nShort line." (40 chars)
Chunk C: "# Chapter 2: Methods\n\nVery long content..." (1900 chars)
Forward merge trace:

current=A(1500), next=B(40) → 1500 < 1024 = false → emit A, current=B
current=B(40), next=C(1900) → 40 < 1024 = true, and 1942 <= 2048 = true → merge B+C
In this case B gets saved. But change C to 2020 chars:

current=B(40), next=C(2020) → 40 < 1024 = true, but 2062 <= 2048 = false → emit B as a 40-char fragment
B can't merge forward (C is too large) and there's no backward pass to merge B into A. With structured academic documents (mixed H1-H4 headers), this pattern is common — roughly 23% of chunks end up under 50 characters in my testing.

My CHUNK_MIN_SIZE_TARGET value:

The default is 0 (disabled, lines 2968-2971 in config.py). When it's 0, merge_docs_to_target_size returns immediately at line 1373, so no merging happens at all.

Even with a positive value like 1024, the forward-only strategy still misses fragments where the preceding chunk is already >= CHUNK_MIN_SIZE_TARGET.
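The trace above can be reproduced with a small standalone simulation of a forward-only merge pass. This is a sketch over chunk sizes only, not the actual merge_docs_to_target_size implementation:

```python
# Standalone simulation of a forward-only merge pass over chunk sizes.
# A chunk only absorbs its successor when *it* is below the minimum
# target AND the merged result still fits max_size; nothing merges backward.
def forward_merge(sizes, min_target=1024, max_size=2048):
    merged = []
    current = sizes[0]
    for nxt in sizes[1:]:
        if current < min_target and current + nxt <= max_size:
            current += nxt          # absorb next chunk forward
        else:
            merged.append(current)  # emit current, move on
            current = nxt
    merged.append(current)
    return merged

# A=1500, B=40, C=1900: B merges forward into C (40+1900=1940 fits)
print(forward_merge([1500, 40, 1900]))  # [1500, 1940]

# A=1500, B=40, C=2020: B+C=2060 exceeds 2048, so B is emitted as a
# standalone 40-char fragment; no backward pass folds it into A.
print(forward_merge([1500, 40, 2020]))  # [1500, 40, 2020]
```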

Note: Bug 3 is the least critical of the three. Bug 2 (no guard at line 1510, causing RecursiveCharacterTextSplitter to re-split all markdown chunks) makes the merge effort moot anyway. Fixing Bug 1 and Bug 2 would have the highest impact; Bug 3 could be addressed as a follow-up enhancement.

Author
Owner

@Classic298 commented on GitHub (Feb 16, 2026):

Ok. I understand.

So in simpler terms: when the follow-up chunk is too large to absorb a tiny chunk, merge the tiny chunk into the current chunk instead.

So if the current chunk is chunk A with length 1500,
chunk B is 50,
and chunk C is 2020,

then we do the following check while we are still processing chunk A:

while processing chunk A:
if chunk B is below the chunk min size target, AND
chunk B + C >= chunk size, AND
chunk A + B <= chunk size,

then merge chunk A with B,

then continue with chunk C, since chunk B is already handled.

Author
Owner

@Classic298 commented on GitHub (Feb 16, 2026):

if you decide to submit PRs, please do so atomically, i.e. one PR per bug.

Author
Owner

@Classic298 commented on GitHub (Feb 16, 2026):

Or better: look-back merging instead of look-ahead. This should be easier to implement.
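A look-back variant along these lines might look like the following sketch (a hypothetical helper for illustration, not the code from the PR that was eventually submitted):

```python
# Sketch of look-back (backward) merging: walk the chunks and fold any
# tiny fragment into its *previous* chunk whenever the combined size
# still fits max_size.
def backward_merge(sizes, min_target=1024, max_size=2048):
    merged = []
    for size in sizes:
        if (merged
                and size < min_target
                and merged[-1] + size <= max_size):
            merged[-1] += size  # absorb tiny fragment into previous chunk
        else:
            merged.append(size)
    return merged

# The orphan 40-char fragment from the forward-only trace now folds into A:
print(backward_merge([1500, 40, 2020]))  # [1540, 2020]
```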

Author
Owner

@Classic298 commented on GitHub (Feb 16, 2026):

@Baireinhold feel free to submit all three of your PRs, I did mine just for exploration

But you are totally right, "bug" 3 (though more an enhancement) is worth less if bug 1 and 2 still exist.

Author
Owner

@Classic298 commented on GitHub (Feb 21, 2026):

Hey @Baireinhold, thanks for the detailed write-up. Let me share my analysis of the three points:

Bug 2 (double chunking) — This is intended behavior.

The markdown header text splitter is designed as a semantic pre-processing pass, not a replacement for size-based chunking. It splits at header boundaries, but it does not enforce any maximum chunk size. A single section under one header could be tens of thousands of characters long.

The character/token splitter that runs afterward is the size enforcement step. Without it, you'd get chunks that exceed embedding model token limits and potentially blow past vector DB field constraints. For chunks that are already under CHUNK_SIZE, the second pass is a no-op — it leaves them untouched. It only re-splits chunks that are too large, which is necessary.
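The size-enforcement role described here can be sketched without pulling in langchain. A splitter that only touches oversized chunks is a no-op for chunks already under CHUNK_SIZE (a simplified stand-in for RecursiveCharacterTextSplitter, assuming character-based sizing and a naive fixed-width cut instead of separator-aware splitting):

```python
# Simplified stand-in for the second (size-enforcing) pass: chunks already
# under chunk_size pass through unchanged; only oversized chunks are cut.
def enforce_size(chunks, chunk_size=1500):
    out = []
    for text in chunks:
        if len(text) <= chunk_size:
            out.append(text)  # no-op for well-formed chunks
        else:
            # naive fixed-width cut; the real splitter prefers separators
            out.extend(text[i:i + chunk_size]
                       for i in range(0, len(text), chunk_size))
    return out

small = "## 1.1 Background\nShort section."
huge = "x" * 4000  # one header section far over the limit
result = enforce_size([small, huge])
print(len(result))  # 4: the small chunk untouched, the huge one cut into 3
```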

Skipping the second pass entirely (as your PR #21524 proposed) would remove all size guarantees.

Bug 1 (header metadata) — Not a bug.

The code intentionally sets strip_headers=False, which means the header text is already preserved inside the chunk content itself. Each chunk already starts with its header lines (e.g. "# Chapter 1\n## 1.1 Background\n...").

What your PR #21523 adds is duplicating that same information as key-value pairs in the metadata dictionary (e.g. "Header 1": "Chapter 1"). Nothing in the system reads, queries, or displays those metadata keys. The information is already in the embedded text. This would be dead weight in the vector DB.

"Bug" 3 (forward-only merge) — Legitimate enhancement.

This is the one valid point from the issue. The forward-only merge strategy can leave tiny orphan fragments when a small chunk sits between two large ones. I've already submitted PR #21488 to add backward merging to address this. This is an enhancement rather than a bug.


Either way, thanks for making me aware of the possible enhancement here. I hope PR #21488 gets merged soon.

Reference: github-starred/open-webui#35027