[PR #21520] [CLOSED] fix: preserve header metadata in markdown splitter #26117

Closed
opened 2026-04-20 06:20:27 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21520
Author: @Baireinhold
Created: 2/17/2026
Status: Closed

Base: main ← Head: fix/markdown-header-metadata


📝 Commits (1)

  • 91507ef fix: preserve header metadata in markdown splitter

📊 Changes

1 file changed (+1 additions, -1 deletions)

📝 backend/open_webui/routers/retrieval.py (+1 -1)

📄 Description

Summary

Fixes header metadata loss in MarkdownHeaderTextSplitter output.

Relates to #21486 (Bug 1).

Change

MarkdownHeaderTextSplitter.split_text() returns chunks with metadata containing the header hierarchy, e.g.:

{"Header 1": "Chapter 1", "Header 2": "1.1 Background"}
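To illustrate how metadata of this shape can arise, here is a minimal pure-Python sketch of header tracking. It is a simplified stand-in, not LangChain's implementation: the function name `split_by_headers`, the `max_level` parameter, and the rule that every level maps to a `Header N` key are assumptions made for illustration.

```python
import re


def split_by_headers(text, max_level=4):
    """Split markdown into (content, metadata) pairs, tracking the header path.

    Simplified stand-in: ignores code fences and setext-style headers.
    """
    header_re = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
    current = {}  # header level -> title for the active section path
    lines = []    # body lines accumulated for the current section
    chunks = []

    def flush():
        # Emit the accumulated body together with the active header path.
        body = "\n".join(lines).strip()
        if body:
            metadata = {
                f"Header {lvl}": title for lvl, title in sorted(current.items())
            }
            chunks.append((body, metadata))
        lines.clear()

    for line in text.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header invalidates headers at the same or deeper levels.
            for lvl in [l for l in current if l >= level]:
                del current[lvl]
            current[level] = match.group(2)
        else:
            lines.append(line)
    flush()
    return chunks


sample = "# Chapter 1\nIntro text.\n## 1.1 Background\nDetails.\n# Chapter 2\nMore."
chunks = split_by_headers(sample)
```

In this sketch the second chunk carries both `Header 1` and `Header 2`, while the chunk under `# Chapter 2` carries only `Header 1`, because starting a new top-level section clears the deeper levels.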

Currently, only the parent document's metadata is preserved:

metadata={**doc.metadata}  # split_chunk.metadata is discarded

This PR merges both:

metadata={**doc.metadata, **split_chunk.metadata}
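The difference between the two expressions can be shown in isolation. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for the real document objects, and the `source` and `name` keys are made-up examples of parent metadata; only the two dict expressions come from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document or chunk object with metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Parent document metadata (keys are assumptions for illustration).
doc = Doc("full text", {"source": "guide.md", "name": "guide.md"})
# A chunk as a header-aware splitter might return it.
split_chunk = Doc("Details.", {"Header 1": "Chapter 1", "Header 2": "1.1 Background"})

# Old behavior: only the parent's metadata survives.
before = {**doc.metadata}

# New behavior: parent metadata plus the chunk's header hierarchy.
after = {**doc.metadata, **split_chunk.metadata}

# With dict unpacking, the rightmost mapping wins on a key collision,
# so a chunk key would override a parent key of the same name.
collision = {**{"source": "guide.md"}, **{"source": "from-chunk"}}
```

Because later unpacking takes precedence, chunk-level values override parent values when keys overlap. A collision would only occur if the parent metadata already used a key such as `Header 1`.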

Impact

  • Chunks now carry header context, improving retrieval relevance
  • No behavioral change when ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER is disabled
  • No new config keys or dependencies

Testing

  1. Enable markdown header splitting in Admin → Settings → Documents
  2. Upload a markdown file with nested headers (H1-H4)
  3. Inspect chunk metadata via ChromaDB or API

Before: all chunks have identical metadata (parent doc only)
After: each chunk includes Header 1, Header 2, etc. from its section
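The before/after comparison could be checked programmatically once the chunk metadata has been exported as a list of dicts. The following is a rough sketch under that assumption: the helper names `header_keys` and `summarize` are hypothetical, and how the metadata is retrieved (ChromaDB or the API) is left out.

```python
def header_keys(metadata):
    """Return the header-derived keys (e.g. 'Header 1') present in a metadata dict."""
    return sorted(k for k in metadata if k.startswith("Header "))


def summarize(chunk_metadatas):
    """Summarize whether chunks carry header context or share identical metadata."""
    return {
        "chunks": len(chunk_metadatas),
        "with_headers": sum(1 for m in chunk_metadatas if header_keys(m)),
        "all_identical": all(m == chunk_metadatas[0] for m in chunk_metadatas),
    }


# Example inputs mirroring the described before/after states (values are made up).
before_fix = [{"source": "guide.md"}, {"source": "guide.md"}]
after_fix = [
    {"source": "guide.md", "Header 1": "Chapter 1"},
    {"source": "guide.md", "Header 1": "Chapter 1", "Header 2": "1.1 Background"},
]

before_summary = summarize(before_fix)
after_summary = summarize(after_fix)
```

With these inputs, the "before" summary reports identical metadata and no header keys, while the "after" summary reports header keys on every chunk.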


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 06:20:27 -05:00
Reference: github-starred/open-webui#26117