[PR #21523] [CLOSED] fix: preserve header metadata in markdown splitter #41750

Closed
opened 2026-04-25 13:54:16 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21523
Author: @Baireinhold
Created: 2/17/2026
Status: Closed

Base: dev ← Head: fix/markdown-header-metadata


📝 Commits (1)

  • 79b1f26 fix: preserve header metadata in markdown splitter

📊 Changes

1 file changed (+1 addition, -1 deletion)


📝 backend/open_webui/routers/retrieval.py (+1 -1)

📄 Description

Pull Request Checklist

  • Target branch: dev
  • Description: Provided below
  • Changelog: Provided below
  • Documentation: No user-facing behavior change, no docs needed
  • Dependencies: None
  • Testing: Manually tested with markdown files containing nested H1-H4 headers
  • Agentic AI Code: This fix was identified through manual code review and tested by the author
  • Code review: Self-reviewed
  • Design & Architecture: Minimal one-line fix, no new settings
  • Git Hygiene: Single atomic commit
  • Title Prefix: fix

Changelog Entry

Description

  • Fix header metadata being discarded when using MarkdownHeaderTextSplitter

Fixed

  • MarkdownHeaderTextSplitter.split_text() returns chunks whose metadata contains the header hierarchy (e.g. {"Header 1": "Chapter 1", "Header 2": "1.1 Background"}). Previously, each chunk was rebuilt with metadata={**doc.metadata}, which kept only the parent document's metadata and discarded split_chunk.metadata. This is changed to metadata={**doc.metadata, **split_chunk.metadata} so that both are merged.
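
The merge described above can be illustrated with a minimal, self-contained sketch. It is not the code from retrieval.py: the `Doc` dataclass is a hypothetical stand-in for a LangChain-style document object, and `merge_chunk_metadata` is an illustrative helper name. Only the expression `{**doc.metadata, **split_chunk.metadata}` comes from the PR description.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_chunk_metadata(doc: Doc, split_chunks: list[Doc]) -> list[Doc]:
    """Rebuild each chunk with the parent's metadata plus its own header metadata."""
    return [
        Doc(
            page_content=split_chunk.page_content,
            # Keys unpacked later win, so a chunk key overrides a
            # same-named key from the parent document.
            metadata={**doc.metadata, **split_chunk.metadata},
        )
        for split_chunk in split_chunks
    ]


# Assumed example data: a parent document and two header-split chunks.
parent = Doc("full markdown text", {"source": "guide.md"})
chunks = [
    Doc("Intro text", {"Header 1": "Chapter 1"}),
    Doc("Background text", {"Header 1": "Chapter 1", "Header 2": "1.1 Background"}),
]

merged = merge_chunk_metadata(parent, chunks)
print(merged[1].metadata)
# {'source': 'guide.md', 'Header 1': 'Chapter 1', 'Header 2': '1.1 Background'}
```

Because dictionary unpacking resolves duplicate keys in favor of the later operand, a parent document that already carries a key such as "Header 1" would have it overwritten by the chunk's value. With the old expression, `{**doc.metadata}`, every chunk would have carried only `{'source': 'guide.md'}`.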

Additional Information

  • Relates to #21486 (Bug 1)
  • One-line change in backend/open_webui/routers/retrieval.py

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 13:54:16 -05:00
Reference: github-starred/open-webui#41750