[PR #19944] [CLOSED] feat: add two-stage markdown header text splitter with minimum chunk size merging #25405

opened 2026-04-20 05:55:26 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/19944
Author: @Classic298
Created: 12/14/2025
Status: Closed

Base: dev ← Head: md-splitting


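📝 Commits (10+)

  • 87d07f3 init
  • b7ba583 Update Documents.svelte
  • 38a5bb5 Update Documents.svelte
  • ad7d9f7 Update retrieval.py
  • 0b547aa Update retrieval.py
  • 0c1d56f Update retrieval.py
  • ce9fc54 Update retrieval.py
  • 9b56b30 Update retrieval.py
  • 13d2060 rename
  • 5a0522e init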

📊 Changes

4 files changed (+548 additions, -139 deletions)


📝 backend/open_webui/config.py (+114 -7)
📝 backend/open_webui/main.py (+93 -11)
📝 backend/open_webui/routers/retrieval.py (+271 -100)
📝 src/lib/components/admin/Settings/Documents.svelte (+70 -21)

📄 Description

  • [X] Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch will lead to immediate closure of the PR.
  • [X] Description: Provide a concise description of the changes made in this pull request down below.
  • [X] Changelog: Ensure a changelog entry following the format of Keep a Changelog (https://keepachangelog.com/) is added at the bottom of the PR description.
  • [X] Documentation: If necessary, update relevant documentation (Open WebUI Docs, https://github.com/open-webui/docs), such as environment variables, tutorials, or other documentation sources.
  • [X] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • [X] Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include them in the PR description.
  • [X] Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • [X] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [X] Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • feat: Introduces a new feature or enhancement to the codebase

Changelog Entry

Description

Related: #18715
Related: #19156
Related: #19277

This PR introduces a two-stage document chunking architecture that adds optional markdown header preprocessing before the standard character/token splitting. When enabled, documents are first split by markdown headers (H1-H6), then chunks below a configurable minimum size are merged with the chunks that follow them, and finally the standard text splitter is applied. For markdown-heavy documents this produces fewer, larger chunks, which improves chunk quality and reduces embedding costs.

Fewer chunks REDUCE embedding costs and REDUCE the storage needed for vectors in the database. They also SPEED UP embedding and document processing and IMPROVE RAG performance and quality.

Motivation: Documents with many markdown headers often produce excessively small, low-quality chunks that hurt retrieval performance and waste embedding API calls. This feature allows users to leverage document structure while maintaining semantic coherence.

Added

  • ENABLE_MARKDOWN_HEADER_SPLITTING config option to enable two-stage chunking
  • CHUNK_MIN_SIZE config option for minimum chunk size (interpreted as characters or tokens based on TEXT_SPLITTER setting)
  • Admin UI toggle for "Markdown Header Splitting" with tooltip explanation
  • Admin UI input for "Min Chunk Size" (conditionally displayed when markdown splitting is enabled)
  • Two-stage splitting architecture
    • Stage 1: Optional markdown header preprocessing with minimum chunk merging
    • Stage 2: Character or token splitting (existing behavior)
  • Heading metadata preservation in chunk metadata (headings field)
  • Source/file boundary protection to prevent merging chunks from different documents
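
The heading metadata item above can be illustrated with a minimal sketch of Stage 1. This is not the code from the PR: `Chunk`, `split_by_headers`, and `HEADING_RE` are hypothetical names, and a plain regular expression stands in for whatever splitter `retrieval.py` actually uses. It only shows the idea of splitting on H1-H6 and recording the active heading path in a `headings` field. For brevity it does not skip `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass, field

# Matches ATX headings "# ..." through "###### ..." at the start of a line.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")


@dataclass
class Chunk:
    """Hypothetical container: chunk text plus its metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_by_headers(markdown: str, source: str) -> list[Chunk]:
    """Split on H1-H6 headings, recording the heading path of each chunk."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headings
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            headings = [title for _, title in path]
            chunks.append(Chunk(text, {"source": source, "headings": headings}))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the section that was open before this heading
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk keeps the `source` it came from, which is what a later merging step would need in order to respect document boundaries.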

Changed

  1. Refactored text splitting logic to support two-stage architecture
  2. Updated RAG config API endpoints to include new configuration options
  3. Removed standalone markdown_header option from TEXT_SPLITTER dropdown (replaced by dedicated toggle)

Removed

  • markdown_header option from TEXT_SPLITTER dropdown (functionality replaced by ENABLE_MARKDOWN_HEADER_SPLITTING toggle with better UX)

Breaking Changes

  • The TEXT_SPLITTER=markdown_header option is no longer supported. Users should enable the new ENABLE_MARKDOWN_HEADER_SPLITTING toggle instead, which provides better control and combines with character/token splitting.
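
For deployments that still set the removed value, the migration could look roughly like the sketch below. `migrate_splitter_settings` is a hypothetical helper, not part of the PR, and falling back to `character` for `TEXT_SPLITTER` is an assumption; the PR text only says the old value is no longer supported and that the toggle replaces it.

```python
def migrate_splitter_settings(settings: dict[str, str]) -> dict[str, str]:
    """Return a copy of the settings with the legacy splitter value replaced."""
    migrated = dict(settings)
    if migrated.get("TEXT_SPLITTER") == "markdown_header":
        # Assumption: fall back to character splitting for Stage 2.
        migrated["TEXT_SPLITTER"] = "character"
        # The dedicated toggle replaces the removed dropdown option.
        migrated["ENABLE_MARKDOWN_HEADER_SPLITTING"] = "true"
    return migrated
```

Settings that never used `markdown_header` pass through unchanged, so the helper is safe to run on any configuration.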

Additional Information

How it works:

  • When ENABLE_MARKDOWN_HEADER_SPLITTING is enabled, documents are first split by markdown headers (H1-H6)
  • If CHUNK_MIN_SIZE > 0, small chunks are merged with subsequent chunks until they meet the minimum size threshold (respecting max chunk size and document boundaries)
  • The merged chunks then go through the standard character or token splitter
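
The merging step described above can be sketched as a single greedy pass. This is an illustration under stated assumptions, not the PR's implementation: `merge_small_chunks`, the `separator`, and the pluggable `length` function (characters by default, a token counter if `TEXT_SPLITTER` is token-based) are all hypothetical. Document boundaries are left out here to keep the size logic visible.

```python
from typing import Callable


def merge_small_chunks(
    chunks: list[str],
    min_size: int,
    max_size: int,
    length: Callable[[str], int] = len,
    separator: str = "\n\n",
) -> list[str]:
    """Greedily merge undersized chunks with the chunks that follow them."""
    if min_size <= 0:
        return list(chunks)  # merging disabled, mirrors CHUNK_MIN_SIZE = 0

    merged: list[str] = []
    current: str | None = None
    for chunk in chunks:
        if current is None:
            current = chunk
            continue
        candidate = current + separator + chunk
        # Merge only while the open chunk is too small and the result still fits.
        if length(current) < min_size and length(candidate) <= max_size:
            current = candidate
        else:
            merged.append(current)
            current = chunk
    if current is not None:
        merged.append(current)  # may stay below min_size: best effort
    return merged
```

A chunk that cannot grow without exceeding `max_size`, or that has no successor, is emitted as-is, which matches the "best effort" behavior listed under the edge cases.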

Edge cases handled:

  • Chunks from different source files/URLs are never merged together
  • Missing metadata is treated conservatively (no merging if source/file_id is missing)
  • "Best effort" merging: if a chunk can't reach minimum size without exceeding maximum, it's kept as-is
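
The first two edge cases amount to a guard that runs before any merge. A minimal sketch, assuming chunk metadata is a dict with optional `source` and `file_id` keys (the key names come from the description above; `same_document` itself is a hypothetical helper):

```python
def same_document(a: dict, b: dict) -> bool:
    """True only if both chunks provably belong to the same document."""
    matched = False
    for key in ("file_id", "source"):
        left = a.get(key) or None   # treat empty values as missing
        right = b.get(key) or None
        if left is None and right is None:
            continue                # key absent on both sides: no evidence
        if left != right:
            return False            # differs, or present on one side only
        matched = True
    # Conservative default: with no shared identifying key, do not merge.
    return matched
```

A merge loop would call this on each pair of neighboring chunks and skip the merge whenever it returns `False`.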

Environment variables:

  • ENABLE_MARKDOWN_HEADER_SPLITTING (default: false)
  • CHUNK_MIN_SIZE (default: 0 - disabled)
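
As a rough illustration of how these two variables and their defaults could be read, here is a standalone sketch. It does not reflect how `config.py` wires them into Open WebUI's configuration system; `ChunkConfig` and `load_chunk_config` are hypothetical names, and accepting only the string `true` as enabled is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkConfig:
    enable_markdown_header_splitting: bool
    chunk_min_size: int  # characters or tokens, depending on TEXT_SPLITTER


def load_chunk_config(env: Mapping[str, str] = os.environ) -> ChunkConfig:
    """Read the two settings, applying the documented defaults."""
    raw_flag = env.get("ENABLE_MARKDOWN_HEADER_SPLITTING", "false")
    enabled = raw_flag.strip().lower() == "true"
    # Negative values are clamped to 0, which means merging is disabled.
    min_size = max(0, int(env.get("CHUNK_MIN_SIZE", "0")))
    return ChunkConfig(enabled, min_size)
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in isolation, for example `load_chunk_config({"CHUNK_MIN_SIZE": "1000"})`.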

Screenshots or Videos

[3 screenshots]

Real Test 1 - Testing with a web document:

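When adding this link from the docs (https://raw.githubusercontent.com/open-webui/docs/refs/heads/main/docs/getting-started/env-configuration.mdx) with Min Chunk Size set to 0 (default) and markdown header splitting on, it creates 588+ chunks.

[screenshot]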

When you set the min chunk size to 1000 tokens, the same link creates only 44 chunks, roughly 93% fewer. That saves embedding cost and storage and improves RAG performance for markdown-heavy documents and web pages.

[screenshot]
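
The size of the reduction can be checked from the two chunk counts reported above. The helper name is hypothetical, and 588 is used as a lower bound because the first run is reported as "588+":

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage by which the chunk count dropped."""
    return 100.0 * (before - after) / before


# 588 chunks without merging, 44 with a 1000-token minimum.
print(f"{reduction_percent(588, 44):.1f}%")  # prints "92.5%"
```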

(logging visible in screenshots is removed in final PR)

Real Test 2 - Testing with an uploaded file

Min Chunk Size set to 0:

[screenshot]

Min Chunk Size set to 1000 tokens:

[screenshot]

Test 3

Tested in knowledge base - also works there as intended.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 05:55:26 -05:00
Reference: github-starred/open-webui#25405