mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[PR #19944] [CLOSED] feat: add two-stage markdown header text splitter with minimum chunk size merging #25405
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/19944
Author: @Classic298
Created: 12/14/2025
Status: ❌ Closed
Base: dev ← Head: md-splitting
📝 Commits (10+)
87d07f3 init
b7ba583 Update Documents.svelte
38a5bb5 Update Documents.svelte
ad7d9f7 Update retrieval.py
0b547aa Update retrieval.py
0c1d56f Update retrieval.py
ce9fc54 Update retrieval.py
9b56b30 Update retrieval.py
13d2060 rename
5a0522e init
📊 Changes
4 files changed (+548 additions, -139 deletions)
📝 backend/open_webui/config.py (+114 -7)
📝 backend/open_webui/main.py (+93 -11)
📝 backend/open_webui/routers/retrieval.py (+271 -100)
📝 src/lib/components/admin/Settings/Documents.svelte (+70 -21)
📄 Description
Changelog Entry
Description
Related: #18715
Related: #19156
Related: #19277
This PR introduces a two-stage document chunking architecture that adds optional markdown header preprocessing before the standard character/token splitting. When enabled, documents are first split by markdown headers (H1-H6), then chunks below a configurable minimum size threshold are merged, and finally the standard text splitter is applied. For markdown-heavy documents this improves chunk quality and reduces embedding costs.
Producing fewer, larger chunks reduces embedding costs and the storage needed for vectors in the database, speeds up embedding and document processing, and improves RAG performance and quality.
Motivation: Documents with many markdown headers often produce excessively small, low-quality chunks that hurt retrieval performance and waste embedding API calls. This feature allows users to leverage document structure while maintaining semantic coherence.
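The pipeline described above (split by headers, merge undersized chunks, then apply the standard splitter) can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from this PR: the function names `split_by_headers`, `merge_small`, `split_fixed`, and `two_stage_split` are hypothetical, sizes are measured in characters rather than tokens, and the fixed-width `split_fixed` is only a stand-in for the existing character/token splitter. Header-like lines inside fenced code blocks are not treated specially.

```python
import re

# ATX-style markdown headers, H1 through H6 (assumption: headers start at column 0).
HEADER_RE = re.compile(r"^#{1,6}\s")


def split_by_headers(text: str) -> list[str]:
    """Stage 1a: start a new section at every markdown header line."""
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if HEADER_RE.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return [s for s in sections if s.strip()]


def merge_small(sections: list[str], min_size: int) -> list[str]:
    """Stage 1b: accumulate sections until the buffer reaches min_size characters."""
    if min_size <= 0:
        # A threshold of 0 disables merging.
        return list(sections)
    merged: list[str] = []
    buffer = ""
    for section in sections:
        buffer = f"{buffer}\n\n{section}" if buffer else section
        if len(buffer) >= min_size:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Trailing remainder: fold it into the previous chunk when one exists.
        if merged:
            merged[-1] = f"{merged[-1]}\n\n{buffer}"
        else:
            merged.append(buffer)
    return merged


def split_fixed(chunk: str, chunk_size: int) -> list[str]:
    """Stage 2: naive fixed-width stand-in for the standard text splitter."""
    return [chunk[i:i + chunk_size] for i in range(0, len(chunk), chunk_size)]


def two_stage_split(text: str, min_size: int, chunk_size: int) -> list[str]:
    """Run header splitting and merging, then the standard splitter on each result."""
    chunks: list[str] = []
    for merged in merge_small(split_by_headers(text), min_size):
        chunks.extend(split_fixed(merged, chunk_size))
    return chunks


doc = "# A\nshort\n## B\nalso short\n## C\n" + "x" * 50
print(len(two_stage_split(doc, min_size=0, chunk_size=1000)))   # one chunk per header section
print(len(two_stage_split(doc, min_size=40, chunk_size=1000)))  # small sections merged together
```

In this sketch a `min_size` of 0 leaves the header sections untouched, which mirrors the default described in the tests below; raising it trades many tiny header-level chunks for fewer, larger ones before the final splitter runs.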
Added
Changed
Removed
Breaking Changes
Additional Information
How it works:
Edge cases handled:
Environment variables:
Screenshots or Videos
Real Test 1 - Testing with a web document:
When adding this link from the docs with the minimum chunk size set to 0 (the default) and markdown header based splitting enabled, it creates 588+ chunks.
With the minimum chunk size set to 1000 tokens, the same link creates only 44 chunks, roughly 93% fewer. This saves cost and storage and improves RAG performance for markdown-heavy documents and web pages.
(logging visible in screenshots is removed in final PR)
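The reduction quoted for Real Test 1 can be checked from the two reported chunk counts. The snippet below takes 588 as the baseline; since the text reports "588+", the true reduction may be slightly higher. The helper name `chunk_reduction` is hypothetical.

```python
def chunk_reduction(before: int, after: int) -> float:
    """Percentage of chunks eliminated when going from `before` to `after`."""
    return (before - after) / before * 100.0


print(f"{chunk_reduction(588, 44):.1f}% fewer chunks")  # prints "92.5% fewer chunks"
```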
Real Test 2 - Testing with an uploaded file
Min Chunk Size set to Zero:
Min Chunk Size set to 1000 tokens:
Test 3
Tested in knowledge base - also works there as intended.
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.