mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[PR #20314] [CLOSED] feat: add CHUNK_MERGE_THRESHOLD for merging small markdown header chunks #64419
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/20314
Author: @Classic298
Created: 1/1/2026
Status: ❌ Closed
Base: dev ← Head: min-chunk-size-merging
📝 Commits (2)
574e62d feat: add CHUNK_MERGE_THRESHOLD config for merging small markdown header chunks
8f39de1 Update retrieval.py
📊 Changes
4 files changed (+129 additions, -0 deletions)
📝 backend/open_webui/config.py (+6 -0)
📝 backend/open_webui/main.py (+2 -0)
📝 backend/open_webui/routers/retrieval.py (+94 -0)
📝 src/lib/components/admin/Settings/Documents.svelte (+27 -0)
📄 Description
This PR adds chunk merge threshold functionality for the markdown header text splitter.
When enabled, chunks smaller than the configured threshold are merged with neighboring chunks. This avoids tiny, low-quality chunks that hurt retrieval performance and waste resources.
Related: #18715, #19156, #19277
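The merging behavior described above can be illustrated with a minimal sketch. This is not the code from this PR: the function name `merge_small_chunks`, the `separator` parameter, the greedy forward-merge strategy, and the choice to measure chunk size in characters are all assumptions made for illustration only.

```python
def merge_small_chunks(
    chunks: list[str],
    threshold: int,
    separator: str = "\n\n",  # hypothetical joiner between merged chunks
) -> list[str]:
    """Greedily merge chunks shorter than `threshold` characters into a neighbor.

    Assumption: size is measured in characters; a threshold <= 0 disables merging.
    """
    if threshold <= 0:
        return list(chunks)

    merged: list[str] = []
    for chunk in chunks:
        # While the previous chunk is still below the threshold,
        # fold the current chunk into it instead of starting a new one.
        if merged and len(merged[-1]) < threshold:
            merged[-1] = merged[-1] + separator + chunk
        else:
            merged.append(chunk)

    # A small trailing chunk has no following neighbor, so attach it
    # to the chunk before it.
    if len(merged) > 1 and len(merged[-1]) < threshold:
        last = merged.pop()
        merged[-1] = merged[-1] + separator + last

    return merged
```

Under these assumptions, a short header-only chunk is folded into the text that follows it, so a document that splits into many near-empty header chunks ends up with far fewer, larger chunks to embed.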
Benefits
Noticeable speedup in document processing observed during testing.
Added
How It Works
Environment Variables
Changelog
Added
Testing
Tested with web documents and uploaded files. Setting the merge threshold to 1000 reduced the chunk count by ~93%, significantly improving embedding efficiency and RAG performance.
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.