[PR #20314] [CLOSED] feat: add CHUNK_MERGE_THRESHOLD for merging small markdown header chunks #48611

Closed
opened 2026-04-30 00:37:20 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/20314
Author: @Classic298
Created: 1/1/2026
Status: Closed

Base: dev ← Head: min-chunk-size-merging


📝 Commits (2)

  • 574e62d feat: add CHUNK_MERGE_THRESHOLD config for merging small markdown header chunks
  • 8f39de1 Update retrieval.py

📊 Changes

4 files changed (+129 additions, -0 deletions)


📝 backend/open_webui/config.py (+6 -0)
📝 backend/open_webui/main.py (+2 -0)
📝 backend/open_webui/routers/retrieval.py (+94 -0)
📝 src/lib/components/admin/Settings/Documents.svelte (+27 -0)

📄 Description

This PR adds chunk merge threshold functionality for the markdown header text splitter.

When enabled, chunks smaller than the configured threshold are merged with neighboring chunks from the same document, as long as the merged result stays within CHUNK_SIZE. This prevents tiny, low-quality chunks that hurt retrieval performance and waste resources.

Related: #18715, #19156, #19277

Benefits

  • Reduces storage costs - fewer chunks means fewer vectors in the database
  • Speeds up retrieval - smaller index to search through
  • Improves RAG quality - better semantic coherence in chunks
  • Improves chunk quality - no more tiny, meaningless fragments
  • Saves embedding costs - fewer API calls for external embedding services
  • Speeds up embedding process - less data to vectorize
  • Reduces compute resources - especially impactful when embedding locally

A noticeable speedup in document processing was observed during testing.

Added

  • CHUNK_MERGE_THRESHOLD configuration option (default: 0 = disabled)
  • Admin UI input for "Chunk Merge Threshold" (conditionally displayed when markdown header splitting is enabled)
  • Merging logic that respects:
    • Maximum chunk size (never exceeds CHUNK_SIZE)
    • Source/file boundaries (never merges chunks from different documents)
    • Token vs character measurement based on TEXT_SPLITTER setting
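
To illustrate the last constraint, here is a minimal sketch of picking a size function from the splitter mode. The name `make_length_fn`, the mode string `"token"`, and the whitespace-based token count are assumptions for illustration only; the PR's actual code in `retrieval.py` is not shown here and may count tokens with a real tokenizer.

```python
from typing import Callable


def make_length_fn(text_splitter: str) -> Callable[[str], int]:
    """Return a function that measures chunk size for the given splitter mode.

    Assumption: "token" selects token counting and any other value falls back
    to character counting. Splitting on whitespace is only a stand-in for a
    real tokenizer.
    """
    if text_splitter == "token":
        return lambda text: len(text.split())
    return len


count_chars = make_length_fn("character")
count_tokens = make_length_fn("token")
print(count_chars("small header chunk"))   # 18
print(count_tokens("small header chunk"))  # 3
```

The point of routing both modes through one callable is that the threshold and CHUNK_SIZE are then compared in the same unit as the configured splitter.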

How It Works

  1. Documents are split by markdown headers (existing upstream behavior)
  2. If CHUNK_MERGE_THRESHOLD > 0, chunks smaller than this threshold are merged with neighboring chunks when possible
  3. Standard character/token splitting is then applied as usual
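
Step 2 can be sketched as a greedy pass over the header-split chunks. This is not the PR's implementation: the `Chunk` dataclass, its `source` field, the `merge_small_chunks` name, and the blank-line separator are all hypothetical, chosen only to show the three constraints (threshold, CHUNK_SIZE cap, same-document boundary) working together.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical field identifying the originating document


def merge_small_chunks(
    chunks: List[Chunk],
    threshold: int,
    max_size: int,
    length_fn: Callable[[str], int] = len,
) -> List[Chunk]:
    """Fold a chunk into its predecessor when either of the two is below
    the threshold, both share a source, and the result fits in max_size."""
    if threshold <= 0:  # 0 means merging is disabled
        return list(chunks)

    merged: List[Chunk] = []
    for chunk in chunks:
        if merged:
            prev = merged[-1]
            combined = prev.text + "\n\n" + chunk.text
            either_small = (
                length_fn(prev.text) < threshold
                or length_fn(chunk.text) < threshold
            )
            if (
                either_small
                and prev.source == chunk.source
                and length_fn(combined) <= max_size
            ):
                merged[-1] = Chunk(combined, prev.source)
                continue
        merged.append(Chunk(chunk.text, chunk.source))
    return merged


chunks = [
    Chunk("# A", "a.md"),
    Chunk("Intro text for A.", "a.md"),
    Chunk("# B", "b.md"),        # different file: never merged into a.md
    Chunk("x" * 50, "b.md"),     # merging would exceed max_size
]
result = merge_small_chunks(chunks, threshold=10, max_size=40)
print(len(result))  # 3
```

In this example the first two chunks are merged, while the `# B` header stays separate because its only same-file neighbor would push the merged chunk past the size cap. The standard character/token splitter from step 3 would then run on the result.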

Environment Variables

  • CHUNK_MERGE_THRESHOLD (default: 0 - disabled)
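
As a rough sketch of the default described above, the variable could be read like this. The helper name `read_chunk_merge_threshold` and the fallback to 0 for invalid or negative values are assumptions; how the PR actually wires the option into `config.py` is not shown in this description.

```python
import os
from typing import Mapping, Optional


def read_chunk_merge_threshold(env: Optional[Mapping[str, str]] = None) -> int:
    """Read CHUNK_MERGE_THRESHOLD, treating missing or invalid values as 0 (disabled)."""
    env = os.environ if env is None else env
    raw = env.get("CHUNK_MERGE_THRESHOLD", "0")
    try:
        value = int(raw)
    except ValueError:
        return 0  # assumption: unparsable input disables merging
    return max(value, 0)  # assumption: negative values also disable merging


print(read_chunk_merge_threshold({}))                                 # 0
print(read_chunk_merge_threshold({"CHUNK_MERGE_THRESHOLD": "1000"}))  # 1000
```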

Changelog

Added

  • CHUNK_MERGE_THRESHOLD config option for merging small chunks from markdown header splits
  • Admin UI input for configuring chunk merge threshold

Testing

Tested with web documents and uploaded files. Setting the merge threshold to 1000 reduced the chunk count by ~93%, significantly improving embedding efficiency and RAG performance.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-30 00:37:20 -05:00
Reference: github-starred/open-webui#48611