mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[PR #19086] [CLOSED] feat+perf: Parallel markdown header splitting with configurable minimum chunk merging and refactored markdown header splitting for simpler code base #48132
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/19086
Author: @Classic298
Created: 11/10/2025
Status: ❌ Closed
Base: dev ← Head: markdown-chunking-refac
📝 Commits (10+)
7ee9b00 Implement message cleaning before API call
070a6c6 Filter out empty assistant messages before cleaning
c88147f refac+feat+breaking: Make markdown header splitting a configurable preprocessing step (#27)
192d81d Update Chat.svelte
01e0ef5 Update retrieval.py
0e5fa86 Update config.py
c2dc863 Update main.py
2d2eb29 Update main.py
def350c Update retrieval.py
780f92d Update retrieval.py
📊 Changes
4 files changed (+306 additions, -78 deletions)
📝 backend/open_webui/config.py (+17 -0)
📝 backend/open_webui/main.py (+6 -0)
📝 backend/open_webui/routers/retrieval.py (+186 -43)
📝 src/lib/components/admin/Settings/Documents.svelte (+97 -35)
📄 Description
Changelog Entry
Description
This PR introduces three major improvements to document chunking:
1. Parallel Markdown Header Processing
- ThreadPoolExecutor with 8 workers for concurrent document processing
2. Intelligent Minimum Chunk Merging
- CHUNK_MIN_SIZE (character-based) and CHUNK_MIN_TOKENS (token-based) configuration options
- … CHUNK_SIZE)
- … "# Header\n\nsome text" from creating tiny, low-quality chunks
3. Configurable Two-Stage Chunking Architecture
- ENABLE_MARKDOWN_HEADER_SPLITTING boolean config flag
- … TEXT_SPLITTER config
Benefits
Performance:
Quality:
Flexibility:
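Items 1 and 2 of the description (parallel header splitting and minimum chunk merging) can be illustrated with a minimal, standalone sketch. This is not the code from the PR: the function names `split_on_headers`, `merge_small_chunks`, and `chunk_documents` are hypothetical, the header regex handles only `#`-style headers, and the merge rule (fold a chunk into its predecessor when either one is below the minimum and the result still fits the maximum) is an assumption about how such merging might work. Only the use of `ThreadPoolExecutor` with 8 workers and the character-based minimum come from the description.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumption: only ATX-style headers ("# ", "## ", ...) at line start count.
# A "# " line inside a fenced code block would also match in this sketch.
HEADER_RE = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_on_headers(text: str) -> list[str]:
    """Split one document into sections that each start at a markdown header."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first header
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def merge_small_chunks(chunks: list[str], min_size: int, max_size: int) -> list[str]:
    """Fold a chunk into the previous one when either is shorter than
    min_size characters and the merged text stays within max_size."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and (len(merged[-1]) < min_size or len(chunk) < min_size):
            candidate = merged[-1] + "\n\n" + chunk
            if len(candidate) <= max_size:
                merged[-1] = candidate
                continue
        merged.append(chunk)
    return merged


def chunk_documents(docs: list[str], min_size: int = 0,
                    max_size: int = 1000, workers: int = 8) -> list[list[str]]:
    """Process documents concurrently; results keep the input order."""
    def process(text: str) -> list[str]:
        return merge_small_chunks(split_on_headers(text), min_size, max_size)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, docs))
```

With `min_size=0` (the default named in the Configuration section) no merging happens in this sketch, so a header followed by a few words stays its own chunk. Because this splitting is pure-Python and CPU-bound, threads mainly help when per-document work releases the GIL or involves I/O.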
Configuration
New environment variables:
- ENABLE_MARKDOWN_HEADER_SPLITTING (default: False)
- CHUNK_MIN_SIZE (default: 0, characters)
- CHUNK_MIN_TOKENS (default: 0, tokens)
UI controls added for all three settings in the admin Documents configuration panel.
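As a rough illustration of how these three variables and their stated defaults could be read, here is a small standalone sketch. The `ChunkingSettings` dataclass and `load_chunking_settings` helper are hypothetical and do not reflect how `backend/open_webui/config.py` actually registers its settings. Only the variable names and default values are taken from the description above; treating the string "true" (case-insensitive) as the enabled value is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkingSettings:
    # Defaults mirror the values listed in the PR description.
    enable_markdown_header_splitting: bool = False
    chunk_min_size: int = 0    # characters
    chunk_min_tokens: int = 0  # tokens


def load_chunking_settings(env: Mapping[str, str] = os.environ) -> ChunkingSettings:
    """Read the three settings from an environment-like mapping."""
    flag = env.get("ENABLE_MARKDOWN_HEADER_SPLITTING", "false")
    return ChunkingSettings(
        # Assumption: only the literal "true" (any case) enables the flag.
        enable_markdown_header_splitting=flag.strip().lower() == "true",
        chunk_min_size=int(env.get("CHUNK_MIN_SIZE", "0")),
        chunk_min_tokens=int(env.get("CHUNK_MIN_TOKENS", "0")),
    )
```

Passing the mapping explicitly, for example `load_chunking_settings({"CHUNK_MIN_SIZE": "200"})`, keeps the helper easy to test without touching the real process environment.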
Related: #18715
Related: #19156
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.