[PR #19086] [CLOSED] feat+perf: Parallel markdown header splitting with configurable minimum chunk merging and refactored markdown header splitting for simpler code base #25084

Closed
opened 2026-04-20 05:44:49 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/19086
Author: @Classic298
Created: 11/10/2025
Status: Closed

Base: dev ← Head: markdown-chunking-refac


📝 Commits (10+)

  • 7ee9b00 Implement message cleaning before API call
  • 070a6c6 Filter out empty assistant messages before cleaning
  • c88147f refac+feat+breaking: Make markdown header splitting a configurable preprocessing step (#27)
  • 192d81d Update Chat.svelte
  • 01e0ef5 Update retrieval.py
  • 0e5fa86 Update config.py
  • c2dc863 Update main.py
  • 2d2eb29 Update main.py
  • def350c Update retrieval.py
  • 780f92d Update retrieval.py

📊 Changes

4 files changed (+306 additions, -78 deletions)

📝 backend/open_webui/config.py (+17 -0)
📝 backend/open_webui/main.py (+6 -0)
📝 backend/open_webui/routers/retrieval.py (+186 -43)
📝 src/lib/components/admin/Settings/Documents.svelte (+97 -35)

📄 Description

  • [X] Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch will lead to immediate closure of the PR.
  • [X] Description: Provide a concise description of the changes made in this pull request down below.
  • [X] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • [X] Documentation: If necessary, update the relevant documentation (Open WebUI Docs), such as environment variables, tutorials, or other documentation sources.
  • [X] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • [X] Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include them in the PR description.
  • [X] Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • [X] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [X] Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • feat: Introduces a new feature or enhancement to the codebase
    • refactor: Code restructuring for better maintainability, readability, or scalability

Changelog Entry

Description

This PR introduces three major improvements to document chunking:

1. Parallel Markdown Header Processing

  • Implemented ThreadPoolExecutor with 8 workers for concurrent document processing
  • Significantly improves performance when processing large batches of markdown files
  • Maintains document order and metadata integrity across parallel operations
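A minimal sketch of order-preserving parallel splitting, not the code from this PR. `split_by_headers` and the dict-based document shape are hypothetical stand-ins for the real splitter and document objects in retrieval.py. It relies on `ThreadPoolExecutor.map` returning results in input order, which is one way to keep document order stable:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real header splitter: split before each
# line that starts with 1-6 '#' characters followed by a space.
HEADER_RE = re.compile(r"^(?=#{1,6} )", flags=re.MULTILINE)


def split_by_headers(doc: dict) -> list[dict]:
    """Split one document at markdown headers, copying its metadata to each chunk."""
    parts = [p for p in HEADER_RE.split(doc["text"]) if p.strip()]
    return [{"text": p.strip(), "metadata": dict(doc["metadata"])} for p in parts]


def split_all(docs: list[dict], max_workers: int = 8) -> list[dict]:
    """Split documents concurrently; pool.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_doc = list(pool.map(split_by_headers, docs))
    # Flatten while keeping each document's chunks contiguous and in order.
    return [chunk for chunks in per_doc for chunk in chunks]
```

How much speedup threads give in practice depends on how much of the splitting work releases the GIL, so the worker count is best treated as a tunable rather than a guaranteed multiplier.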

2. Intelligent Minimum Chunk Merging

  • Added CHUNK_MIN_SIZE (character-based) and CHUNK_MIN_TOKENS (token-based) configuration options
  • Prevents wasteful super-short chunks that result from markdown header splits
  • Sequential merging algorithm that:
    • Combines chunks below minimum threshold with subsequent chunks
    • Respects maximum chunk size limits (never exceeds CHUNK_SIZE)
    • Preserves document structure and heading metadata
    • Only merges chunks from the same document (never across documents)
  • Example: Prevents "# Header\n\nsome text" from creating tiny, low-quality chunks
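The merging rules listed above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: chunks are plain dicts with hypothetical `text` and `source` keys, sizes are measured in characters (the `CHUNK_MIN_SIZE` case), and a merged chunk is joined with a blank line.

```python
from typing import Optional


def merge_small_chunks(chunks: list[dict], min_size: int, max_size: int) -> list[dict]:
    """Greedy sequential merge of undersized chunks into the chunk that follows."""
    merged: list[dict] = []
    pending: Optional[dict] = None  # undersized chunk waiting for a successor

    for chunk in chunks:
        if pending is None:
            current = chunk
        else:
            same_doc = pending["source"] == chunk["source"]
            combined = pending["text"] + "\n\n" + chunk["text"]
            if same_doc and len(combined) <= max_size:
                current = {"text": combined, "source": chunk["source"]}
            else:
                # Best effort: emit the short chunk rather than cross a
                # document boundary or exceed the maximum size.
                merged.append(pending)
                current = chunk
            pending = None

        if len(current["text"]) < min_size:
            pending = current  # still too short, try the next chunk
        else:
            merged.append(current)

    if pending is not None:
        merged.append(pending)  # trailing short chunk has nothing to merge with
    return merged
```

With `min_size=10`, the chunks `"# Header"` and `"some text here"` from the same source are combined into one chunk, while a short chunk from a different source is left on its own.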

3. Configurable Two-Stage Chunking Architecture

  • Added ENABLE_MARKDOWN_HEADER_SPLITTING boolean config flag
  • Refactored retrieval.py to use two-stage splitting:
    • Stage 1: Optional markdown header preprocessing (if enabled) with parallel processing and merging
    • Stage 2: Character or token splitting based on TEXT_SPLITTER config
  • Updated UI to replace markdown_header dropdown option with a checkbox
  • Removed standalone "Markdown (Header)" text splitter option
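A rough sketch of the two-stage flow described above, assuming hypothetical helper names (`header_split`, `char_split`, `chunk_document`) and showing only the character-based variant of Stage 2. The token-based variant and the merging step are omitted for brevity:

```python
import re


def header_split(text: str) -> list[str]:
    """Stage 1 (optional): split before each markdown header line."""
    parts = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


def char_split(text: str, chunk_size: int) -> list[str]:
    """Stage 2: fixed-size character windows (no overlap in this sketch)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_document(text: str, *, enable_header_splitting: bool, chunk_size: int) -> list[str]:
    """Run the optional header stage, then the size-based stage on each section."""
    sections = header_split(text) if enable_header_splitting else [text]
    return [piece for section in sections for piece in char_split(section, chunk_size)]
```

Because Stage 2 always runs, every final chunk stays within `chunk_size` whether or not the header stage is enabled; the flag only changes where the section boundaries fall.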

Benefits

Performance:

  • Up to 8x faster markdown header processing for large document batches (the upper bound with 8 workers)
  • Reduced embedding API calls by eliminating tiny chunks

Quality:

  • Prevents low-quality retrievals from excessively short chunks
  • Maintains semantic coherence while respecting token limits
  • Better utilization of embedding model context windows

Flexibility:

  • Works with both character-based and token-based splitting strategies
  • Compatible with all embedding models, including those with strict token limits (e.g., Gemini: 2048 tokens, text-embedding-ada-002: 8191 tokens)
  • "Best effort" merging: tries to reach the minimum size, but accepts a below-minimum chunk when merging would push it over the maximum

Configuration

New environment variables:

  • ENABLE_MARKDOWN_HEADER_SPLITTING (default: False)
  • CHUNK_MIN_SIZE (default: 0, characters)
  • CHUNK_MIN_TOKENS (default: 0, tokens)

UI controls added for all three settings in the admin Documents configuration panel.
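As an illustration of the defaults listed above, the three variables could be read from the environment as shown below. The `read_chunking_config` helper is hypothetical; the actual wiring in config.py is not shown in this description and may differ.

```python
import os
from typing import Mapping


def read_chunking_config(env: Mapping[str, str] = os.environ) -> dict:
    """Parse the three new settings, falling back to the documented defaults."""
    return {
        # Boolean flag, default False
        "ENABLE_MARKDOWN_HEADER_SPLITTING": env.get(
            "ENABLE_MARKDOWN_HEADER_SPLITTING", "False"
        ).lower() == "true",
        # Minimum chunk size in characters, default 0 (no merging)
        "CHUNK_MIN_SIZE": int(env.get("CHUNK_MIN_SIZE", "0")),
        # Minimum chunk size in tokens, default 0 (no merging)
        "CHUNK_MIN_TOKENS": int(env.get("CHUNK_MIN_TOKENS", "0")),
    }
```

With all three variables unset, the defaults leave header splitting off and both minimums at 0.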

Related: #18715
Related: #19156

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.

This implementation allows users to enable markdown header splitting as an optional preprocessing step before applying character or token-based chunking. This approach combines the benefits of semantic chunking based on headers with the compatibility advantages of fixed chunk sizes.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 05:44:49 -05:00

Reference: github-starred/open-webui#25084