[PR #19277] [CLOSED] feat+perf: Parallel markdown header splitting with configurable minimum chunk merging and refactored markdown header splitting for simpler code base #40788

Closed
opened 2026-04-25 13:13:32 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/19277
Author: @Classic298
Created: 11/18/2025
Status: Closed

Base: dev ← Head: markdown-chunking-refacactor


📝 Commits (1)

  • 266159b Markdown chunking refac (#73)

📊 Changes

4 files changed (+306 additions, -78 deletions)

📝 backend/open_webui/config.py (+17 -0)
📝 backend/open_webui/main.py (+6 -0)
📝 backend/open_webui/routers/retrieval.py (+186 -43)
📝 src/lib/components/admin/Settings/Documents.svelte (+97 -35)

📄 Description

  • Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch will lead to immediate closure of the PR.
  • Description: Provide a concise description of the changes made in this pull request down below.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: If necessary, update relevant documentation Open WebUI Docs like environment variables, the tutorials, or other documentation sources.
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include it in the PR description.
  • Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • feat: Introduces a new feature or enhancement to the codebase
    • refactor: Code restructuring for better maintainability, readability, or scalability

Changelog Entry

Description

This PR introduces three major improvements to document chunking:

1. Parallel Markdown Header Processing

  • Implemented ThreadPoolExecutor with 8 workers for concurrent document processing
  • Significantly improves performance when processing large batches of markdown files
  • Maintains document order and metadata integrity across parallel operations
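
A minimal sketch of how order-preserving parallel splitting could look. It is not the code from this PR: `split_on_headers` and `split_documents_parallel` are hypothetical names, and the header detection is a simplified regex that does not account for `#` lines inside fenced code blocks. The only part taken from the description is the use of `ThreadPoolExecutor` with 8 workers.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Matches the start of an ATX header line ("# " through "###### ").
HEADER_RE = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_on_headers(text):
    """Split one markdown document into sections that each start at a header."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first header
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def split_documents_parallel(docs, max_workers=8):
    """Split every document concurrently; one list of sections per document."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which worker finishes first, so document order is preserved.
        return list(pool.map(split_on_headers, docs))


docs = ["# A\n\ntext a\n\n## B\n\ntext b", "intro\n\n# C\n\ntext c"]
sections_per_doc = split_documents_parallel(docs)
```

Because each document yields its own list, sections from different documents never mix. The actual speedup depends on the workload: threads in CPython help most when the work releases the GIL, so pure-Python string processing may gain much less than the worker count suggests.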

2. Intelligent Minimum Chunk Merging

  • Added CHUNK_MIN_SIZE (character-based) and CHUNK_MIN_TOKENS (token-based) configuration options
  • Prevents wasteful super-short chunks that result from markdown header splits
  • Sequential merging algorithm that:
    • Combines chunks below minimum threshold with subsequent chunks
    • Respects maximum chunk size limits (never exceeds CHUNK_SIZE)
    • Preserves document structure and heading metadata
    • Only merges chunks from the same document (never across documents)
  • Example: Prevents "# Header\n\nsome text" from creating tiny, low-quality chunks
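
The merging rules above can be sketched as a single greedy pass. This is an illustration under assumptions, not the PR's implementation: `merge_small_chunks` and its parameters are hypothetical, chunks are plain strings (heading metadata is not modeled), and `length` defaults to a character count. A token counter could be passed instead to mimic `CHUNK_MIN_TOKENS`.

```python
def merge_small_chunks(chunks, min_size, max_size, length=len, sep="\n\n"):
    """Greedily merge the chunks of ONE document.

    A chunk below min_size absorbs the next chunk unless the result would
    exceed max_size; in that case it is kept as-is (best effort).
    """
    merged = []
    current = None
    for chunk in chunks:
        if current is None:
            current = chunk
        elif length(current) < min_size and length(current + sep + chunk) <= max_size:
            current = current + sep + chunk  # still too small and the merge fits
        else:
            merged.append(current)  # big enough, or merging would overflow
            current = chunk
    if current is not None:
        merged.append(current)
    return merged


sections = ["# A", "aaaa aaaa aaaa", "# B", "bbbb"]
merged = merge_small_chunks(sections, min_size=10, max_size=30)
```

Calling the function once per document keeps merges from crossing document boundaries. In the example, the trailing chunk stays below the minimum because nothing is left to merge it with.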

Merging has several effects:

  • Fewer chunks: a 3-page document with 30 headers would otherwise produce 30 chunks, even though it is only about 10,000 tokens long. With a minimum chunk size of 1,000 tokens, that drops to 9-10 chunks
  • Each chunk carries more context, which improves its semantic value
  • Fewer embedding model requests
  • Better performance
  • Less storage, because fewer chunks mean less vector data to store

3. Configurable Two-Stage Chunking Architecture

  • Added ENABLE_MARKDOWN_HEADER_SPLITTING boolean config flag
  • Refactored retrieval.py to use two-stage splitting:
    • Stage 1: Optional markdown header preprocessing (if enabled) with parallel processing and merging
    • Stage 2: Character or token splitting based on TEXT_SPLITTER config
  • Updated UI to replace markdown_header dropdown option with a checkbox
  • Removed standalone "Markdown (Header)" text splitter option
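
A rough sketch of the two-stage flow, assuming simplified stand-ins for both stages: `split_on_headers`, `split_fixed`, and `chunk_document` are hypothetical names, Stage 2 is a plain fixed-width character window without overlap, and the merging step that belongs to Stage 1 is left out for brevity. It does not reproduce the code in `retrieval.py`.

```python
import re


def split_on_headers(text):
    # Stage 1 helper: cut before each markdown header line (simplified regex).
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


def split_fixed(text, chunk_size):
    # Stage 2 helper: fixed-width character windows, no overlap.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_document(text, chunk_size, enable_header_splitting=False):
    # Stage 1 is optional; when disabled, the whole text is one section.
    sections = split_on_headers(text) if enable_header_splitting else [text]
    chunks = []
    for section in sections:
        # Stage 2 always runs, so no chunk exceeds chunk_size.
        chunks.extend(split_fixed(section, chunk_size))
    return chunks


text = "# A\n\nshort\n\n# B\n\n" + "x" * 25
with_headers = chunk_document(text, chunk_size=20, enable_header_splitting=True)
without_headers = chunk_document(text, chunk_size=20)
```

With header splitting enabled, chunk boundaries line up with section starts first and the size limit is applied within each section. With it disabled, only the size limit determines the boundaries.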

Benefits

Performance:

  • Up to 8x faster markdown header processing for large document batches
  • Reduced embedding API calls by eliminating tiny chunks

Quality:

  • Prevents low-quality retrievals from excessively short chunks
  • Maintains semantic coherence while respecting token limits
  • Better utilization of embedding model context windows

Flexibility:

  • Works with both character-based and token-based splitting strategies
  • Compatible with all embedding models, including those with strict token limits (e.g., Gemini: 2048 tokens, text-embedding-ada-002: 8191 tokens)
  • "Best effort" merging: tries to meet minimum, accepts below-minimum if necessary to stay under maximum

Configuration

New environment variables:

  • ENABLE_MARKDOWN_HEADER_SPLITTING (default: False)
  • CHUNK_MIN_SIZE (default: 0, characters)
  • CHUNK_MIN_TOKENS (default: 0, tokens)

UI controls added for all three settings in the admin Documents configuration panel.
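
As an illustration of how these three variables and their defaults could be read, here is a small sketch. The helpers `env_bool` and `env_int` are hypothetical and are not the configuration mechanism added to `config.py` in this PR; only the variable names and default values come from the list above.

```python
import os


def env_bool(name, default=False, env=os.environ):
    # Treat "true" (any casing) as True; anything else is False.
    return str(env.get(name, default)).strip().lower() == "true"


def env_int(name, default=0, env=os.environ):
    # Unset variables fall back to the default (0 disables the minimum).
    return int(env.get(name, default))


# Example environment; a real deployment would use os.environ instead.
example_env = {"ENABLE_MARKDOWN_HEADER_SPLITTING": "True", "CHUNK_MIN_TOKENS": "1000"}

settings = {
    "ENABLE_MARKDOWN_HEADER_SPLITTING": env_bool(
        "ENABLE_MARKDOWN_HEADER_SPLITTING", env=example_env
    ),
    "CHUNK_MIN_SIZE": env_int("CHUNK_MIN_SIZE", env=example_env),
    "CHUNK_MIN_TOKENS": env_int("CHUNK_MIN_TOKENS", env=example_env),
}
```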

Related: #18715
Related: #19156

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.
This implementation allows users to enable markdown header splitting as an optional preprocessing step before applying character or token-based chunking. This approach combines the benefits of semantic chunking based on headers with the compatibility advantages of fixed chunk sizes.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 13:13:32 -05:00
Reference: github-starred/open-webui#40788