[PR #19086] refactor+feat+breaking: Make markdown header splitting a configurable preprocessing step #11882

Open
opened 2025-11-11 19:59:30 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/19086
Author: @Classic298
Created: 11/10/2025
Status: 🔄 Open

Base: dev ← Head: markdown-chunking-refac


📝 Commits (4)

  • 7ee9b00 Implement message cleaning before API call
  • 070a6c6 Filter out empty assistant messages before cleaning
  • c88147f refac+feat+breaking: Make markdown header splitting a configurable preprocessing step (#27)
  • 192d81d Update Chat.svelte

📊 Changes

5 files changed (+62 additions, -50 deletions)

📝 backend/open_webui/config.py (+6 -0)
📝 backend/open_webui/main.py (+2 -0)
📝 backend/open_webui/routers/retrieval.py (+35 -42)
📝 src/lib/components/admin/Settings/Documents.svelte (+14 -1)
📝 src/lib/components/notes/NoteEditor/Chat.svelte (+5 -7)

📄 Description

  • Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch will lead to immediate closure of the PR.
  • Description: Provide a concise description of the changes made in this pull request down below.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: If necessary, update relevant documentation Open WebUI Docs like environment variables, the tutorials, or other documentation sources.
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include it in the PR description.
  • Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • feat: Introduces a new feature or enhancement to the codebase
    • refactor: Code restructuring for better maintainability, readability, or scalability

Changelog Entry

Description

Key changes:

  • Added ENABLE_MARKDOWN_HEADER_SPLITTING boolean config flag
  • Refactored retrieval.py to use two-stage splitting:
    • Stage 1: Optional markdown header preprocessing (if enabled)
    • Stage 2: Character or token splitting based on TEXT_SPLITTER config
  • Updated UI to replace markdown_header dropdown option with a checkbox
  • Removed standalone "Markdown (Header)" text splitter option
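The two-stage flow described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code in retrieval.py: the function names `split_on_headers`, `split_by_size`, and `chunk_document` are hypothetical, the header detection is a simple regex (it would also match `#` lines inside fenced code), and stage 2 is shown only as plain fixed-size character windows without overlap.

```python
import re

# Assumption: a header is a line starting with 1-6 '#' followed by a space or tab.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_on_headers(text: str) -> list[str]:
    """Stage 1 (hypothetical): cut the text immediately before each header line."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    if not starts or starts[0] != 0:
        # Keep any text that precedes the first header as its own section.
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def split_by_size(text: str, chunk_size: int) -> list[str]:
    """Stage 2 (hypothetical): fixed-size character windows, no overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_document(text: str, enable_header_splitting: bool, chunk_size: int) -> list[str]:
    """Run the optional header stage, then always run the size-based stage."""
    sections = split_on_headers(text) if enable_header_splitting else [text]
    chunks: list[str] = []
    for section in sections:
        chunks.extend(split_by_size(section, chunk_size))
    return chunks
```

With the flag enabled, no chunk spans a header boundary, and every chunk still respects `chunk_size`; with the flag disabled, the document goes straight to the size-based stage.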

Benefits:

  • Allows semantic chunking while respecting embedding model token limits
  • Prevents errors with embedding models that have small max token sizes
  • Provides flexibility to combine markdown preprocessing with either character or token splitting

Why This Architecture?
The two-stage approach provides several benefits:

  • Semantic chunking: Documents are first split by markdown headers, preserving document structure
  • Token safety: The second stage ensures chunks don't exceed embedding model limits
  • Flexibility: Works with character-based splitting and, new in this PR, with token-based splitting
  • Compatibility: Prevents errors with embedding models that have strict token limits (e.g., Gemini: 2048 tokens, text-embedding-ada-002: 8191 tokens)
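To illustrate the token-safety point, here is a small sketch of how a second stage can cap header-delimited sections at a token budget. It assumes whitespace-separated words as a stand-in for real tokens, so the counts will differ from what an actual embedding tokenizer reports; `split_by_tokens` and `cap_sections` are hypothetical names, not functions from this PR.

```python
def split_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens whitespace-delimited tokens.

    Assumption: whitespace tokens approximate model tokens. A real tokenizer
    would usually produce more tokens for the same text.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def cap_sections(sections: list[str], max_tokens: int) -> list[str]:
    """Apply the token cap to every section produced by header splitting."""
    capped: list[str] = []
    for section in sections:
        capped.extend(split_by_tokens(section, max_tokens))
    return capped


# A long section is broken up; a short one passes through unchanged.
sections = ["# Intro " + "word " * 10, "## Short note"]
chunks = cap_sections(sections, max_tokens=4)
```

A section that already fits the budget comes out as a single chunk, so header boundaries are preserved wherever the limit allows.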

Related: #18715

Changed

  • Refactored document splitting logic in retrieval.py from a three-way if/elif/else (character, token, markdown_header) to a two-stage pipeline
  • Markdown header splitting now acts as a preprocessing step rather than a standalone splitter
  • Users can now combine markdown header splitting with either character OR token splitting

Removed

  1. TEXT_SPLITTER == "markdown_header" branch from splitting logic
  2. "Markdown (Header)" option from UI dropdown (<option value="markdown_header">)

Breaking Changes

BREAKING CHANGE: The TEXT_SPLITTER config value "markdown_header" is no longer supported. Users who previously selected "Markdown (Header)" from the dropdown will need to:

  • Select either "Character" or "Token" as their text splitter
  • Enable the new "Enable Markdown Header Splitting" checkbox to restore markdown preprocessing

This provides equivalent functionality with better flexibility, but requires manual reconfiguration for existing users using the markdown header option.
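The manual reconfiguration amounts to the mapping sketched below. This helper is purely illustrative and is not part of the PR, which does not describe any automatic migration; `migrate_splitter_config` and its `fallback` parameter are hypothetical, and the values "character" and "token" are taken from the splitter names listed in this description.

```python
def migrate_splitter_config(
    text_splitter: str,
    enable_markdown_header_splitting: bool = False,
    fallback: str = "character",
) -> tuple[str, bool]:
    """Map a pre-PR splitter setting to the (TEXT_SPLITTER, flag) pair.

    Hypothetical helper: the removed "markdown_header" value becomes the
    chosen fallback splitter with header preprocessing switched on. Any
    other value is returned unchanged.
    """
    if text_splitter == "markdown_header":
        return fallback, True
    return text_splitter, enable_markdown_header_splitting


# Old setting -> character splitting with header preprocessing enabled.
new_splitter, header_flag = migrate_splitter_config("markdown_header")
```

Passing `fallback="token"` corresponds to choosing "Token" in the dropdown instead of "Character".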


Screenshots or Videos

  • [Attach any relevant screenshots or videos demonstrating the changes]

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.

This implementation allows users to enable markdown header splitting as an optional preprocessing step before applying character or token-based chunking. This approach combines the benefits of semantic chunking based on headers with the compatibility advantages of fixed chunk sizes.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-11 19:59:30 -06:00
Reference: github-starred/open-webui#11882