[PR #24139] feat: add HuggingFace-tokenizer-based token text splitter for RAG chunking #66374

Open
opened 2026-05-06 12:42:59 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/24139
Author: @kela4
Created: 4/25/2026
Status: 🔄 Open

Base: dev ← Head: feat/token_transformers-text-splitter


📝 Commits (3)

  • 25a0475 feat: add token_transformers text splitter using HuggingFace tokenizer
  • 2b5d6a7 chore: format
  • abe6a3a chore: i18n

📊 Changes

66 files changed (+615 additions, -16 deletions)

View changed files

📝 backend/open_webui/config.py (+15 -0)
📝 backend/open_webui/main.py (+15 -0)
📝 backend/open_webui/routers/retrieval.py (+141 -0)
📝 src/lib/apis/retrieval/index.ts (+32 -0)
📝 src/lib/components/admin/Settings/Documents.svelte (+107 -16)
📝 src/lib/i18n/locales/ar-BH/translation.json (+5 -0)
📝 src/lib/i18n/locales/ar/translation.json (+5 -0)
📝 src/lib/i18n/locales/az-AZ/translation.json (+5 -0)
📝 src/lib/i18n/locales/bg-BG/translation.json (+5 -0)
📝 src/lib/i18n/locales/bn-BD/translation.json (+5 -0)
📝 src/lib/i18n/locales/bo-TB/translation.json (+5 -0)
📝 src/lib/i18n/locales/bs-BA/translation.json (+5 -0)
📝 src/lib/i18n/locales/ca-ES/translation.json (+5 -0)
📝 src/lib/i18n/locales/ceb-PH/translation.json (+5 -0)
📝 src/lib/i18n/locales/cs-CZ/translation.json (+5 -0)
📝 src/lib/i18n/locales/da-DK/translation.json (+5 -0)
📝 src/lib/i18n/locales/de-DE/translation.json (+5 -0)
📝 src/lib/i18n/locales/dg-DG/translation.json (+5 -0)
📝 src/lib/i18n/locales/el-GR/translation.json (+5 -0)
📝 src/lib/i18n/locales/en-GB/translation.json (+5 -0)

...and 46 more files

📄 Description

Pull Request Checklist

Discussion: #21263

  • Target branch: Verify that the pull request targets the dev branch. PRs targeting main will be immediately closed.
  • Description: Provide a concise description of the changes made in this pull request down below.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Add docs in Open WebUI Docs Repository. Document user-facing behavior, environment variables, public APIs/interfaces, or deployment steps. PR in Documentation-Repository: https://github.com/open-webui/docs/pull/1219 (pending merge)
  • Dependencies: Are there any new or upgraded dependencies? If so, explain why, update the changelog/docs, and include any compatibility notes. Actually run the code/function that uses updated library to ensure it doesn't crash.
  • Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Include reproducible steps to demonstrate the issue before the fix. Test edge cases (URL encoding, HTML entities, types). Take this as an opportunity to make screenshots of the feature/fix and include them in the PR description.
  • Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Design & Architecture: Prefer smart defaults over adding new settings; use local state for ephemeral UI logic. Open a Discussion for major architectural or UX changes.
  • Git Hygiene: Keep PRs atomic (one logical change). Clean up commits and rebase on dev to ensure no unrelated commits (e.g. from main) are included. Push updates to the existing PR branch instead of closing and reopening.
  • Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

This pull request adds a new text splitter option token_transformers that uses any tokenizer available on HuggingFace (or loadable via AutoTokenizer from the transformers library) for accurate token-based text chunking, as an alternative to the existing tiktoken-based token splitter. This primarily supports the use of non-OpenAI embedding models.

Problem: The existing token-based text splitter Token (Tiktoken) uses OpenAI's encoding, which produces inaccurate token counts for non-OpenAI embedding models. This leads to chunks that are either too large (exceeding the embedding model's max sequence length, causing truncated embeddings) or too small (wasting context window capacity).
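
To make the mismatch concrete, here is a minimal sketch (assuming the tiktoken and transformers packages; BAAI/bge-large-en is just an example model) that counts the same text with both tokenizers:

```python
# Sketch: the same text measured with OpenAI's encoding vs. a
# BERT-style embedding tokenizer. The two counts generally differ.
import tiktoken
from transformers import AutoTokenizer

text = "Open WebUI chunks documents before embedding them."

# cl100k_base is the encoding behind the Token (Tiktoken) splitter.
openai_count = len(tiktoken.get_encoding("cl100k_base").encode(text))

# The tokenizer a BGE embedding model actually uses (WordPiece).
bge_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en")
bge_count = len(bge_tokenizer.encode(text, add_special_tokens=False))

# Sizing chunks with the wrong count can overshoot the embedding
# model's max sequence length, truncating the embedded text.
print(openai_count, bge_count)
```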

Related Discussions: Several discussions describe failures caused by the tiktoken-based token splitter's mismatched chunk size calculation when using non-OpenAI embedding models. If oversized chunks are sent to an embedding model via an external API, the document embedding process fails with errors like Index out of range or Context of embedding model exceeds. #21263 #17272 #21985

Solution: The token_transformers splitter uses the exact same tokenizer as the chosen embedding model, ensuring token counts match what the embedding model will actually process.

The loading and handling of the tokenizer model follow the same implementation pattern as the local embedding model.
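
In sketch form, this wiring looks roughly like the following (assuming LangChain's langchain_text_splitters and HuggingFace transformers; identifiers such as token_length and the sample document_text are illustrative, not the PR's exact code):

```python
# Sketch of a tokenizer-matched splitter; names are illustrative.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en")

def token_length(text: str) -> int:
    # Length measured in the embedding model's own tokens.
    return len(tokenizer.encode(text, add_special_tokens=False))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=510,          # maximum tokens per chunk, not characters
    chunk_overlap=0,
    length_function=token_length,
)

document_text = "Some long document text ..."
chunks = splitter.split_text(document_text)
```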

Added

  • New Token (Transformers) text splitter option in RAG settings that uses a HuggingFace tokenizer to measure chunk sizes, ensuring token counts accurately reflect what the embedding model will process.
  • RAG_TOKENIZER_MODEL config/env var to specify the HuggingFace model (or local path to model snapshot) used as the tokenizer.
  • RAG_TOKENIZER_MODEL_AUTO_UPDATE config/env var to control whether the tokenizer is automatically downloaded on startup (default: true; disabled when OFFLINE_MODE is active).
  • RAG_TOKENIZER_MODEL_TRUST_REMOTE_CODE config/env var to allow remote code execution when loading the tokenizer (default: false).
  • POST /retrieval/tokenizer/update API endpoint for downloading/refreshing the tokenizer model from the Admin UI.
  • Name input field and download button for the RAG Tokenizer in the Admin Settings → Documents UI.

Changed

  • RAG configuration form and API response now include the RAG_TOKENIZER_MODEL field.

Fixed

  • Embedding errors (e.g. Index out of range, Context of embedding model exceeds) caused by the use of tiktoken-based token chunking producing inaccurate token counts for non-OpenAI embedding models (such as BGE, GTE, Qwen). With the Token (Transformers) splitter, token-based chunk sizes are measured with the same tokenizer the embedding model uses, so chunks never exceed the model's context limit.

Additional Information

Note on why RecursiveCharacterTextSplitter is used for token-based splitting: The name is potentially misleading: RecursiveCharacterTextSplitter refers to the splitting strategy, not the unit of measurement. The splitter works by recursively trying a list of separators (\n\n, \n, a single space, and finally the empty string "") to find natural break points in text. What it measures, characters or tokens, is determined entirely by the length_function parameter. By passing token_length (which calls tokenizer.encode()) as the length_function, the splitter still breaks text at natural boundaries (paragraphs, sentences, words) but uses token counts instead of character counts to decide when a chunk is full. This is the canonical LangChain approach for boundary-aware token splitting. The alternative, TokenTextSplitter, encodes the full text into tokens first and slices at exact token boundaries, which means it can split mid-word or mid-sentence, degrading chunk coherence. Using RecursiveCharacterTextSplitter with a tokenizer length_function therefore combines accurate token-based size limits with clean, semantically coherent chunk boundaries.
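
For contrast, a small sketch of the mid-word slicing behavior of LangChain's tiktoken-backed TokenTextSplitter; the tiny chunk_size is deliberate, purely to make the cuts visible:

```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # tiktoken encoding
    chunk_size=5,                 # tiny on purpose, to force cuts
    chunk_overlap=0,
)

# Slices at exact token boundaries, so a chunk can begin or end
# mid-word or mid-sentence, degrading chunk coherence.
for chunk in splitter.split_text(
    "Tokenization boundaries rarely align with word boundaries."
):
    print(repr(chunk))
```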

Note on why SentenceTransformersTokenTextSplitter is not used: In discussion #21263 I mentioned SentenceTransformersTokenTextSplitter as a way to do token-based splitting using the embedding model's tokenizer. During development I found two weaknesses compared to the new approach: it downloads the whole embedding model every time, even when the embedding model is served via an external API where no local weights are needed, and it returns chunk text in a normalized form (lowercase, whitespace/newlines stripped), causing the returned chunk text to differ from the original document text.
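
A sketch of the footprint difference (class and package names are real; the behavioral comments paraphrase the weaknesses described above):

```python
# Downloads the full embedding model weights just to split text:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

st_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# Chunks come back decoded from token ids, i.e. normalized
# (lowercased, whitespace collapsed), not the original text.

# Downloads only the tokenizer files (vocab + config), no weights:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2"
)
```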

Note on chunk sizes: The configured CHUNK_SIZE is a maximum, not an exact target. RecursiveCharacterTextSplitter splits at natural boundaries (paragraphs, sentences, whitespace) and closes a chunk at the last boundary before the token limit would be exceeded. A chunk of e.g. 243 tokens when CHUNK_SIZE=250 is expected and correct — the next natural unit of text would have pushed it over 250. Splitting mid-word or mid-sentence to hit exactly 250 tokens would degrade retrieval quality.
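
This property is straightforward to verify after splitting; a sketch reusing the hypothetical token_length helper and splitter from the Description above, built with chunk_size=250:

```python
CHUNK_SIZE = 250  # same value passed as chunk_size to the splitter

for chunk in splitter.split_text(document_text):
    n = token_length(chunk)
    # Counts like 243 are expected: the chunk closed at the last natural
    # boundary before the limit. (A single indivisible piece longer than
    # the limit can still exceed it; LangChain logs a warning then.)
    assert n <= CHUNK_SIZE, f"chunk of {n} tokens exceeds {CHUNK_SIZE}"
```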

Tokenizer priority: RAG_TOKENIZER_MODEL takes priority over ef.tokenizer when both are available. Set RAG_TOKENIZER_MODEL explicitly to override the tokenizer used by the local embedding model (e.g. to use a different model's tokenizer for chunking). Leave RAG_TOKENIZER_MODEL empty to fall back to ef.tokenizer when a local embedding model is loaded.

Tokenizer loading & resolution flow:

```mermaid
flowchart TD
    START["Server startup"] --> S{"RAG_TOKENIZER_MODEL configured?"}
    S -- Yes --> S2["get_rag_tokenizer() → app.state.rag_tokenizer"]
    S -- No --> S3["app.state.rag_tokenizer = None"]

    UI["Admin updates model via UI\n(/retrieval/tokenizer/update)"] --> U1["get_rag_tokenizer(force_update=True)\nforce-downloads from HuggingFace"]
    U1 --> U2["app.state.rag_tokenizer = new tokenizer\napp.state.config.RAG_TOKENIZER_MODEL = new name"]

    CFG["Admin changes model name\nin RAG config (no download)"] --> C1["app.state.rag_tokenizer = None\n(cache invalidated)"]
    C1 --> C2["Lazily reloaded on next ingestion\nvia get_rag_tokenizer()"]

    A["token_transformers splitter selected\nfor ingestion"] --> B{"RAG_TOKENIZER_MODEL configured?"}
    B -- Yes --> F{"app.state.rag_tokenizer cached?"}
    F -- Yes --> G["Use cached tokenizer"]
    F -- No --> H["get_rag_tokenizer() → cache & use"]
    H --> G
    B -- No --> C{"Local embedding model (ef)\nloaded with .tokenizer?"}
    C -- Yes --> D["Use ef.tokenizer"]
    C -- No --> E["Return None"]
    E --> I{"Call site?"}
    I -- "save_docs_to_vector_db" --> J["Raise ValueError: requires tokenizer"]
    I -- "merge_docs_to_target_size" --> K["Log warning, fall back to character counting"]
```

Error handling: save_docs_to_vector_db and merge_docs_to_target_size treat a missing tokenizer differently by design. In save_docs_to_vector_db (the ingestion entry point) the tokenizer is required and a missing one raises a ValueError immediately — fail fast before any work is done. In merge_docs_to_target_size (which runs later, after chunks already exist) the tokenizer should always be loaded by this point in the normal flow, so a missing tokenizer there indicates an unexpected internal state; it logs a warning and falls back to character counting rather than aborting mid-merge. This is a defensive fallback, not a silent degradation of the happy path.
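
In sketch form (call-site and helper names are from the PR description and flowchart; the bodies are abbreviated and the exact signatures are assumptions):

```python
import logging

log = logging.getLogger(__name__)

def save_docs_to_vector_db_sketch(request, docs):
    tokenizer = get_rag_tokenizer(request)  # resolution per flowchart above
    if tokenizer is None:
        # Ingestion entry point: fail fast before doing any work.
        raise ValueError("token_transformers splitter requires a tokenizer")
    ...

def merge_docs_to_target_size_sketch(request, chunks):
    tokenizer = get_rag_tokenizer(request)
    if tokenizer is None:
        # Unexpected internal state this late in the flow: warn and fall
        # back to character counting instead of aborting mid-merge.
        log.warning("No tokenizer available; falling back to characters")
        length_fn = len
    else:
        length_fn = lambda t: len(tokenizer.encode(t, add_special_tokens=False))
    ...
```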

Screenshots or Videos

Documentation of feature testing:

Current state

Setup: The embedding model is served via an external API, e.g. bge-large with Ollama. bge-large has a max content length of 512 tokens; subtracting the special tokens leaves a net maximum of 510 tokens to set as the chunk size. The Markdown Header Text Splitter is disabled to keep the demo environment simple.
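
The 510 figure can be derived from the tokenizer itself; a small sketch, assuming the transformers package and using the HuggingFace counterpart of the Ollama model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en")

max_len = tokenizer.model_max_length             # 512 for bge-large
special = tokenizer.num_special_tokens_to_add()  # 2 for BERT-style: [CLS], [SEP]
print(max_len - special)                         # 510 -> usable CHUNK_SIZE
```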

First, set the Text Splitter to Token (Tiktoken):

using-tiktoken

When uploading files to a chat or knowledge base, embedding errors appear: two of the four uploaded files fail to embed and are missing.

using-tiktoken-embedding-errors

The Open WebUI logs also show errors.

using-tiktoken-embedding-errors-log

With the new Text Splitter Token (Transformers)

When choosing Token (Transformers), a new input field appears to specify the Tokenizer Model. If an external API is selected as the Embedding Model Engine and Token (Transformers) is selected, the input field becomes required; with a local embedding model, the field is optional.

token-transformers-settings

The upload of the same four files now succeeds:

token-transformers-file-uploads

Chunk lengths are now as close to the configured maximum as natural text boundaries allow.
For this demo, a separate log message was added to show the length of the chunks. Chunks stay below the 510-token maximum, with cuts never occurring within a logical unit.

token-transformers-chunk-sizes

The feature was also tested in an offline environment using a previously downloaded local model snapshot set as the tokenizer path.
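
For reference, AutoTokenizer accepts a local snapshot directory in place of a hub id, which is what makes the offline setup possible; whether the PR also passes local_files_only is an assumption here:

```python
from transformers import AutoTokenizer

# A previously downloaded snapshot directory works as the "model name";
# local_files_only=True guarantees no network access (offline mode).
tokenizer = AutoTokenizer.from_pretrained(
    "<path-to-cache-dir>/.cache/huggingface/hub/"
    "models--BAAI--bge-large-en/snapshots/<revision>",
    local_files_only=True,
)
```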

I manually tested relevant combinations of the following configuration options (both with ENABLE_PERSISTENT_CONFIG=False via environment variables, and with ENABLE_PERSISTENT_CONFIG=True via the configurable values in the UI), then uploaded files and inspected the generated chunks.

```sh
# export RAG_TOKENIZER_MODEL="BAAI/bge-large-en"
export RAG_TOKENIZER_MODEL="<path-to-cache-dir>/.cache/huggingface/hub/models--BAAI--bge-large-en/snapshots/abe7d9d814b775ca171121fb03f394dc42974275"
# export RAG_TOKENIZER_MODEL_AUTO_UPDATE=True
export RAG_TOKENIZER_MODEL_AUTO_UPDATE=False
export RAG_TOKENIZER_MODEL_TRUST_REMOTE_CODE=True
# export RAG_TOKENIZER_MODEL_TRUST_REMOTE_CODE=False
export PDF_LOADER_MODE="single"
export CHUNK_SIZE=510
# export CHUNK_SIZE=1000 # tested expected warning messages if CHUNK_SIZE exceeds max-model-length while using embedding model via API
export CHUNK_MIN_SIZE_TARGET=0
# export CHUNK_MIN_SIZE_TARGET=100
export CHUNK_OVERLAP=0
# export CHUNK_OVERLAP=100
export ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER=True
# export ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER=False
export RAG_TEXT_SPLITTER="token_transformers"
export RAG_EMBEDDING_MODEL_AUTO_UPDATE=True
export RAG_EMBEDDING_MODEL_TRUST_REMOTE_CODE=True
# export RAG_EMBEDDING_ENGINE=""
export RAG_EMBEDDING_ENGINE="ollama"
# export RAG_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
export RAG_EMBEDDING_MODEL="bge-large:latest"
export OFFLINE_MODE=True
# export OFFLINE_MODE=False
```

Contributor License Agreement

  • By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/open-webui#66374