[PR #24139] feat: add HuggingFace-tokenizer-based token text splitter for RAG chunking #50516
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/24139
Author: @kela4
Created: 4/25/2026
Status: 🔄 Open
Base: dev ← Head: feat/token_transformers-text-splitter

📝 Commits (3)

- 25a0475 feat: add token_transformers text splitter using HuggingFace tokenizer
- 2b5d6a7 chore: format
- abe6a3a chore: i18n

📊 Changes
66 files changed (+615 additions, -16 deletions)
- backend/open_webui/config.py (+15 -0)
- backend/open_webui/main.py (+15 -0)
- backend/open_webui/routers/retrieval.py (+141 -0)
- src/lib/apis/retrieval/index.ts (+32 -0)
- src/lib/components/admin/Settings/Documents.svelte (+107 -16)
- src/lib/i18n/locales/ar-BH/translation.json (+5 -0)
- src/lib/i18n/locales/ar/translation.json (+5 -0)
- src/lib/i18n/locales/az-AZ/translation.json (+5 -0)
- src/lib/i18n/locales/bg-BG/translation.json (+5 -0)
- src/lib/i18n/locales/bn-BD/translation.json (+5 -0)
- src/lib/i18n/locales/bo-TB/translation.json (+5 -0)
- src/lib/i18n/locales/bs-BA/translation.json (+5 -0)
- src/lib/i18n/locales/ca-ES/translation.json (+5 -0)
- src/lib/i18n/locales/ceb-PH/translation.json (+5 -0)
- src/lib/i18n/locales/cs-CZ/translation.json (+5 -0)
- src/lib/i18n/locales/da-DK/translation.json (+5 -0)
- src/lib/i18n/locales/de-DE/translation.json (+5 -0)
- src/lib/i18n/locales/dg-DG/translation.json (+5 -0)
- src/lib/i18n/locales/el-GR/translation.json (+5 -0)
- src/lib/i18n/locales/en-GB/translation.json (+5 -0)
- …and 46 more files
📄 Description
Pull Request Checklist
- Discussion: #21263
- The PR targets the `dev` branch. PRs targeting `main` will be immediately closed.
- The branch is rebased onto `dev` to ensure no unrelated commits (e.g. from `main`) are included. Push updates to the existing PR branch instead of closing and reopening.

Changelog Entry
Description
This pull request adds a new text splitter option, `token_transformers`, that uses tokenizers available on HuggingFace (or loadable via `AutoTokenizer` from the `transformers` library) for accurate token-based text chunking, as an alternative to the existing tiktoken-based token splitter. It primarily supports the use of non-OpenAI embedding models.

Problem: The existing token-based text splitter, `Token (Tiktoken)`, uses OpenAI's encoding, which produces inaccurate token counts for non-OpenAI embedding models. This leads to chunks that are either too large (exceeding the embedding model's max sequence length, causing truncated embeddings) or too small (wasting context window capacity).

Related Discussions: Several discussions describe failures caused by the tiktoken-based splitter's chunk size calculation not matching non-OpenAI embedding models. If oversized chunks are sent to an embedding model via an external API, document embedding fails with errors like `Index out of range` or `Context of embedding model exceeds`. #21263 #17272 #21985

Solution: The `token_transformers` splitter uses the exact same tokenizer as the chosen embedding model, ensuring token counts match what the embedding model will actually process. The implementation and handling of the tokenizer model follow the same pattern as the local embedding model.
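As a minimal sketch of the core idea (the model name is just an example; any HuggingFace tokenizer or local snapshot path works), counting tokens with the embedding model's own tokenizer looks like this:

```python
from transformers import AutoTokenizer

# Example model; in Open WebUI this would come from RAG_TOKENIZER_MODEL
# or from the local embedding model's own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

def token_length(text: str) -> int:
    # Count tokens exactly as the embedding model would see them,
    # excluding the special tokens added around each sequence.
    return len(tokenizer.encode(text, add_special_tokens=False))

print(token_length("Chunk sizes are measured in model tokens, not characters."))
```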
Added

- `Token (Transformers)` text splitter option in RAG settings that uses a HuggingFace tokenizer to measure chunk sizes, ensuring token counts accurately reflect what the embedding model will process.
- `RAG_TOKENIZER_MODEL` config/env var to specify the HuggingFace model (or local path to a model snapshot) used as the tokenizer.
- `RAG_TOKENIZER_MODEL_AUTO_UPDATE` config/env var to control whether the tokenizer is automatically downloaded on startup (default: `true`; disabled when `OFFLINE_MODE` is active).
- `RAG_TOKENIZER_MODEL_TRUST_REMOTE_CODE` config/env var to allow remote code execution when loading the tokenizer (default: `false`).
- `POST /retrieval/tokenizer/update` API endpoint for downloading/refreshing the tokenizer model from the Admin UI (see the sketch after this list).

Changed

- `RAG_TOKENIZER_MODEL` field.

Fixed

- Embedding errors (`Index out of range`, `Context of embedding model exceeds`) caused by tiktoken-based token chunking producing inaccurate token counts for non-OpenAI embedding models (such as BGE, GTE, Qwen). With the `Token (Transformers)` splitter, token-based chunk sizes are measured with the same tokenizer the embedding model uses, so chunks never exceed the model's context limit.
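For illustration, the new endpoint from the Added list could be invoked like this; the `/api/v1` prefix, the bearer-token auth, and the response shape are assumptions here, only the `POST /retrieval/tokenizer/update` path comes from this PR:

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed local Open WebUI instance
ADMIN_TOKEN = "your-admin-api-key"  # hypothetical admin credential

# Trigger a download/refresh of the configured tokenizer model.
resp = requests.post(
    f"{BASE_URL}/api/v1/retrieval/tokenizer/update",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```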
Additional Information

Note on why `RecursiveCharacterTextSplitter` is used for token-based splitting: The name is potentially misleading; `RecursiveCharacterTextSplitter` refers to the splitting strategy, not the unit of measurement. The splitter works by recursively trying a list of separator characters (`\n\n`, `\n`, `" "`, `""`) to find natural break points in the text. What it measures, characters or tokens, is determined entirely by the `length_function` parameter. By passing `token_length` (which calls `tokenizer.encode()`) as the `length_function`, the splitter still breaks text at natural boundaries (paragraphs, sentences, words) but uses token counts instead of character counts to decide when a chunk is full. This is the canonical LangChain approach for boundary-aware token splitting. The alternative, `TokenTextSplitter`, encodes the full text into tokens first and slices at exact token boundaries, which means it can split mid-word or mid-sentence, degrading chunk coherence. Using `RecursiveCharacterTextSplitter` with a tokenizer `length_function` therefore combines accurate token-based size limits with clean, semantically coherent chunk boundaries.
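A minimal sketch of this combination, reusing the `token_length` helper from the earlier sketch (the separator list mirrors LangChain's defaults, and `chunk_size=510` matches the demo setup below):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")  # example model

def token_length(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # natural break points, tried in order
    chunk_size=510,                      # interpreted in tokens via length_function
    chunk_overlap=0,
    length_function=token_length,
)

document_text = "First paragraph.\n\nSecond paragraph, and so on."
chunks = splitter.split_text(document_text)
```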
Note on why `SentenceTransformersTokenTextSplitter` is not used: In discussion #21263 I mentioned `SentenceTransformersTokenTextSplitter` as a way to do token-based splitting with the embedding model's tokenizer. During development I recognized that it has two weaknesses compared to the new approach: it downloads the whole embedding model every time, even when the embedding model is served via an external API and no local weights are needed, and it returns chunk text in a normalized form (lowercased, whitespace/newlines stripped), so the returned chunk text differs from the original document text.
Note on chunk sizes: The configured `CHUNK_SIZE` is a maximum, not an exact target. `RecursiveCharacterTextSplitter` splits at natural boundaries (paragraphs, sentences, whitespace) and closes a chunk at the last boundary before the token limit would be exceeded. A chunk of e.g. 243 tokens when `CHUNK_SIZE=250` is expected and correct: the next natural unit of text would have pushed it over 250. Splitting mid-word or mid-sentence to hit exactly 250 tokens would degrade retrieval quality.
Tokenizer priority: `RAG_TOKENIZER_MODEL` takes priority over `ef.tokenizer` when both are available. Set `RAG_TOKENIZER_MODEL` explicitly to override the tokenizer used by the local embedding model (e.g. to use a different model's tokenizer for chunking). Leave `RAG_TOKENIZER_MODEL` empty to fall back to `ef.tokenizer` when a local embedding model is loaded.
Error handling:
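The flow itself is not reproduced here; as a rough sketch of the priority just described (the function name and `ef` handling are hypothetical, only the priority order is from this PR):

```python
from transformers import AutoTokenizer

def resolve_tokenizer(rag_tokenizer_model, ef):
    # 1. An explicitly set RAG_TOKENIZER_MODEL always wins.
    if rag_tokenizer_model:
        return AutoTokenizer.from_pretrained(rag_tokenizer_model)
    # 2. Otherwise fall back to the local embedding model's own tokenizer.
    if ef is not None and getattr(ef, "tokenizer", None) is not None:
        return ef.tokenizer
    # 3. Nothing available; callers decide how to handle this (see Error handling).
    return None
```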
Error handling: `save_docs_to_vector_db` and `merge_docs_to_target_size` treat a missing tokenizer differently by design. In `save_docs_to_vector_db` (the ingestion entry point) the tokenizer is required, and a missing one raises a `ValueError` immediately: fail fast before any work is done. In `merge_docs_to_target_size` (which runs later, after chunks already exist) the tokenizer should always be loaded by that point in the normal flow, so a missing tokenizer there indicates an unexpected internal state; it logs a warning and falls back to character counting rather than aborting mid-merge. This is a defensive fallback, not a silent degradation of the happy path.
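A simplified sketch of the two behaviors (the real functions do much more; the signatures here are illustrative only):

```python
import logging

log = logging.getLogger(__name__)

def save_docs_to_vector_db(docs, tokenizer=None):
    # Ingestion entry point: fail fast before any work is done.
    if tokenizer is None:
        raise ValueError("No tokenizer available for token-based chunking")
    ...  # chunk, embed, and store the documents

def merge_docs_to_target_size(docs, tokenizer=None):
    # Runs after chunks already exist; a missing tokenizer is unexpected here.
    if tokenizer is None:
        log.warning("Tokenizer missing; falling back to character counting")
        length_function = len
    else:
        def length_function(text):
            return len(tokenizer.encode(text, add_special_tokens=False))
    ...  # merge chunks up to the target size using length_function
```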
Screenshots or Videos

Documentation of feature testing:
Current state
Setup: the embedding model is used via an external API, e.g. `bge-large` with Ollama. `bge-large` has a maximum context length of 512 tokens; subtracting the special tokens, the net maximum to set as the chunk size is 510 tokens. The Markdown Header Text Splitter is disabled to get the cleanest demo environment.

First, the Text Splitter is set to `Token (Tiktoken)`: when uploading files to a chat or knowledge base, embedding errors appear. Two of the four uploaded files failed to embed and are missing.
The Open WebUI logs also show some errors.
With the new Text Splitter `Token (Transformers)`

When choosing `Token (Transformers)`, a new input field appears to specify the Tokenizer Model. If an external API is selected as the Embedding Model Engine and `Token (Transformers)` is selected, the input field becomes required. With a local embedding model, the field is optional.

The upload of the same four files now succeeds:
Chunk lengths are now as close to the configured maximum as natural text boundaries allow.
For this demo, a separate log message was added to show the length of the chunks. Chunks stay below the 510-token maximum, with cuts never occurring within a logical unit.
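A sketch of the kind of check behind that demo log message (the actual log line in the PR may differ; `chunks` and `token_length` refer to the earlier sketches):

```python
import logging

log = logging.getLogger(__name__)

for i, chunk in enumerate(chunks):
    n = token_length(chunk)
    log.debug("chunk %d: %d tokens", i, n)
    assert n <= 510, f"chunk {i} exceeds the 510-token maximum ({n} tokens)"
```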
The feature was also tested in an offline environment using a previously downloaded local model snapshot set as the tokenizer path.
I manually tested relevant combinations of the following configuration options (with `ENABLE_PERSISTENT_CONFIG=False`, and also by configuring the values in the UI with `ENABLE_PERSISTENT_CONFIG=True`). I uploaded files and inspected the generated chunks.

Contributor License Agreement