mirror of
https://github.com/open-webui/open-webui.git
synced 2026-03-11 00:04:08 -05:00
issue: Architectural Flaw: RAG pipeline corrupts chunk content when adding files to a Knowledge Base #5879
Originally created by @christian-hawk on GitHub (Jul 27, 2025).
Check Existing Issues
Installation Method
Docker
Open WebUI Version
0.6.18
Ollama Version (if applicable)
No response
Operating System
Ubuntu 22.04
Browser (if applicable)
No response
Confirmation
README.md.
Expected Behavior
When a file is added to a Knowledge Base, its chunks and their full, rich metadata (including the headings list from MarkdownHeaderTextSplitter) should be perfectly copied or cloned from the original file-* collection into the Knowledge Base's collection.
The vector representation of a given document chunk should be identical across all collections it belongs to, preserving data integrity and search consistency.
- page_content, metadata, and vector representation of any given chunk must be identical across all collections it resides in.
- page_content, vector, metadata) from the source file-* collection to the target KB collection.
Actual Behavior
There is a critical design flaw in the RAG pipeline that causes data corruption when a pre-processed file is added to a Knowledge Base. Instead of performing a clean, 1:1 clone of the original high-fidelity vectors, the system re-processes a lower-fidelity version of the file's content.
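The difference between cloning and re-processing can be illustrated with a toy model. The sketch below is not Open WebUI's code and does not use LangChain: `split_by_headers` is a hypothetical, simplified header-aware splitter written only for this illustration, and the chunk layout (`page_content` plus a `metadata["headings"]` list) is an assumption based on the description in this issue. It shows that copying chunks preserves them exactly, while re-splitting text that has already lost its headers changes both chunk boundaries and metadata.

```python
import copy
import re


def split_by_headers(markdown: str) -> list[dict]:
    """Toy header-aware splitter (hypothetical, for illustration only).

    Header lines are removed from the text and recorded as a 'headings'
    trail in each chunk's metadata.
    """
    chunks: list[dict] = []
    trail: list[tuple[int, str]] = []  # (header level, title) of open headers
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(
                {
                    "page_content": text,
                    "metadata": {"headings": [title for _, title in trail]},
                }
            )
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the chunk that belongs to the previous header
            level = len(match.group(1))
            # Leave any header at the same or a deeper level.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


source = "# Chapter 1\n\nIntro text.\n\n## Section 1.1\n\nDetails here.\n"

# First pass: the high-fidelity chunks of the per-file collection.
original = split_by_headers(source)

# Expected behavior: a 1:1 clone into the Knowledge Base collection.
cloned = copy.deepcopy(original)

# Failure mode: re-splitting text whose header lines are already gone.
flattened = "\n\n".join(chunk["page_content"] for chunk in original)
reprocessed = split_by_headers(flattened)

print(cloned == original)       # clone is identical
print(reprocessed == original)  # re-processed chunks are not
```

In this toy model the re-processed result collapses into a single chunk with no headings at all. The real symptom reported here (a partially preserved `headings` array) differs in detail, but the mechanism is the same: the second pass works from less information than the first.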
This flawed second pass results in chunks being stored in the Knowledge Base collection with different page_content and boundaries than the original, high-fidelity chunks. This corruption of the core page_content compromises the integrity of the RAG system, with the loss of metadata serving as irrefutable proof of the faulty process.
The primary issue is page_content corruption. The text of chunks within the Knowledge Base collection is inconsistent with the original chunks from the source file-* collection. This can be verified by directly inspecting the vector database. As page_content is the core data used for retrieval, this is a critical flaw that leads to inconsistent vector representations and undermines the reliability of the RAG system.
As secondary evidence of this flawed re-processing, rich structural metadata captured by the initial Loader pass is completely stripped out.
Steps to Reproduce
There is a design issue in the pipeline for adding an already-processed file to a Knowledge Base. The current implementation causes data inconsistency and a loss of rich metadata.
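One way to make the inconsistency concrete is to export the chunks of both collections and diff them field by field. The sketch below assumes the chunks have already been fetched from the vector database by whatever means is available and are represented as plain dictionaries with `page_content`, `metadata`, and `vector` keys. The function name `diff_chunks` and the sample data are hypothetical and exist only for this illustration.

```python
from typing import Any

Chunk = dict[str, Any]


def diff_chunks(source: list[Chunk], target: list[Chunk]) -> list[str]:
    """Return human-readable differences between two chunk lists.

    Chunks are compared pairwise by position. Vectors are compared for
    exact equality here; a real check may prefer a numeric tolerance.
    """
    problems: list[str] = []
    if len(source) != len(target):
        problems.append(f"chunk count differs: {len(source)} vs {len(target)}")
    for index, (src, dst) in enumerate(zip(source, target)):
        for field in ("page_content", "metadata", "vector"):
            if src.get(field) != dst.get(field):
                problems.append(f"chunk {index}: {field} differs")
    return problems


# Hypothetical exports of the same chunk from the two collections.
file_chunks = [
    {
        "page_content": "Details here.",
        "metadata": {"headings": ["Chapter 1", "Section 1.1"]},
        "vector": [0.1, 0.2, 0.3],
    }
]
kb_chunks = [
    {
        "page_content": "Details here.",
        "metadata": {"headings": ["Section 1.1"]},
        "vector": [0.1, 0.2, 0.3],
    }
]

for problem in diff_chunks(file_chunks, kb_chunks):
    print(problem)
```

An empty result would mean the Knowledge Base collection is a faithful copy of the source collection. Any reported difference points to re-processing rather than cloning.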
- TEXT_SPLITTER=markdown_header.txt.
- test.md file to a new Knowledge Base.
- file-* collection.
- page_content.
- headings metadata array is complete: i.e. ['Chapter 1', 'Section 1.1'].
- page_content may be different. This is the core data corruption.
- headings metadata array is incomplete and corrupted: i.e. ['Section 1.1']. This is the proof of the lossy re-processing.
Logs & Screenshots
owui250726.log
Additional Information
No response