issue: Architectural Flaw: RAG pipeline corrupts chunk content when adding files to a Knowledge Base #5879

Closed
opened 2025-11-11 16:36:48 -06:00 by GiteaMirror · 0 comments
Owner

Originally created by @christian-hawk on GitHub (Jul 27, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

0.6.18

Ollama Version (if applicable)

No response

Operating System

Ubuntu 22.04

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When a file is added to a Knowledge Base, its chunks and their full, rich metadata (including the headings list from MarkdownHeaderTextSplitter) should be perfectly copied or cloned from the original file-* collection into the Knowledge Base's collection.

The vector representation of a given document chunk should be identical across all collections it belongs to, preserving data integrity and search consistency.

  • The page_content, metadata, and vector representation of any given chunk must be identical across all collections it resides in.
  • Adding a file to a Knowledge Base must be a lossless, 1:1 cloning operation of all data points (ID, page_content, vector, metadata) from the source file-* collection to the target KB collection.
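The lossless clone requested above can be sketched in a few lines. This is illustrative only, not Open WebUI's actual code: the chunk fields (`id`, `page_content`, `vector`, `metadata`) and the `clone_chunks` helper are assumptions modeling what a 1:1 copy between collections would look like.

```python
# Sketch of the expected lossless clone; chunks modeled as plain dicts.
# clone_chunks and the field names are illustrative, not Open WebUI's API.

def clone_chunks(source: list[dict]) -> list[dict]:
    """Copy every chunk 1:1 -- id, page_content, vector, and metadata."""
    return [
        {
            "id": chunk["id"],
            "page_content": chunk["page_content"],
            "vector": list(chunk["vector"]),      # same embedding, no re-embed
            "metadata": dict(chunk["metadata"]),  # full metadata, incl. headings
        }
        for chunk in source
    ]

file_collection = [
    {
        "id": "chunk-1",
        "page_content": "Intro text under Section 1.1",
        "vector": [0.1, 0.2, 0.3],
        "metadata": {"headings": ["Chapter 1", "Section 1.1"]},
    }
]

kb_collection = clone_chunks(file_collection)

# The invariant this issue asks for: every field identical across collections.
assert kb_collection == file_collection
```

The key point is that no splitter or embedder runs during the copy; the target collection receives exactly the bytes the source collection holds.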

Actual Behavior

There is a critical design flaw in the RAG pipeline that causes data corruption when a pre-processed file is added to a Knowledge Base. Instead of performing a clean, 1:1 clone of the original high-fidelity vectors, the system re-processes a lower-fidelity version of the file's content.

This flawed second pass results in chunks being stored in the Knowledge Base collection with different page_content and boundaries than the original, high-fidelity chunks. This corruption of the core page_content compromises the integrity of the RAG system, with the loss of metadata serving as irrefutable proof of the faulty process.


  • The primary issue is page_content corruption. The text of chunks within the Knowledge Base collection is inconsistent with the original chunks from the source file-* collection. This can be verified by directly inspecting the vector database. As page_content is the core data used for retrieval, this is a critical flaw that leads to inconsistent vector representations and undermines the reliability of the RAG system.

  • As secondary evidence of this flawed re-processing, rich structural metadata captured by the initial Loader pass is completely stripped out.
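The failure mode described above can be reproduced in miniature. The toy splitters below are assumptions (not Open WebUI or LangChain code); they only show why a second, header-unaware pass over already-split text changes chunk boundaries and strips the headings metadata:

```python
# Toy illustration of the lossy second pass described in this issue.

def header_aware_split(md: str) -> list[dict]:
    """First pass: split on markdown headers, keeping a headings trail."""
    chunks, headings = [], []
    for line in md.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("# ").strip()
            headings = headings[:level - 1] + [title]
        elif line.strip():
            chunks.append({"page_content": line.strip(),
                           "metadata": {"headings": list(headings)}})
    return chunks

def naive_resplit(chunks: list[dict], size: int = 20) -> list[dict]:
    """Second pass: re-split the concatenated text by length; metadata is gone."""
    text = " ".join(c["page_content"] for c in chunks)
    return [{"page_content": text[i:i + size], "metadata": {}}
            for i in range(0, len(text), size)]

md = "# Chapter 1\nIntro line.\n## Section 1.1\nBody line."
first = header_aware_split(md)
second = naive_resplit(first)

assert first[1]["metadata"]["headings"] == ["Chapter 1", "Section 1.1"]
assert all(not c["metadata"] for c in second)   # headings stripped on pass two
assert first[0]["page_content"] != second[0]["page_content"]  # boundaries moved
```

Any pipeline that re-derives KB chunks from flattened text, rather than cloning the stored chunks, will show both symptoms at once.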

Steps to Reproduce

There is a design issue in the pipeline for adding an already-processed file to a Knowledge Base. The current implementation causes data inconsistency and a loss of rich metadata.

  1. Set the environment variable TEXT_SPLITTER=markdown_header.
  2. Create or use any markdown-structured test file, e.g. test.md.
  3. Upload this test.md file to a new Knowledge Base.
  4. Using a vector database inspection tool, query and retrieve all chunks from the source file-* collection.
    • Observe: Note the page_content.
    • Observe: Note the headings metadata array is complete, e.g. ['Chapter 1', 'Section 1.1'].
  5. Query and retrieve all chunks for the same file from the target Knowledge Base collection.
    • Compare: Note that the page_content differs. This is the core data corruption.
    • Observe: Note that the headings metadata array is incomplete, e.g. ['Section 1.1']. This is the proof of the lossy re-processing.

Logs & Screenshots

[owui250726.log](https://github.com/user-attachments/files/21450846/owui250726.log)

Additional Information

No response

GiteaMirror added the bug label 2025-11-11 16:36:48 -06:00
Reference: github-starred/open-webui#5879