issue: Architectural Flaw: RAG pipeline corrupts chunk content when adding files to a Knowledge Base #5879

Closed
opened 2025-11-11 16:36:48 -06:00 by GiteaMirror · 0 comments
Owner

Originally created by @christian-hawk on GitHub (Jul 27, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

0.6.18

Ollama Version (if applicable)

No response

Operating System

Ubuntu 22.04

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When a file is added to a Knowledge Base, its chunks and their full, rich metadata (including the headings list from MarkdownHeaderTextSplitter) should be perfectly copied or cloned from the original file-* collection into the Knowledge Base's collection.

The vector representation of a given document chunk should be identical across all collections it belongs to, preserving data integrity and search consistency.

  • The page_content, metadata, and vector representation of any given chunk must be identical across all collections it resides in.
  • Adding a file to a Knowledge Base must be a lossless, 1:1 cloning operation of all data points (ID, page_content, vector, metadata) from the source file-* collection to the target KB collection.
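The lossless clone requested above can be sketched in a few lines. This is illustrative only, not Open WebUI's actual code: the chunk fields (`id`, `page_content`, `vector`, `metadata`) and the `clone_chunks` helper are assumptions modeling what a 1:1 copy between collections would look like.

```python
# Sketch of the expected lossless clone; chunks modeled as plain dicts.
# clone_chunks and the field names are illustrative, not Open WebUI's API.

def clone_chunks(source: list[dict]) -> list[dict]:
    """Copy every chunk 1:1 -- id, page_content, vector, and metadata."""
    return [
        {
            "id": chunk["id"],
            "page_content": chunk["page_content"],
            "vector": list(chunk["vector"]),      # same embedding, no re-embed
            "metadata": dict(chunk["metadata"]),  # full metadata, incl. headings
        }
        for chunk in source
    ]

file_collection = [
    {
        "id": "chunk-1",
        "page_content": "Intro text under Section 1.1",
        "vector": [0.1, 0.2, 0.3],
        "metadata": {"headings": ["Chapter 1", "Section 1.1"]},
    }
]

kb_collection = clone_chunks(file_collection)

# The invariant this issue asks for: every field identical across collections.
assert kb_collection == file_collection
```

The key point is that no splitter or embedder runs during the copy; the target collection receives exactly the bytes the source collection holds.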

Actual Behavior

There is a critical design flaw in the RAG pipeline that causes data corruption when a pre-processed file is added to a Knowledge Base. Instead of performing a clean, 1:1 clone of the original high-fidelity vectors, the system re-processes a lower-fidelity version of the file's content.

This flawed second pass results in chunks being stored in the Knowledge Base collection with different page_content and boundaries than the original, high-fidelity chunks. This corruption of the core page_content compromises the integrity of the RAG system, with the loss of metadata serving as irrefutable proof of the faulty process.


  • The primary issue is page_content corruption. The text of chunks within the Knowledge Base collection is inconsistent with the original chunks from the source file-* collection. This can be verified by directly inspecting the vector database. As page_content is the core data used for retrieval, this is a critical flaw that leads to inconsistent vector representations and undermines the reliability of the RAG system.

  • As secondary evidence of this flawed re-processing, rich structural metadata captured by the initial Loader pass is completely stripped out.
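The failure mode described above can be reproduced in miniature. The toy splitters below are assumptions (not Open WebUI or LangChain code); they only show why a second, header-unaware pass over already-split text changes chunk boundaries and strips the headings metadata:

```python
# Toy illustration of the lossy second pass described in this issue.

def header_aware_split(md: str) -> list[dict]:
    """First pass: split on markdown headers, keeping a headings trail."""
    chunks, headings = [], []
    for line in md.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("# ").strip()
            headings = headings[:level - 1] + [title]
        elif line.strip():
            chunks.append({"page_content": line.strip(),
                           "metadata": {"headings": list(headings)}})
    return chunks

def naive_resplit(chunks: list[dict], size: int = 20) -> list[dict]:
    """Second pass: re-split the concatenated text by length; metadata is gone."""
    text = " ".join(c["page_content"] for c in chunks)
    return [{"page_content": text[i:i + size], "metadata": {}}
            for i in range(0, len(text), size)]

md = "# Chapter 1\nIntro line.\n## Section 1.1\nBody line."
first = header_aware_split(md)
second = naive_resplit(first)

assert first[1]["metadata"]["headings"] == ["Chapter 1", "Section 1.1"]
assert all(not c["metadata"] for c in second)   # headings stripped on pass two
assert first[0]["page_content"] != second[0]["page_content"]  # boundaries moved
```

Any pipeline that re-derives KB chunks from flattened text, rather than cloning the stored chunks, will show both symptoms at once.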

Steps to Reproduce

There is a design issue in the pipeline for adding an already-processed file to a Knowledge Base. The current implementation causes data inconsistency and a loss of rich metadata.

  1. Set the environment variable TEXT_SPLITTER=markdown_header.
  2. Create or use any markdown-structured test file, e.g. test.md.
  3. Upload this test.md file to a new Knowledge Base.
  4. Using a vector database inspection tool, query and retrieve all chunks from the source file-* collection.
    • Observe: Note the page_content.
    • Observe: Note the headings metadata array is complete, e.g. ['Chapter 1', 'Section 1.1'].
  5. Query and retrieve all chunks for the same file from the target Knowledge Base collection.
    • Compare: Note that the page_content differs. This is the core data corruption.
    • Observe: Note that the headings metadata array is incomplete, e.g. ['Section 1.1']. This is the proof of the lossy re-processing.

Logs & Screenshots

[owui250726.log](https://github.com/user-attachments/files/21450846/owui250726.log)

Additional Information

No response

GiteaMirror added the bug label 2025-11-11 16:36:48 -06:00
Reference: github-starred/open-webui#5879