[GH-ISSUE #20948] feat: Preserve File Metadata in Pipelines & Implement Customizable Loader Hooks #34869

Closed
opened 2026-04-25 09:03:08 -05:00 by GiteaMirror · 0 comments

Originally created by @burakkilic11 on GitHub (Jan 26, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/20948

Check Existing Issues

  • I have searched for all existing open AND closed issues and discussions for similar requests. I have found none that is comparable to my request.

Verify Feature Scope

  • I have read through and understood the scope definition for feature requests in the Issues section. I believe my feature request meets the definition and belongs in the Issues section instead of the Discussions.

Problem Description

Currently, when a file is uploaded via the chat "+" button, the internal document loader automatically processes the file, extracts the text, and strips all original file metadata (path, id, files object) before the request reaches the Pipelines (Filters).

Even with a shared volume setup (e.g., mapping /app/backend/data/uploads to both OpenWebUI and Pipelines containers), there is no way for a filter to know which file belongs to the current message because the references are removed from the body and kwargs. This prevents developers from implementing custom OCR (Tesseract), specialized layout analysis, or private local processing within the Pipeline framework.

Desired Solution you'd like

I would like to see two main improvements:

Metadata Preservation: Ensure the original files object (containing id, filename, and path) remains included in the body payload sent to Pipelines, even after the document loader has processed the file.
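To illustrate what a filter could do if the files object were preserved, here is a minimal sketch. The payload shape is an assumption taken from this request (a top-level "files" list whose entries carry "id", "filename", and "path"); it is not the current Open WebUI payload, and `extract_file_refs` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def extract_file_refs(body: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the id, filename, and path of each file attached to the request.

    Assumes the requested (hypothetical) payload shape: a top-level "files"
    list whose entries are dicts carrying "id", "filename", and "path".
    """
    refs = []
    for entry in body.get("files") or []:
        if not isinstance(entry, dict):
            continue  # skip anything that is not a file record
        # Keep only the three fields this request asks to preserve.
        refs.append({key: str(entry.get(key, "")) for key in ("id", "filename", "path")})
    return refs
```

A filter could call such a helper on the incoming body and hand each path to its own OCR or layout-analysis step. With today's payload it would simply return an empty list.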

Loader Hooks: Implement a mechanism that allows Pipelines to "hook" into or override the core document loader stage. This would enable developers to replace the default text extraction logic with custom solutions (like specialized local OCR) directly within the OpenWebUI ecosystem.

Alternatives Considered

Shared Volumes: We tried using shared volumes to access the files directly, but since the file_id or path is stripped from the JSON payload, the Pipeline cannot identify the correct file on disk.
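For reference, this is roughly the lookup a Pipeline could perform on the shared volume if the file id were available. It is a sketch under a loud assumption: that uploaded files are stored with the id as a filename prefix followed by an underscore. That naming scheme and the helper name `find_upload` are assumptions for illustration and should be checked against the actual uploads directory.

```python
import glob
from pathlib import Path
from typing import Optional


def find_upload(uploads_dir: Path, file_id: str) -> Optional[Path]:
    """Return the first file in uploads_dir whose name starts with "<file_id>_".

    The "<file_id>_<original name>" naming scheme is an assumption, not a
    documented guarantee.
    """
    # Escape the id so characters like "[" or "*" are matched literally.
    pattern = f"{glob.escape(file_id)}_*"
    matches = sorted(uploads_dir.glob(pattern))
    return matches[0] if matches else None
```

Without the id in the payload there is nothing to pass as `file_id`, which is exactly the gap described above.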

Manual RAG: We considered disabling RAG, but the document loader still executes by default upon upload, resulting in the same metadata loss.

Additional Context

The current payload received by the pipeline is too stripped down for advanced processing:

{
  "stream": false,
  "model": "your_model_id",
  "messages": [{
    "role": "user",
    "content": "### Task: ... [Extracted text is present, but file references are missing] ..."
  }],
  "user": { "name": "...", "role": "admin" }
}

Providing a way to intercept the file before or during the loading phase would significantly expand the extensibility of OpenWebUI.

Reference: github-starred/open-webui#34869