[PR #21977] fix: include file metadata in knowledge base context sent to LLM #65228

Open
opened 2026-05-06 11:00:37 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/21977
Author: @kjpoccia
Created: 2/28/2026
Status: 🔄 Open

Base: dev ← Head: feat/filename-fileid-kb-search


📝 Commits (1)

  • 38e438a add filename and fileid to returned sources from kb search

📊 Changes

1 file changed (+4 additions, -0 deletions)


📝 backend/open_webui/utils/middleware.py (+4 -0)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions to discuss your idea/fix with the community before creating a pull request, and describe your changes before submitting a pull request.

This is to ensure large feature PRs are discussed with the community first, before starting work on it. If the community does not want this feature or it is not relevant for Open WebUI as a project, it can be identified in the discussion before working on the feature and submitting the PR.

Before submitting, make sure you've checked the following:

  • [X] Target branch: Verify that the pull request targets the dev branch. PRs targeting main will be immediately closed.
  • [X] Description: Provide a concise description of the changes made in this pull request down below.
  • [X] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • [ ] Documentation: Add docs in Open WebUI Docs Repository. Document user-facing behavior, environment variables, public APIs/interfaces, or deployment steps.
  • [ ] Dependencies: Are there any new or upgraded dependencies? If so, explain why, update the changelog/docs, and include any compatibility notes. Actually run the code/function that uses updated library to ensure it doesn't crash.
  • [X] Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Include reproducible steps to demonstrate the issue before the fix. Test edge cases (URL encoding, HTML entities, types). Take this as an opportunity to make screenshots of the feature/fix and include them in the PR description.
  • [X] Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • [X] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [X] Design & Architecture: Prefer smart defaults over adding new settings; use local state for ephemeral UI logic. Open a Discussion for major architectural or UX changes.
  • [X] Git Hygiene: Keep PRs atomic (one logical change). Clean up commits and rebase on dev to ensure no unrelated commits (e.g. from main) are included. Push updates to the existing PR branch instead of closing and reopening.
  • [X] Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

  • 📄 Preserve file metadata in knowledge base context. Retrieved chunks now include file_id and file_name when formatted for LLM context, preventing document identity loss and aligning with query_knowledge_files behavior.

Description

  • Retrieved knowledge base chunks are currently flattened into LLM context without preserving document-level identity (current source name is the title of the knowledge base itself). This prevents the model from distinguishing between files within the same knowledge base.
    This PR includes file_name and file_id in the formatted context, aligning behavior with the native query_knowledge_files tool.
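
    To illustrate the idea, here is a minimal sketch of context formatting that keeps per-file identity. It is not the PR's actual diff to middleware.py: the function name format_sources, the shape of each source dict (a "text" key plus a "metadata" dict), and the <source> tag layout are all assumptions made for illustration.

```python
from html import escape


def format_sources(sources):
    """Render retrieved chunks as <source> tags, keeping per-file identity.

    Hypothetical helper: each item in `sources` is assumed to look like
    {"text": "...", "metadata": {"file_id": "...", "file_name": "..."}}.
    """
    parts = []
    for index, source in enumerate(sources, start=1):
        attrs = [f'id="{index}"']
        # Only emit the metadata attributes that are actually present,
        # so chunks without file information still format cleanly.
        for key in ("file_id", "file_name"):
            value = source.get("metadata", {}).get(key)
            if value:
                attrs.append(f'{key}="{escape(str(value), quote=True)}"')
        parts.append(f"<source {' '.join(attrs)}>{source['text']}</source>")
    return "\n".join(parts)


chunks = [
    {
        "text": "Budget approved.",
        "metadata": {"file_id": "f1", "file_name": "2026-01 minutes.md"},
    },
    {"text": "Budget revised.", "metadata": {}},
]
print(format_sources(chunks))
```

    With the attributes present, two chunks from different files in the same knowledge base no longer look identical to the model; a chunk with no metadata falls back to the bare id.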

Added

  • Inclusion of file_id and file_name in knowledge base context formatting.

Changed

  • Updated LLM context construction for knowledge base retrieval to preserve document identity.

Deprecated

  • None

Removed

  • None

Fixed

  • Fixed issue where document identity was lost during knowledge base context formatting, causing ambiguous or misleading multi-document responses.

Security

  • None

Breaking Changes

  • None

Additional Information

  • There are a few reasons for this change:
  1. The model can better respond to user queries when it knows which files the information was retrieved from.
  2. If the file_id and file_name are included, we can better take advantage of multistep tool-calling, for instance calling a certain tool when a certain type of file is returned, or using the retrieved file_id to fetch the original file from Open WebUI's files endpoint.
  3. Consistency with the existing query_knowledge_files functionality.
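
The multistep use case in point 2 could look roughly like the sketch below, which groups retrieved chunks by file and builds a URL for fetching an original file. The helper names, the source dict shape, and especially the path template passed to file_content_url are assumptions for illustration; check the Open WebUI API for the real files route before relying on it.

```python
from collections import defaultdict
from urllib.parse import quote


def group_chunks_by_file(sources):
    """Group chunk texts under (file_id, file_name) so a follow-up step can act per file.

    Hypothetical helper: assumes each source carries a "metadata" dict.
    """
    grouped = defaultdict(list)
    for source in sources:
        meta = source.get("metadata", {})
        key = (meta.get("file_id"), meta.get("file_name"))
        grouped[key].append(source["text"])
    return dict(grouped)


def file_content_url(base_url, file_id, path_template="/api/v1/files/{file_id}/content"):
    """Build a URL for the original file.

    The default path_template is an assumption, not taken from this PR.
    """
    return base_url.rstrip("/") + path_template.format(file_id=quote(file_id, safe=""))


retrieved = [
    {"text": "Item A discussed.", "metadata": {"file_id": "f1", "file_name": "jan.md"}},
    {"text": "Item A closed.", "metadata": {"file_id": "f2", "file_name": "feb.md"}},
    {"text": "Item A budget.", "metadata": {"file_id": "f1", "file_name": "jan.md"}},
]
by_file = group_chunks_by_file(retrieved)
print(by_file)
print(file_content_url("http://localhost:3000/", "f1"))
```

Without file_id in the returned sources, neither step is possible, because every chunk carries only the knowledge base title.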

Screenshots or Videos

  • In the example below, we have a KB containing meeting minutes. When we ask for details about a certain topic, retrieval returns chunks from several files, but the model can't tell that the chunks come from different files. To the model, they're all from the same source.
    Screenshot 2026-02-27 at 9 12 11 PM
    Screenshot 2026-02-27 at 9 12 18 PM

  • The screenshot below shows the model's response after the fix:
    Screenshot 2026-02-28 at 1 42 35 PM

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 11:00:37 -05:00

Reference: github-starred/open-webui#65228