[PR #19095] feat: Adding file metadata to hybrid search #11885

Open
opened 2025-11-11 19:59:34 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/19095
Author: @jmleksan
Created: 11/10/2025
Status: 🔄 Open

Base: dev ← Head: enh/hybrid-search-with-metadata


📝 Commits (2)

  • e0d5de1 Merge pull request #18978 from open-webui/dev
  • 289a1b4 Added metadata to hybrid search

📊 Changes

1 file changed (+36 additions, -1 deletions)

📝 backend/open_webui/retrieval/utils.py (+36 -1)

📄 Description

Pull Request Checklist

Before submitting, make sure you've checked the following:

  • Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch will lead to immediate closure of the PR.
  • Description: Provide a concise description of the changes made in this pull request down below.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: If necessary, update relevant documentation Open WebUI Docs like environment variables, the tutorials, or other documentation sources.
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include it in the PR description.
  • Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • feat

Changelog Entry

  • 📚 Hybrid search now indexes high-signal metadata (filenames, titles, headings, sources, and snippets) alongside chunk text, enabling keyword queries to surface documents where the term appears only in metadata.

Description

  • Improve hybrid retrieval by blending document content with selected metadata, so that queries can match keywords that appear only in file metadata such as the file name or source. For example, you may want to search for documents of a certain type, or for documents whose date appears only in the file name.

Use Case: “Show me our PPO documents”

  • A benefits specialist uploads plan files named PPO_insurance_2024.pdf, medical_PPO_plan.docx, etc.
  • These documents mention “PPO” only in their filenames or titles, so hybrid search on just the body text previously missed them.
  • After the metadata-aware BM25 augmentation, a query like PPO benefits or just PPO now surfaces every relevant document because filenames, titles, headings, and snippets are indexed alongside the content.
  • Partial-name matches work as well: typing PPO returns the PPO-related files even though the term is only one part of each filename and is never repeated in the body text.
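
The effect described in this use case can be reproduced with a small standalone scorer. The sketch below is not the code from this PR: `tokenize`, `bm25_scores`, and the sample chunks are hypothetical, and the scoring uses a common BM25 variant with a non-negative IDF rather than whatever retriever Open WebUI actually wires in. It only illustrates that a term present solely in the filename scores zero against body text and scores above zero once the filename is appended to the indexed text.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and split on anything that is not a letter or digit,
    # so "PPO_insurance_2024.pdf" yields ["ppo", "insurance", "2024", "pdf"].
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n if n else 0.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # Terms absent from the document contribute nothing.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical chunks: the body text never mentions "PPO"; only one filename does.
chunks = [
    {"text": "Deductibles and copays for the 2024 plan year.", "name": "PPO_insurance_2024.pdf"},
    {"text": "Office holiday schedule and parking rules.", "name": "office_handbook.pdf"},
]

body_only = bm25_scores("PPO", [c["text"] for c in chunks])
with_metadata = bm25_scores("PPO", [f'{c["text"]} {c["name"]}' for c in chunks])
```

With body text alone, neither chunk matches the query. After the filename is appended, only the first chunk receives a positive score, which is the behavior the use case relies on.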

Added

  • Augmented the BM25 input text with filename, title, section heading, source, and snippet metadata.
  • Tokenized filenames (replacing underscores, dashes, and extensions) to support partial-name matches such as “pdf”.
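
A minimal sketch of the two additions above, under stated assumptions: the function names (`tokenize_filename`, `enrich_text_for_bm25`) and the metadata keys in `METADATA_KEYS` are hypothetical, since the PR description does not show the actual identifiers used in `backend/open_webui/retrieval/utils.py`. The enriched string is meant for the keyword (BM25) index only; the chunk text itself is left unchanged.

```python
import re
from pathlib import PurePath

# Hypothetical metadata keys; the real keys in utils.py are not shown in this PR description.
METADATA_KEYS = ("name", "title", "headings", "source", "snippet")


def tokenize_filename(filename: str) -> str:
    """Split a filename into space-separated tokens, keeping the extension as its own token."""
    path = PurePath(filename)
    stem, ext = path.stem, path.suffix.lstrip(".")
    # Replace underscores, dashes, and dots with spaces so each part is matchable.
    parts = re.sub(r"[_\-.]+", " ", stem).split()
    if ext:
        parts.append(ext)
    return " ".join(parts)


def enrich_text_for_bm25(text: str, metadata: dict) -> str:
    """Return the chunk text followed by selected metadata values."""
    extras = []
    for key in METADATA_KEYS:
        value = metadata.get(key)
        if not value:
            continue  # Skip missing or empty metadata fields.
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        value = str(value)
        if key == "name":
            # Keep the raw filename and add its tokenized form for partial-name matches.
            value = f"{value} {tokenize_filename(value)}"
        extras.append(value)
    return "\n".join([text, *extras]) if extras else text
```

For example, `tokenize_filename("medical_PPO_plan.docx")` yields the tokens `medical`, `PPO`, `plan`, and `docx`, so a query for any one of them can match the chunk through the keyword index.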

Fixed

  • Ensured hybrid search returns documents when the query term exists only in metadata (e.g., filenames or web page titles).

Additional Information

  • Affected code: backend/open_webui/retrieval/utils.py

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-11 19:59:34 -06:00
Reference: github-starred/open-webui#11885