[PR #20040] [CLOSED] fix: add duplicate check to batch file processing #25440

Closed
opened 2026-04-20 05:56:19 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/20040
Author: @silentoplayz
Created: 12/19/2025
Status: Closed

Base: dev ← Head: fix/add_files_to_knowledge_batch


📝 Commits (1)

  • ad82706 fix: add duplicate check to batch file processing

📊 Changes

1 file changed (+23 additions, -1 deletions)

View changed files

📝 backend/open_webui/routers/retrieval.py (+23 -1)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions to discuss your idea/fix with the community before creating a pull request, and describe your changes before submitting a pull request.

This is to ensure large feature PRs are discussed with the community first, before starting work on it. If the community does not want this feature or it is not relevant for Open WebUI as a project, it can be identified in the discussion before working on the feature and submitting the PR.

Before submitting, make sure you've checked the following:

  • Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch will lead to immediate closure of the PR.
  • Description: Provide a concise description of the changes made in this pull request down below.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: If necessary, update relevant documentation Open WebUI Docs like environment variables, the tutorials, or other documentation sources.
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include it in the PR description.
  • Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • fix: Bug fix or error correction

Changelog Entry

Description

This PR fixes a bug where add_files_to_knowledge_batch did not check for existing files in the vector database, causing duplicate embeddings when adding the same file multiple times via the batch endpoint. The fix implements a hash check for each file in process_files_batch inside retrieval.py, mirroring the logic of the single-file add endpoint.

Added

  • Duplicate check using file hash in process_files_batch to prevent redundant vector insertions.
  • hash field to the Document metadata during batch creation to enable robust duplicate detection.

Changed

  • Updated backend/open_webui/routers/retrieval.py to calculate file hash early and query the vector database before processing.

Fixed

  • Fixes #10679: Batch add file to knowledge doesn't check for existence (duplicate vectors).

Additional Information

Refactored process_files_batch to:

  1. Calculate SHA256 hash of file content.
  2. Query VECTOR_DB_CLIENT for existing documents with this hash.
  3. If found, log the duplicate and skip processing (returning a failed status with DUPLICATE_CONTENT error).
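The three steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual diff from `retrieval.py`: `InMemoryVectorStore`, `has_hash`, `insert`, `sha256_of`, and `process_batch` are hypothetical names introduced only for this example, and the real `VECTOR_DB_CLIENT` query interface is not shown in this PR description. Only the error message text is taken from the verification output below.

```python
import hashlib
from dataclasses import dataclass, field

# Message text copied from the verification output in this PR description.
DUPLICATE_CONTENT = (
    "Duplicate content detected. Please provide unique content to proceed."
)


@dataclass
class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB client (not Open WebUI's API)."""

    # Maps a collection name to the metadata dicts stored in it.
    items: dict = field(default_factory=dict)

    def has_hash(self, collection: str, content_hash: str) -> bool:
        """Return True if any stored document carries this content hash."""
        stored = self.items.get(collection, [])
        return any(meta.get("hash") == content_hash for meta in stored)

    def insert(self, collection: str, metadata: dict) -> None:
        """Store a document's metadata (embeddings omitted in this sketch)."""
        self.items.setdefault(collection, []).append(metadata)


def sha256_of(text: str) -> str:
    """Step 1: compute the SHA256 hex digest of the file content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def process_batch(store, collection, files):
    """Process (file_id, content) pairs, skipping content already stored."""
    results = []
    for file_id, content in files:
        content_hash = sha256_of(content)
        # Step 2: look for an existing document with the same hash.
        if store.has_hash(collection, content_hash):
            # Step 3: report the duplicate as failed and skip insertion.
            results.append(
                {"file_id": file_id, "status": "failed", "error": DUPLICATE_CONTENT}
            )
            continue
        # Keep the hash in the metadata so later batches can detect it.
        store.insert(collection, {"file_id": file_id, "hash": content_hash})
        results.append({"file_id": file_id, "status": "completed", "error": None})
    return results
```

In this sketch, a batch containing the same content twice yields one `completed` and one `failed` result, because the hash is stored in the metadata as soon as the first copy is inserted.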

Verification Output:

A verification script was run to simulate a batch upload of a file that already exists in the vector database.

Running verification for Issue #10679 (Batch Duplicate Check)...
Action: Attempting to process duplicate file batch...
INFO  [open_webui.routers.retrieval] Document with hash 2b63f602a89e88ace6924886ceacd66926b3d7a55dc74a6217b573c58138e961 already exists
✅ SUCCESS: Duplicate file was rejected.
   Error Message: Duplicate content detected. Please provide unique content to proceed.
   Verified: Error message matches DUPLICATE_CONTENT.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 05:56:19 -05:00

Reference: github-starred/open-webui#25440