[GH-ISSUE #10679] Batch add file to knowledge doesn't check for existence #31519

Open
opened 2026-04-25 05:26:05 -05:00 by GiteaMirror · 6 comments

Originally created by @almajo on GitHub (Feb 24, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/10679

Bug Report


Installation Method

git clone

Confirmation:

  • [x] I have read and followed all the instructions provided in the README.md.
  • [x] I am on the latest version of both Open WebUI and Ollama.
  • [ ] I have included the browser console logs.
  • [ ] I have included the Docker container logs.
  • [ ] I have provided the exact steps to reproduce the bug in the "Steps to Reproduce" section below.

Expected Behavior:

When adding a single file to a knowledge base (`/knowledge/{id}/file/add`), the route checks whether a document with the same content already exists before adding it to the vector store; if it does, an error is raised.

The same check is not performed by the `/knowledge/{id}/files/batch/add` route. This can create many duplicates in the vector database, which in the long term degrades RAG quality because duplicates crowd out the top-k results.

I would have expected both routes to treat files the same way.

Actual Behavior:

Files with the same content are added again, and the full RAG pipeline (splitting + embedding) runs again for them.

Description

Bug Summary:
The `/knowledge/{id}/files/batch/add` route adds files to the vector store without checking whether they already exist in the collection, so repeated batch adds create duplicate entries.

Reproduction Details

Steps to Reproduce:

Via API:

  1. Add files to Open WebUI via `/files/` and store the returned file IDs.
  2. Create a knowledge base.
  3. Add the file IDs to the knowledge base using the `/api/v1/knowledge/{knowledge_id}/files/batch/add` endpoint (see the sketch below).
  4. Go to your vector database and check the collection count.
  5. Rerun step 3 with the same file IDs.
  6. Check the collection count again: everything is duplicated.
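
For reference, a minimal reproduction sketch in Python. Only the route paths come from the steps above; the base URL, Bearer-token auth, and the `{"file_id": ...}` request body shape are assumptions and may need to be adapted to the actual API:

```python
# Hypothetical reproduction script: only the route paths are taken from this
# issue; the payload shape and auth header are assumptions.
import requests

BASE_URL = "http://localhost:8080"           # assumed local Open WebUI instance
HEADERS = {"Authorization": "Bearer <API_KEY>"}

KNOWLEDGE_ID = "<knowledge_id>"              # from step 2
FILE_IDS = ["<file_id_1>", "<file_id_2>"]    # returned by the /files/ uploads in step 1


def batch_add(file_ids):
    # Step 3: add the file IDs to the knowledge base via the batch route.
    payload = [{"file_id": fid} for fid in file_ids]  # assumed body shape
    resp = requests.post(
        f"{BASE_URL}/api/v1/knowledge/{KNOWLEDGE_ID}/files/batch/add",
        json=payload,
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()


batch_add(FILE_IDS)  # first run: files are split, embedded and stored
batch_add(FILE_IDS)  # second run: the same content is embedded and stored again
```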

Possible solution

In `save_docs_to_vector_db` we already check for the hash and cancel (in single-doc mode) if the document is a duplicate:


    # Check if entries with the same hash (metadata.hash) already exist
    if metadata and "hash" in metadata:
        result = VECTOR_DB_CLIENT.query(
            collection_name=collection_name,
            filter={"hash": metadata["hash"]},
        )

        if result is not None:
            existing_doc_ids = result.ids[0]
            if existing_doc_ids:
                log.info(f"Document with hash {metadata['hash']} already exists")
                raise ValueError(ERROR_MESSAGES.DUPLICATE_CONTENT)

Here we could likewise check each `doc.metadata["hash"]` and handle duplicates accordingly. Raising an error is awkward on a batch API route; instead, duplicates could be excluded from the `save_docs_to_vector_db` call and returned with process status `failed`.
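
One possible shape for that, sketched under the assumption that `VECTOR_DB_CLIENT.query` behaves as in the snippet above (same `collection_name`/`filter` arguments, `ids` on the result); the helper name `split_duplicate_docs` is hypothetical:

```python
# Hypothetical helper for the batch path: partition incoming docs into new vs.
# already-present, using the same hash lookup as the single-file path above.
def split_duplicate_docs(collection_name, docs):
    new_docs, duplicate_docs = [], []
    for doc in docs:
        doc_hash = (doc.metadata or {}).get("hash")
        result = None
        if doc_hash:
            result = VECTOR_DB_CLIENT.query(
                collection_name=collection_name,
                filter={"hash": doc_hash},
            )
        if result is not None and result.ids and result.ids[0]:
            log.info(f"Document with hash {doc_hash} already exists, skipping")
            duplicate_docs.append(doc)
        else:
            new_docs.append(doc)
    return new_docs, duplicate_docs
```

The batch route could then pass only `new_docs` to `save_docs_to_vector_db` and report the `duplicate_docs` back to the caller with process status `failed` (or a dedicated duplicate status) instead of raising.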

GiteaMirror added the bug, confirmed issue labels 2026-04-25 05:26:19 -05:00

@tjbck commented on GitHub (Feb 25, 2025):

PR Welcome!


@almajo commented on GitHub (Feb 25, 2025):

I will take a look at it later today


@vinsdragonis commented on GitHub (Mar 21, 2025):

I am facing this index duplication problem even with single-file upload while using OpenSearch. From what I have noticed over time, the hash generated and passed is always unique, even for the same unchanged file.


@almajo commented on GitHub (Mar 21, 2025):

Yes, I've seen the same behaviour, @vinsdragonis. You can upload the exact same file twice.

Unfortunately I have a lot on my plate right now, so I haven't had a chance to look at this further.


@Mte90 commented on GitHub (Jun 16, 2025):

I can confirm the issue: the file is still located on disk, but it is uploaded again (the new copy is not removed), and on the UI side the already uploaded file does not appear.
So the UX is somewhat broken, because it is not clear whether OWUI is still processing already uploaded files that weren't processed (this can happen if you restart OWUI, for example, or refresh the workspace page). The only workaround is to remove the workspace and start over, which is not very practical.

![Image](https://github.com/user-attachments/assets/c5566e85-fcea-4e29-8d18-849458dde78c)

As you can see in this screenshot, appendix-2.md was detected as a duplicate, but it does not appear in the workspace list.


@vinsdragonis commented on GitHub (Jun 18, 2025):

The cause of this is the use of UUID4 to generate hashes for the files. UUID4 values are random and therefore unique on every call, so a duplicate entry is created each time. If UUID version 3 or 5 were used instead, this could be rectified.

I'm willing to collaborate on this if anyone's up for it.
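
For illustration only (not a claim about the actual Open WebUI code): a deterministic, content-based identifier such as a SHA-256 digest or `uuid5` yields the same value for identical file content, whereas `uuid4` differs on every call and so can never be used to detect re-uploads:

```python
import hashlib
import uuid

content = b"the exact same file bytes"

# uuid4 is random: two calls over identical content never match.
print(uuid.uuid4() == uuid.uuid4())  # False

# Deterministic alternatives: identical content -> identical identifier.
digest = hashlib.sha256(content).hexdigest()
print(digest == hashlib.sha256(content).hexdigest())  # True
print(uuid.uuid5(uuid.NAMESPACE_URL, digest) == uuid.uuid5(uuid.NAMESPACE_URL, digest))  # True
```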

Reference: github-starred/open-webui#31519