[GH-ISSUE #10679] Batch add file to knowledge doesn't check for existence #54656
Originally created by @almajo on GitHub (Feb 24, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/10679
Bug Report
Installation Method
git clone
Expected Behavior:
When adding a single file to a knowledge store (`/knowledge/{id}/file/add`), the server checks whether a document with the same hash already exists before adding it to the vector store; if it does, an error is thrown.
The same check is missing from the `/knowledge/{id}/files/batch/add` route. This can create many duplicates in the vector database, which in the long term degrades RAG quality because duplicates crowd the top-k results.
I would have expected both routes to treat files the same way.
Actual Behavior:
Files with the same content are added again (the full RAG pipeline, splitting plus embedding, is also run again).
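For context, here is a simplified sketch of the kind of check the single-file route performs; the class, function names, and `has_hash` query are illustrative stand-ins, not the actual open-webui code:

```python
import hashlib


class InMemoryCollection:
    """Toy stand-in for a vector collection, tracking only content hashes."""

    def __init__(self) -> None:
        self._hashes: set[str] = set()

    def has_hash(self, file_hash: str) -> bool:
        return file_hash in self._hashes

    def add(self, content: bytes, metadata: dict) -> None:
        self._hashes.add(metadata["hash"])


def add_file_to_knowledge(collection: InMemoryCollection, content: bytes) -> None:
    """Single-file path: refuse to re-embed content whose hash already
    exists in the collection, mirroring the error the issue describes."""
    file_hash = hashlib.sha256(content).hexdigest()
    if collection.has_hash(file_hash):
        raise ValueError("Duplicate content detected; file was not added.")
    collection.add(content, metadata={"hash": file_hash})
```

Calling `add_file_to_knowledge` twice with the same bytes raises on the second call; the batch route, by contrast, never performs the equivalent lookup.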
Description
Bug Summary:
The batch route (`/knowledge/{id}/files/batch/add`) skips the duplicate-hash check that the single-file route performs, so identical files are re-embedded and duplicated in the vector database.
Reproduction Details
Steps to Reproduce:
Via API:
1. Call the `/api/v1/knowledge/{knowledge_id}/files/batch/add` endpoint with a set of file IDs (a request sketch follows below).
2. Go to your database and check the collection count.
3. Rerun step 1 with the same file IDs.
4. Go to your database and check the collection count again: everything will be duplicated.
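A rough reproduction sketch using Python's `requests`; the request payload shape (assumed here to be a JSON list of `{"file_id": ...}` objects), the base URL, and the bearer token are assumptions to check against the actual router:

```python
import requests

BASE = "http://localhost:8080"                 # assumed local open-webui instance
HEADERS = {"Authorization": "Bearer <api-key>"}  # placeholder credentials
KNOWLEDGE_ID = "<knowledge-id>"
FILE_IDS = ["<file-id-1>", "<file-id-2>"]

url = f"{BASE}/api/v1/knowledge/{KNOWLEDGE_ID}/files/batch/add"
payload = [{"file_id": fid} for fid in FILE_IDS]  # assumed schema

# Call the batch endpoint twice with identical file IDs. With the bug
# present, the second call re-embeds everything instead of rejecting
# duplicates, doubling the vector collection's entry count.
for attempt in (1, 2):
    resp = requests.post(url, json=payload, headers=HEADERS)
    print(attempt, resp.status_code, resp.json())
```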
Possible solution
In `save_docs_to_vector_db` we check for the hash and cancel (in single-doc mode) if it is a duplicate. Here we could also check, for each `doc.metadata.hash`, whether it already exists and, when it does, handle it accordingly. Throwing an error is hard on a batch API route; instead, exclude the duplicates from the `save_docs_to_vector_db` call and return them with process status `failed`.
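A minimal sketch of that idea; the `Doc` type, the `get_existing_hashes` helper, and the result shape are assumptions for illustration, with `save_docs_to_vector_db` stubbed in place of the existing pipeline:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for the documents passed to save_docs_to_vector_db."""
    content: str
    metadata: dict = field(default_factory=dict)


def get_existing_hashes(collection_name: str) -> set[str]:
    """Hypothetical helper: a real version would query the vector DB
    for the content hashes already stored in this collection."""
    return set()


def save_docs_to_vector_db(docs: list[Doc], collection_name: str) -> None:
    """Placeholder for the existing split + embed pipeline."""


def batch_add_docs(docs: list[Doc], collection_name: str) -> list[dict]:
    """Exclude docs whose hash already exists, embed the rest, and
    report duplicates with process status 'failed' instead of raising."""
    existing = get_existing_hashes(collection_name)
    new_docs, results = [], []
    for doc in docs:
        if doc.metadata.get("hash") in existing:
            # Skip rather than raise: on a batch route the remaining
            # files should still be processed.
            results.append({"file": doc.metadata.get("name"), "status": "failed"})
        else:
            new_docs.append(doc)
            results.append({"file": doc.metadata.get("name"), "status": "completed"})
    if new_docs:
        save_docs_to_vector_db(new_docs, collection_name)
    return results
```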
@tjbck commented on GitHub (Feb 25, 2025):
PR Welcome!
@almajo commented on GitHub (Feb 25, 2025):
I will take a look at it later today
@vinsdragonis commented on GitHub (Mar 21, 2025):
I am facing this index duplication problem even with single-file uploads while using OpenSearch. From what I have noticed over time, the hash that is generated and passed is always unique, even for the same unchanged file.
@almajo commented on GitHub (Mar 21, 2025):
Yes, I've seen the same behaviour @vinsdragonis. You can upload the exact same file twice.
Unfortunately I have a lot on my hands right now, so I haven't had the chance to look into this further.
@Mte90 commented on GitHub (Jun 16, 2025):
I can confirm the issue: the file is still on disk, but it is uploaded again (the new copy is not removed), and on the UI side the already-uploaded file does not appear.
So the UX is somewhat broken, because it is not clear whether OWUI is still processing already-uploaded files that weren't processed (this can happen if you restart OWUI, for example, or refresh the workspace page). The only workaround is to remove the workspace and start over, which is not very practical.
As you can see in this screenshot, appendix-2.md was detected as a duplicate, but it does not appear in the workspace list.
@vinsdragonis commented on GitHub (Jun 18, 2025):
The cause of this is the use of UUID4 to generate the files' "hashes". UUID4 values are random, so they are always unique and a duplicate is created. If the name-based UUID3 or UUID5 were used instead, this could be rectified.
I'm willing to collaborate on this if anyone's up for it.
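A small standalone demo of the distinction: `uuid4()` never repeats even for identical content, whereas a name-based UUID5 or a plain SHA-256 digest is derived from the input and therefore stable across uploads of the same bytes:

```python
import hashlib
import uuid

content = b"the exact same file bytes"

# Random UUIDs differ on every call, even for identical content,
# so using them as a "hash" defeats duplicate detection.
print(uuid.uuid4() == uuid.uuid4())                      # False

# Name-based UUID5 maps the same input to the same identifier.
u1 = uuid.uuid5(uuid.NAMESPACE_URL, content.decode())
u2 = uuid.uuid5(uuid.NAMESPACE_URL, content.decode())
print(u1 == u2)                                          # True

# A content digest such as SHA-256 behaves the same way.
print(hashlib.sha256(content).hexdigest() ==
      hashlib.sha256(content).hexdigest())               # True
```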