mirror of
https://github.com/open-webui/open-webui.git
synced 2026-03-11 00:04:08 -05:00
feat: Avoid duplicate files on storage backend #4839
Originally created by @jcsaaddupuy on GitHub (Apr 14, 2025).
Check Existing Issues
Problem Description
Uploading a file twice ends up creating the same file twice on the storage backend.
Desired Solution you'd like
One solution could be to use the file's content hash (MD5/SHA-1) as the filename, keeping the extension, and to upload the file only if it does not already exist.
For example (quick and dirty, could be refactored), for the local and S3 providers:
A direct benefit would be to avoid filling the instance's hard drive and to avoid unnecessary S3/GCS storage, as we would only store non-duplicate files.
This might make maintenance harder, as we would lose the original filename on the storage backend, but this may be a non-issue (it would not be one in my use case).
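The snippet from the original issue is not preserved here. As a rough stand-in, the idea could look something like this for a local directory. This is only a sketch under assumptions: `content_hash_name` and `store_once` are hypothetical helpers, not Open WebUI functions, and SHA-256 is swapped in for the MD5/SHA-1 mentioned above.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_hash_name(data: bytes, original_name: str) -> str:
    """Build '<sha256 hex digest><extension>' from the file content.

    Hypothetical helper: the extension is taken from the uploaded
    filename and lowercased so 'a.PDF' and 'a.pdf' map to the same key.
    """
    digest = hashlib.sha256(data).hexdigest()
    return digest + Path(original_name).suffix.lower()


def store_once(upload_dir: Path, data: bytes, original_name: str) -> tuple[Path, bool]:
    """Store data under its content hash, skipping the write for duplicates.

    Returns (path, created), where created is False if identical content
    was already stored.
    """
    upload_dir.mkdir(parents=True, exist_ok=True)
    target = upload_dir / content_hash_name(data, original_name)
    if target.exists():
        return target, False

    # Write to a temporary file in the same directory, then rename it,
    # so a partially written file is never visible under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=upload_dir)
    with os.fdopen(fd, "wb") as tmp_file:
        tmp_file.write(data)
    os.replace(tmp_path, target)
    return target, True
```

An S3 version would presumably follow the same shape, with the existence check done against the bucket (for instance with boto3's `head_object`) before uploading. The original filename would still have to be kept somewhere else, such as the file's database record, since it no longer appears on storage.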
From my understanding of the code, this would not affect the rest of the file management flow.
Let me know what you think of this; I'll gladly work on implementing this feature.
Alternatives Considered
No response
Additional Context
No response
@tkg61 commented on GitHub (Apr 14, 2025):
The only thing to think about is document cleanup later. If one document is serving multiple users or knowledge collections and things need to be cleaned up, it would take some extra steps to ensure the files aren't in use somewhere else.
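One way to picture those extra steps is a small reference tracker: the stored file is only deleted once nothing points at it anymore. A minimal in-memory sketch, assuming a hypothetical `BlobRefs` class (a real implementation would more likely be a database table or query; none of these names come from Open WebUI):

```python
from collections import defaultdict


class BlobRefs:
    """Hypothetical tracker of which owners reference which stored file."""

    def __init__(self) -> None:
        # Maps a storage key (e.g. the hashed filename) to the set of
        # owners (users, knowledge collections, ...) that still use it.
        self._owners: dict[str, set[str]] = defaultdict(set)

    def add(self, blob_key: str, owner_id: str) -> None:
        """Record that owner_id uses the file stored under blob_key."""
        self._owners[blob_key].add(owner_id)

    def release(self, blob_key: str, owner_id: str) -> bool:
        """Drop one reference; return True only when the last one is gone.

        A True result means the caller may delete the file from storage.
        Unknown keys return False, since there is nothing tracked to delete.
        """
        owners = self._owners.get(blob_key)
        if owners is None:
            return False
        owners.discard(owner_id)
        if owners:
            return False
        del self._owners[blob_key]
        return True
```

With this shape, deleting a user's file would call `release(...)` first and only remove the object from local disk or S3 when it returns `True`.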
@bleriot14 commented on GitHub (Apr 14, 2025):
This is extremely necessary for a large portion of users.
@tjbck commented on GitHub (Apr 28, 2025):
Generally sounds good, PR welcome!