feat: Avoid duplicate files on storage backend #4839

Closed
opened 2025-11-11 16:04:22 -06:00 by GiteaMirror · 3 comments

Originally created by @jcsaaddupuy on GitHub (Apr 14, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

Uploading a file twice ends up creating the same file twice on the storage backend.

Desired Solution you'd like

One solution could be to use the file's content hash (MD5/SHA-1) as the filename, keeping the extension, and to upload the file only if it does not already exist.

For example (quick and dirty, could be factored out), for the local and S3 providers:

```python
class LocalStorageProvider(StorageProvider):
    @staticmethod
    def upload_file(file: BinaryIO, filename: str) -> Tuple[bytes, str]:
        contents = file.read()

        if not contents:
            raise ValueError(ERROR_MESSAGES.EMPTY_CONTENT)

        # Override the original filename with a predictable name based on the file contents
        content_sha1 = hashlib.sha1(contents).hexdigest()
        _, ext = os.path.splitext(filename)  # Keep the file extension, if any
        filename = f"{content_sha1}{ext}"

        file_path = f"{UPLOAD_DIR}/{filename}"
        # Only write the file if it is not already present (same hash => same content)
        if not os.path.exists(file_path):
            with open(file_path, "wb") as f:
                f.write(contents)
        return contents, file_path


class S3StorageProvider(StorageProvider):
    # ...
    def upload_file(self, file: BinaryIO, filename: str) -> Tuple[bytes, str]:
        """Handles uploading of the file to S3 storage."""
        contents, file_path = LocalStorageProvider.upload_file(file, filename)
        try:
            # Override the original filename with a predictable name based on the file contents
            sha1 = hashlib.sha1(contents).hexdigest()
            _, ext = os.path.splitext(filename)  # Keep the file extension, if any
            filename = f"{sha1}{ext}"

            s3_key = os.path.join(self.key_prefix, filename)

            self.s3_client.upload_file(file_path, self.bucket_name, s3_key)

            return (
                contents,
                "s3://" + self.bucket_name + "/" + s3_key,
            )
        except ClientError as e:
            raise RuntimeError(f"Error uploading file to S3: {e}")
```

A direct benefit would be avoiding filling up the instance's hard drive and wasting S3/GCS storage, since only non-duplicate files would be stored.

This might make maintenance harder, since we would lose the original filename on the storage backend, but that may be a non-issue (it would not be in my use case).
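If losing the original filename matters, one option (purely illustrative, not open-webui's actual schema) is to derive the content-addressed name as above and keep a separate mapping from stored name back to the user-facing name, e.g. in a sidecar index:

```python
import hashlib
import json
import os
from pathlib import Path


def content_addressed_name(contents: bytes, original_filename: str) -> str:
    """Derive a predictable filename from the file contents, keeping the extension."""
    sha1 = hashlib.sha1(contents).hexdigest()
    _, ext = os.path.splitext(original_filename)
    return f"{sha1}{ext}"


def record_original_name(upload_dir: str, stored_name: str, original_filename: str) -> None:
    """Remember the user-facing filename in a JSON index (hypothetical helper)."""
    index_path = Path(upload_dir) / "filenames.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    index[stored_name] = original_filename
    index_path.write_text(json.dumps(index, indent=2))
```

In practice the mapping would more likely live in the existing file table in the database, which already stores per-upload metadata.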

From my understanding of the code, this would not affect the rest of the file management flow.

Let me know what you think of this; I'll gladly work on implementing this feature.

Alternatives Considered

No response

Additional Context

No response


@tkg61 commented on GitHub (Apr 14, 2025):

The only thing to think about is document clean up later. If one document is servicing multiple users or knowledge collections and things need to get cleaned up, it would be some extra steps to clean up the data to ensure the files aren't in use somewhere else.
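The cleanup concern above amounts to reference counting: a content-addressed file may only be deleted once no record (user upload, knowledge collection, etc.) still points at it. A minimal sketch, assuming a hypothetical `referenced_paths` list standing in for a database query over the file table:

```python
import os
from collections import Counter


def cleanup_orphans(upload_dir: str, referenced_paths: list) -> list:
    """Remove files in upload_dir that no record references any longer.

    `referenced_paths` stands in for a DB query listing every path still
    attached to a user or knowledge collection (hypothetical schema).
    """
    refcount = Counter(referenced_paths)
    removed = []
    for name in sorted(os.listdir(upload_dir)):
        path = os.path.join(upload_dir, name)
        if refcount[path] == 0 and os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

With this approach, deleting a document for one user only deletes the backing file when the hash is no longer referenced anywhere else.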


@bleriot14 commented on GitHub (Apr 14, 2025):

This is extremely necessary for a large portion of users.


@tjbck commented on GitHub (Apr 28, 2025):

Generally sounds good, PR welcome!

Reference: github-starred/open-webui#4839