mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-07 11:28:35 -05:00
[GH-ISSUE #5763] enh: allow to use S3 for uploaded files #29644
Originally created by @hongbo-miao on GitHub (Sep 27, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/5763
Is your feature request related to a problem? Please describe.
@tjbck clarified Postgres can be used for metadata (Open WebUI config and user chat history) at https://github.com/open-webui/helm-charts/issues/83#issuecomment-2379241585 except for user uploaded files.
Describe the solution you'd like
It would be great to support using S3 for uploaded files. Thanks! ☺️
Describe alternatives you've considered
Use EBS or EFS for uploaded files.
Additional context
None
@tjbck commented on GitHub (Sep 27, 2024):
PR Welcome!
@ZhangChaoWN commented on GitHub (Sep 30, 2024):
I'm willing to contribute to this feature.
@DucNgn commented on GitHub (Oct 8, 2024):
@ZhangChaoWN
Thanks for taking this on! I'm interested in contributing to this feature as well.
Lmk if you need help finishing it!
@ZhangChaoWN commented on GitHub (Oct 12, 2024):
@DucNgn, I'm super excited that you're interested in helping out on this feature!
Here’s a quick overview of what's been done so far and what still needs attention.
Completed:
TODO:
I have forked this project and pushed my code changes. If you're interested in collaborating on the coding work, feel free to merge my code or ask me to merge yours. Feel free to point out any mistakes or suggest ways to improve the feature. If you have any ideas for additional tasks, please feel free to share them as well.
@ZhangChaoWN commented on GitHub (Oct 13, 2024):
Squashed commits and rebased onto the latest main branch in the forked repo
@LeoLiuYan commented on GitHub (Oct 14, 2024):
The content of the uploaded file will be indexed in the vector database; is it still necessary to upload it to S3? @tjbck @ZhangChaoWN
@tjbck commented on GitHub (Oct 21, 2024):
Basic S3 storage support has been added to the development branch, and everything should function as expected, except for image and audio cache handling. Testing is encouraged, and additional pull requests to extend S3 support are welcome.
@gmemstr commented on GitHub (Oct 25, 2024):
Looks like the current implementation doesn't quite work.
Testing with Cloudflare R2's S3 API: the filenames are present, but with a zero size. I think it's because `.read()` is being called multiple times? https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
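The suspected failure mode is easy to reproduce with an in-memory file: after the first `.read()` the cursor sits at end-of-file, so a second `.read()` returns empty bytes, and an upload made from that second read produces a zero-size object. A minimal illustration (not the actual Open WebUI code):

```python
import io

f = io.BytesIO(b"hello world")

first = f.read()   # consumes the stream; cursor is now at EOF
second = f.read()  # nothing left to read

assert first == b"hello world"
assert second == b""  # uploading `second` would create a 0-byte object

# Fix: rewind before reading again, or read once and reuse the bytes.
f.seek(0)
assert f.read() == b"hello world"
```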
@tjbck commented on GitHub (Oct 26, 2024):
@gmemstr Good catch, should be addressed in dev! More testing wanted here!
@nickfixit commented on GitHub (Oct 27, 2024):
What about mounting a JuiceFS filesystem?
@gmemstr commented on GitHub (Oct 31, 2024):
S3 handling still seems to be broken: the file now seems to upload properly, but it is not retrieved properly.
Log
@davizucon commented on GitHub (Nov 1, 2024):
Hey ! well done, I'm looking forward this feature :)
So, taking a look at the code, I started fixing it, but what do you think if we follow the same approach as for the vector DBs?
Where we have a common "CRUD" contract, and then in the config env we choose which implementation provider ("local" or "s3") should be instantiated. What do you think?
@tjbck commented on GitHub (Nov 3, 2024):
@davizucon external vector dbs are already supported, unsure what you meant here.
@tjbck commented on GitHub (Nov 3, 2024):
@gmemstr should be fixed on dev!
@davizucon commented on GitHub (Nov 3, 2024):
@tjbck , thanks for reply.
This is how the classes and functions are organized: instead of all functions branching with if/else, you could make specialist classes that each deal with a specific storage implementation. I mentioned the vector DBs just as an example of that organization; they follow this design.
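The organization described above can be sketched as an abstract base class with one subclass per backend, selected from a config value. All class, method, and config names below are illustrative, not the actual Open WebUI code:

```python
from abc import ABC, abstractmethod


class StorageProvider(ABC):
    """Common CRUD contract each storage backend implements."""

    @abstractmethod
    def upload_file(self, contents: bytes, filename: str) -> str: ...

    @abstractmethod
    def get_file(self, file_path: str) -> bytes: ...

    @abstractmethod
    def delete_file(self, file_path: str) -> None: ...


class LocalStorageProvider(StorageProvider):
    def upload_file(self, contents, filename):
        return f"local:{filename}"  # placeholder body

    def get_file(self, file_path):
        return b""  # placeholder body

    def delete_file(self, file_path):
        pass


class S3StorageProvider(StorageProvider):
    def upload_file(self, contents, filename):
        return f"s3://bucket/{filename}"  # placeholder body

    def get_file(self, file_path):
        return b""  # placeholder body

    def delete_file(self, file_path):
        pass


def get_storage_provider(name: str) -> StorageProvider:
    """Pick the implementation from a config value, e.g. an env var."""
    return {"local": LocalStorageProvider, "s3": S3StorageProvider}[name]()
```

The callers only see the `StorageProvider` interface, so adding another backend (GCS, Azure Blob) means adding one subclass and one dictionary entry.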
@CallumJHays commented on GitHub (Nov 6, 2024):
Hi all, I'm also looking forward to this feature. I agree with @davizucon idea to organise them into separate classes:
- `StorageProvider(ABC)`
- `FileSystemStorageProvider(StorageProvider)`
- `S3StorageProvider(StorageProvider)`
Also noticed that with the existing implementation there is no pagination on `list_objects` calls, which might cause issues after 1000 uploads. There may be some frontend considerations for such large collections that need further thought. Looking to deploy with a stable release relatively soon, so I'm happy to help put together a PR if it would be accepted 😄
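The pagination concern can be sketched with a boto3-style paginator, which transparently follows continuation tokens past the 1000-object page limit. The helper below is illustrative (not Open WebUI's actual code) and takes the client as a parameter; in practice it would be `boto3.client("s3")`:

```python
def list_all_keys(s3_client, bucket: str, prefix: str = "") -> list[str]:
    """Collect every key under a prefix, following pagination.

    A bare list_objects_v2 call returns at most 1000 objects per
    page; the paginator keeps requesting pages until the listing
    is exhausted.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```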
@tjbck commented on GitHub (Nov 6, 2024):
@CallumJHays Feel free to make an initial PR, I'll provide guidance/comment where needed!
@weixu365 commented on GitHub (Nov 7, 2024):
Hi @tjbck, I created a PR #6773 to fix the S3 bug and also split the files as mentioned by @CallumJHays and @davizucon
Other changes:
Please let me know your opinion, thanks
@tjbck commented on GitHub (Nov 7, 2024):
@weixu365 I'd appreciate if you could split them up to atomic PRs.
@weixu365 commented on GitHub (Nov 7, 2024):
Hi @tjbck , If you are fine with splitting the file into multiple files, then the following three things need to be in a single PR:
The following two can be separate PRs:
Please let me know your opinion.
@weixu365 commented on GitHub (Nov 7, 2024):
I can also create PRs in the following order:
The 2nd and 3rd are slightly coupled, so it would be easier to merge them into one PR.
@weixu365 commented on GitHub (Nov 11, 2024):
Hi @tjbck, is there any update on the PR? I can split it into a couple of small PRs, but I need your guidance on which of the 6 atomic changes above would be accepted.
@Mavial commented on GitHub (Nov 15, 2024):
Hey @tjbck, does this mean that the S3 bugfixes are delayed until v5?
@lewis-ing commented on GitHub (Nov 19, 2024):
So, I uploaded a file to S3, but the console printed this error:
```
INFO  [open_webui.apps.webui.routers.files] file.content_type: application/pdf
ERROR [open_webui.apps.retrieval.main] list index out of range
Traceback (most recent call last):
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\apps\retrieval\main.py", line 835, in process_file
    file_path = Storage.get_file(file_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\storage\provider.py", line 144, in get_file
    return self._get_file_from_s3(file_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\storage\provider.py", line 71, in _get_file_from_s3
    bucket_name, key = file_path.split("//")[1].split("/")
                       ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
ERROR [open_webui.apps.webui.routers.files] 400: list index out of range
Traceback (most recent call last):
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\apps\retrieval\main.py", line 835, in process_file
    file_path = Storage.get_file(file_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\storage\provider.py", line 144, in get_file
    return self._get_file_from_s3(file_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\storage\provider.py", line 71, in _get_file_from_s3
    bucket_name, key = file_path.split("//")[1].split("/")
                       ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\apps\webui\routers\files.py", line 71, in upload_file
    process_file(ProcessFileForm(file_id=id))
  File "D:\sourcecode\soft-factory\ai\open-webui\backend\open_webui\apps\retrieval\main.py", line 903, in process_file
    raise HTTPException(
fastapi.exceptions.HTTPException: 400: list index out of range
ERROR [open_webui.apps.webui.routers.files] Error processing file: 54add99d-1754-431d-8935-8ecd82794aae
INFO:     127.0.0.1:49411 - "POST /api/v1/files/ HTTP/1.1" 200 OK
```
@lewis-ing commented on GitHub (Nov 19, 2024):
So I figured it out and modified the provider.py file.
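The traceback above points at `file_path.split("//")[1].split("/")`, which raises an IndexError for any path without a `//` (e.g. a plain local path) and a ValueError whenever the object key itself contains slashes. A hedged sketch of a more defensive parse (the helper name is hypothetical, not the actual fix in provider.py):

```python
def split_s3_path(file_path: str) -> tuple[str, str]:
    """Split 's3://bucket/key/with/slashes' into (bucket, key).

    The one-liner
        bucket_name, key = file_path.split("//")[1].split("/")
    fails twice: a path without '//' has no element [1]
    (IndexError), and a key containing '/' unpacks into more
    than two parts (ValueError).
    """
    if not file_path.startswith("s3://"):
        raise ValueError(f"not an S3 path: {file_path!r}")
    # partition splits only on the first '/', so the key keeps
    # its internal slashes intact
    bucket, _, key = file_path[len("s3://"):].partition("/")
    return bucket, key
```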
@RobinBially commented on GitHub (Nov 19, 2024):
#7040
@freeload101 commented on GitHub (Dec 23, 2024):
Why not just use one of the other million ways to mount S3 to a path, like s3fs... or something that does not use FUSE.
@dallenpyrah commented on GitHub (Jan 8, 2025):
Really looking forward to this feature; our team wants to upload .txt files of our codebase to S3 on merges so we can chat with our entire system in OpenWebUI.
@rragundez commented on GitHub (Jan 15, 2025):
For splitting into classes without modifying the logic in the code: #8580
Once that is merged (after any modifications), I will add the GCS storage provider, which is actually my reason for contributing, since I need it and don't want to hack my way into it.
@tjbck commented on GitHub (Jan 17, 2025):
Testing wanted with the latest dev! Might've resolved a lot of issues you guys were facing!
@rragundez commented on GitHub (Jan 19, 2025):
Hi @RobinBially @lewis-ing, I added the PR with the tests and the refactoring of the Storage classes. The logic seems correct to me, so could it be that you were using local storage and then changed to S3 storage while using the same DB or Docker volume? If so, there might indeed be issues, because the file is saved to the database with an ID and a property called path, which determines whether it is local or S3. So if it was saved the first time with one storage provider and the provider was then changed, there would be a mismatch: the path in the database would point to local while the function being called to load the file would be the S3 one (for example).
If you still see the error when starting from a clean environment (DB and Docker volume), please post it here with the scenario, and I can add a test for it and try to solve it.
@hongbo-miao since you opened this Issue can you also double check? thanks
@rragundez commented on GitHub (Jan 19, 2025):
I do think one decision remains: when using an external provider (e.g. S3, GCS, Azure Blob, etc.), should the application still interact with the local filesystem? Right now, when using S3, files are saved to both locations on upload and download.
I think this one is for you to decide @tjbck. I can help with the implementation.
My 2 cents: from an application, deployment, and scaling point of view, the application should never store data itself, but only use external sources, or at least offer the possibility to do so. There are known issues when the application holds data: data deletion on app failure or redeployment, a bloated filesystem, no single source of truth, and different behavior for users depending on which VM/pod their request lands on (distributed deployment).
There are methods to mitigate this, such as the infrastructure mounting a single shared filesystem into each VM/pod, but that solution is infrastructure-side rather than application-side, and I guess it would shrink the audience that could use this feature properly, since infrastructure is a separate capability.
Now, it is true that other issues might affect the connection to the external storage, such as internet connectivity or firewalls. To be somewhat resilient to this, there could be a fallback mechanism: if a file cannot be retrieved from external storage, try to retrieve it locally. For me, that argument is not enough to outweigh the one in favor of external-only storage, unless the user explicitly indicates otherwise. Perhaps, even though it adds complexity, the fallback of also storing files locally could be triggered only by a flag the user sets explicitly.
In conclusion, I would go for storing only in external storage and then see how that goes.
@antoinebou12 commented on GitHub (Jan 21, 2025):
Can you also add MinIO support?
@tjbck commented on GitHub (Jan 22, 2025):
The refactor has been merged to main, testing wanted here!
@Mavial commented on GitHub (Jan 25, 2025):
I've been using it in prod for 40 users since it was pushed to dev and have found no issues with the S3 provider.
@rragundez commented on GitHub (Jan 25, 2025):
Thanks for the feedback. @tjbck should we close the issue?
@spammenotinoz commented on GitHub (Jan 28, 2025):
Agree entirely.
@JoeChen2me commented on GitHub (Feb 15, 2025):
I have identified a problem: S3 storage is limited to AWS S3 and does not support other compatible services such as Cloudflare R2, particularly concerning the configuration of the S3_REGION_NAME environment variable.
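For S3-compatible services, the usual fix is to pass an explicit `endpoint_url` to the boto3 client; Cloudflare R2, for instance, documents the literal region `auto`. A minimal sketch of building the client configuration from environment variables, where the variable names are illustrative and not necessarily the ones Open WebUI uses:

```python
import os


def s3_client_kwargs() -> dict:
    """Build keyword arguments for boto3.client("s3") from env vars,
    adding endpoint_url so S3-compatible services (Cloudflare R2,
    MinIO, IONOS) can be targeted instead of AWS."""
    kwargs = {
        # R2 expects the literal region "auto"; AWS wants a real region.
        "region_name": os.environ.get("S3_REGION_NAME", "auto"),
        "aws_access_key_id": os.environ.get("S3_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.environ.get("S3_SECRET_ACCESS_KEY"),
    }
    endpoint = os.environ.get("S3_ENDPOINT_URL")
    if endpoint:
        # e.g. https://<account_id>.r2.cloudflarestorage.com for R2
        kwargs["endpoint_url"] = endpoint
    return kwargs
```

The resulting dict would be splatted into `boto3.client("s3", **s3_client_kwargs())`; without `endpoint_url`, boto3 always targets AWS regardless of credentials.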
@Mavial commented on GitHub (Feb 15, 2025):
I've been using S3 on IONOS with no problems. Please elaborate on your exact problem?
@JoeChen2me commented on GitHub (Feb 15, 2025):
You can view the issue I created: ISSUE
Also, have you updated to the latest version (0.5.12)?
@blegry commented on GitHub (May 19, 2025):
Agreed.
I conducted some behavior tests. My environment: OWUI, external S3, external vector DB, external orcization:
--> In both cases, the system continues to function properly. I can chat over the chunks and download files that were "locally deleted." This means that S3 is sufficient.
So, what is the purpose of this? Are there any scenarios where it breaks something? When downloading a file, the files on the container reappear in ./data/uploads.