[GH-ISSUE #16527] issue: Limited chunk size with S3 Vectors #17944
Originally created by @Cqban on GitHub (Aug 12, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/16527
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.6.22
Ollama Version (if applicable)
No response
Operating System
Debian 12
Browser (if applicable)
No response
Confirmation
Expected Behavior
Documents should upload successfully.
Actual Behavior
I get an error if the chunk size is too big (it works with a chunk size of 300 and an overlap of 30, but not with 400 and an overlap of 40).
Steps to Reproduce
```yaml
name: open-webui
services:
  open-webui:
    container_name: open-webui
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - 3000:8080
    volumes:
      - open-webui:/app/backend/data
    restart: unless-stopped
    environment:
      VECTOR_DB: s3vector
      S3_VECTOR_BUCKET_NAME: Hidden
      AWS_ACCESS_KEY_ID: Hidden
      AWS_SECRET_ACCESS_KEY: Hidden
      AWS_REGION: Hidden
      S3_VECTOR_REGION: Hidden

volumes:
  open-webui:
    external: true
    name: open-webui
```
Logs & Screenshots
2025-08-12 13:35:14.822 | ERROR | open_webui.routers.files:upload_file:192 - 400: An error occurred (ValidationException) when calling the PutVectors operation: Invalid record for key '5f2fbbd9-d7df-4ec2-ac9d-6c521c6a9e04': Filterable metadata must have at most 2048 bytes
Traceback (most recent call last):
File "/app/backend/open_webui/routers/retrieval.py", line 1509, in process_file
File "/app/backend/open_webui/routers/retrieval.py", line 1481, in process_file
File "/app/backend/open_webui/routers/retrieval.py", line 1323, in save_docs_to_vector_db
File "/app/backend/open_webui/routers/retrieval.py", line 1315, in save_docs_to_vector_db
File "/app/backend/open_webui/retrieval/vector/dbs/s3vector.py", line 201, in insert
File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 602, in _api_call
File "/usr/local/lib/python3.11/site-packages/botocore/context.py", line 123, in wrapper
File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 1078, in _make_api_call
botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the PutVectors operation: Invalid record for key '5f2fbbd9-d7df-4ec2-ac9d-6c521c6a9e04': Filterable metadata must have at most 2048 bytes
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 967, in run
File "/app/backend/open_webui/routers/retrieval.py", line 1526, in process_file
fastapi.exceptions.HTTPException: 400: An error occurred (ValidationException) when calling the PutVectors operation: Invalid record for key '5f2fbbd9-d7df-4ec2-ac9d-6c521c6a9e04': Filterable metadata must have at most 2048 bytes
2025-08-12 13:35:14.932 | ERROR | open_webui.routers.files:upload_file:193 - Error processing file: 041ac3db-5e5a-4415-9163-8f7bd7a4a66d
Additional Information
It might just be because of S3 Vectors specifications and limitations, but if it can be enhanced then it's worth looking at, I guess.
@westbrook-ai commented on GitHub (Aug 12, 2025):
Thanks @Cqban, I'll take a closer look at this. I've never used Tika before; do you happen to know if it adds/extends existing metadata on vectors? I can do some testing and research to figure it out if you're not sure.
@westbrook-ai commented on GitHub (Aug 13, 2025):
Confirmed that filterable metadata per vector is capped at 2 KB per the S3 Vectors limitations docs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-limitations.html
I never ran into that issue during my testing using the Open WebUI defaults for chunk size and chunk overlap, but I'll still take a closer look as soon as I can to try to understand what the total size of the metadata per vector was for me using those defaults.
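For reference, a quick way to see how close a chunk's record gets to that 2 KB cap is to measure the serialized metadata per vector before PutVectors is called. A minimal sketch; the keys below are illustrative stand-ins, not the exact schema Open WebUI writes:

```python
import json

LIMIT_BYTES = 2048  # S3 Vectors caps filterable metadata at 2 KB per vector

def filterable_metadata_size(metadata: dict) -> int:
    # Approximate the stored size by serializing the metadata to JSON bytes.
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))

# Illustrative chunk record: once the chunk text itself is stored as
# filterable metadata, a larger chunk size pushes the total past 2048 bytes
# even though the other fields stay small.
metadata = {
    "text": "x" * 1900,  # stand-in for a larger chunk's content
    "file_id": "041ac3db-5e5a-4415-9163-8f7bd7a4a66d",
    "name": "example.pdf",
    "source": "example.pdf",
}

size = filterable_metadata_size(metadata)
print(f"{size} bytes -> {'over limit' if size > LIMIT_BYTES else 'ok'}")
```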
@joshrenshaw12 commented on GitHub (Aug 13, 2025):
I've seen similar errors like the one below:
From my research, this is due to the number of vectors being sent in a single PutVectors operation: https://www.perplexity.ai/search/botocore-errorfactory-validati-ZmuvRNkxQeal7FvCWXHTNQ
From the docs mentioned above, we know the API is limited to 500 vectors per PutVectors API call.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_PutVectors.html
This aligns with the behavior I have seen in testing, where a smaller file with fewer vectors uploads fine but a larger one with more vectors does not.
Certainly a different issue to the one mentioned above, but it shows the need for some batching logic in backend/open_webui/retrieval/vector/dbs/s3vector.py to match the limitations of the AWS API.
@westbrook-ai commented on GitHub (Aug 20, 2025):
@joshrenshaw12 your issue popped up when you tried to upload 500+ files to a collection at once, right? I think I have a fix for that now; I can ping you when it ends up in dev if so.
@westbrook-ai commented on GitHub (Aug 20, 2025):
@Cqban how big was the file you hit this issue on?
@joshrenshaw12 commented on GitHub (Aug 20, 2025):
not 500+ individual files, just a single large file that was chunked into 500+ vector chunks that all were trying to get uploaded at once
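For what it's worth, the batching logic described above can be as simple as slicing the record list before each call. A minimal sketch, assuming the boto3 s3vectors client and its put_vectors operation:

```python
import boto3

MAX_VECTORS_PER_PUT = 500  # documented PutVectors limit per request

def put_vectors_batched(client, bucket: str, index: str, vectors: list[dict]) -> None:
    # Slice the records so no single PutVectors call exceeds the 500-vector
    # cap; a file chunked into 500+ vectors then uploads over several calls.
    for start in range(0, len(vectors), MAX_VECTORS_PER_PUT):
        client.put_vectors(
            vectorBucketName=bucket,
            indexName=index,
            vectors=vectors[start : start + MAX_VECTORS_PER_PUT],
        )

# Usage sketch (bucket and index names are hypothetical):
# client = boto3.client("s3vectors")
# put_vectors_batched(client, "my-vector-bucket", "open-webui", records)
```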
@rromanchuk commented on GitHub (Oct 28, 2025):
I'm getting the same thing, but I'm confused about how this is related to file size/chunk/overlap. Isn't this a tika-config.xml issue, like re-configuring maxTotalEstimatedSize? Or maybe an issue between filterable metadata and non-filterable metadata? Also getting this.
Is there a log level or a lazy way to tail the S3 put at open_webui.retrieval.vector.dbs.s3vector:insert:218? I'm trying to avoid a custom configuration for Tika because it's annoying when testing another provider. Maybe there's a simple way to sanitize/filter from the open_webui.retrieval.vector.dbs.s3vector provider side of things.
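One lazy option for tailing those calls, without touching Tika, is to raise botocore's own log level inside the container; a sketch, not a built-in Open WebUI setting:

```python
import logging

# botocore logs each request at DEBUG, including the PutVectors payloads the
# s3vector backend sends, so this surfaces the per-vector metadata on the wire.
logging.basicConfig(level=logging.INFO)
logging.getLogger("botocore").setLevel(logging.DEBUG)
```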
@rromanchuk commented on GitHub (Oct 28, 2025):
Weird, it looks like this is already being done, including max key management.
https://github.com/open-webui/open-webui/blob/main/backend/open_webui/retrieval/vector/dbs/s3vector.py#L74
I should probably reset
@rromanchuk commented on GitHub (Oct 28, 2025):
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-getting-started.html
This is confusing. Are they saying source_text is a reserved key that makes it non-filterable and everything else is filterable? So every other key/value pair is filterable, like metadata["text"] = item["text"]? This says absolutely nothing about it either: https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_PutInputVector.html
ignore me / update: ohh, the keys are assigned on index creation ☹️
It doesn't look like this is being assigned though.
Full signature
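For illustration, declaring non-filterable keys at index creation could look something like the sketch below, assuming the boto3 s3vectors client; the bucket name, index name, and dimension are hypothetical:

```python
import boto3

client = boto3.client("s3vectors")

# Declaring "text" non-filterable at index creation keeps the chunk content
# out of the 2 KB filterable-metadata budget entirely.
client.create_index(
    vectorBucketName="my-vector-bucket",  # hypothetical bucket name
    indexName="open-webui",               # hypothetical index name
    dataType="float32",
    dimension=384,                        # must match the embedding model
    distanceMetric="cosine",
    metadataConfiguration={"nonFilterableMetadataKeys": ["text"]},
)
```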
@westbrook-ai commented on GitHub (Oct 29, 2025):
@rromanchuk was marking the text metadata as non-filterable enough to solve the issue for you? I may have missed that, but I can definitely test that out and help get it merged to main if so.