mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[GH-ISSUE #17872] issue: Each document added to a knowledge base creates a new, redundant ChromaDB collection, causing excessive disk usage. #33953
Originally created by @bshebl on GitHub (Sep 28, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/17872
Installation Method
Docker
Open WebUI Version
0.6.31
Ollama Version (if applicable)
0.12.3
Operating System
Ubuntu 22.04.5 LTS
Browser (if applicable)
No response
Expected Behavior
When adding multiple documents to a single knowledge base, the system should add the new vectorized data to the existing ChromaDB collection. The on-disk size of the vector database should grow incrementally, in proportion to the size of the new document's data.
Actual Behavior
When a new document is added to a knowledge base, the system incorrectly creates a brand-new, separate ChromaDB collection in the vector_db directory instead of adding the data to the existing collection.
Each of these new collections has a massive fixed overhead of approximately 100 MB, regardless of the document's size. This leads to rapid and unsustainable disk space consumption. For example, adding a second small text file consumes an additional ~100 MB of disk space.
Note on an observed anomaly: In one initial test, the first document upload inexplicably created two 100 MB collections. While subsequent tests have not replicated this duplication, I am noting it in case it points to a deeper race condition or initialization issue. The primary bug, however, is the creation of a new collection for each subsequent file.
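One way to quantify the per-collection overhead described above is to total the file sizes under each subdirectory of vector_db. The sketch below is only illustrative: the function name directory_sizes is hypothetical, it assumes each collection is stored as its own subdirectory, and it sums apparent file sizes, which can differ from the block usage that du reports.

```python
import os
from pathlib import Path


def directory_sizes(root):
    """Return {subdirectory name: total bytes of the files beneath it}, largest first.

    Files sitting directly in `root` are ignored; only subdirectories
    (assumed here to be one per collection) are measured.
    """
    sizes = {}
    for entry in Path(root).iterdir():
        if not entry.is_dir():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                total += os.path.getsize(os.path.join(dirpath, name))
        sizes[entry.name] = total
    # Sort by size, descending, so the largest directories come first.
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

Running it once after each upload and comparing the two results would show whether a new subdirectory appeared and how large it is.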
Steps to Reproduce
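1. Start with a clean installation in which the vector_db volume is empty.
2. Configure the embedding model qwen3-embedding:4b-q8_0.
3. Upload a first document to a knowledge base (file_A.txt, a simple text file).
4. Check disk usage: du -ah /var/lib/docker/volumes/open-webui/_data/vector_db | sort -rh | head -n 15
5. Upload a second document to the same knowledge base (file_B.txt).
6. Check disk usage again: du -ah /var/lib/docker/volumes/open-webui/_data/vector_db | sort -rh | head -n 15
Logs & Screenshots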
Here are the logs from the du command with masked UUIDs, demonstrating the issue.
After uploading the FIRST file:
After uploading the SECOND file:
Additional Information
The system appears to be calling chroma.create_collection() for each new document upload, rather than using chroma.get_or_create_collection(name="...") and then collection.add() to append documents to the correct, existing collection.
Embedding model: qwen3-embedding:4b-q8_0
@tjbck commented on GitHub (Sep 29, 2025):
PLEASE look for existing issues/discussions. This is intended behaviour to enable individual file attachment.
@bshebl commented on GitHub (Sep 29, 2025):
Thank you for the clarification! I understand now that the goal is to enable isolated queries for individual files.
My feedback, then, is that the current implementation of creating a new ChromaDB collection for each file has an extremely high disk-space overhead (~150 MB per file). This makes the feature unfeasible for users who need to attach more than a handful of documents.
Would the team be open to exploring a more efficient design? The standard industry approach is to use a single collection and apply metadata filtering. For example, all chunks from file_A.pdf would be stored with a metadata tag like {'source_file': 'file_A.pdf'}. A query could then be filtered to only look at chunks with that specific tag. This would achieve the same goal of file isolation without the massive storage cost.
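The single-collection idea can be sketched with a toy in-memory store. This is not ChromaDB and not Open WebUI code: SingleCollection, Chunk, and the two-dimensional embeddings are hypothetical stand-ins chosen so the example runs with the standard library alone. In ChromaDB itself, the corresponding pieces would presumably be the metadatas argument of collection.add() and the where filter of collection.query().

```python
import math
from dataclasses import dataclass, field
from typing import Optional


def cosine_distance(a, b):
    """Cosine distance between two vectors (0.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


@dataclass
class Chunk:
    text: str
    embedding: list
    metadata: dict


@dataclass
class SingleCollection:
    """Toy stand-in for one shared vector collection holding every file's chunks."""
    chunks: list = field(default_factory=list)

    def add(self, text, embedding, metadata):
        self.chunks.append(Chunk(text, embedding, metadata))

    def query(self, embedding, n_results=3, where: Optional[dict] = None):
        """Return the nearest chunks, optionally limited by exact-match metadata."""
        candidates = [
            c for c in self.chunks
            if where is None
            or all(c.metadata.get(k) == v for k, v in where.items())
        ]
        candidates.sort(key=lambda c: cosine_distance(embedding, c.embedding))
        return candidates[:n_results]


# All chunks live in one collection; the tag records which file they came from.
collection = SingleCollection()
collection.add("alpha from A", [1.0, 0.0], {"source_file": "file_A.pdf"})
collection.add("beta from B", [0.9, 0.1], {"source_file": "file_B.pdf"})
collection.add("gamma from A", [0.0, 1.0], {"source_file": "file_A.pdf"})

# Per-file isolation is a filter at query time, not a separate collection.
only_a = collection.query([1.0, 0.0], where={"source_file": "file_A.pdf"})
everything = collection.query([1.0, 0.0])
```

The filtered query only ever sees chunks tagged with file_A.pdf, while the unfiltered query searches the whole knowledge base, so both use cases share one set of on-disk structures.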
It looks like the most direct workaround is to manually combine the documents into a single file before uploading.
Since the system is designed to create a new, inefficient "binder" for every file it receives, the solution is to staple all the pages together first and hand it a single, thick document. This forces Open WebUI to create only one collection, achieving the goal of a unified knowledge base.
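For plain-text documents, the combining step could look like the following minimal sketch. The function name combine_text_files and the "=====" section headers are illustrative choices rather than anything Open WebUI requires, and the approach assumes UTF-8 text files; PDFs or other formats would need a different tool.

```python
from pathlib import Path


def combine_text_files(sources, destination, separator="\n\n"):
    """Concatenate text files into one file, labelling each section with its file name."""
    parts = []
    for source in sources:
        path = Path(source)
        body = path.read_text(encoding="utf-8").strip()
        # Keep the original file name so each passage can still be traced back.
        parts.append(f"===== {path.name} =====\n{body}")
    destination = Path(destination)
    destination.write_text(separator.join(parts) + "\n", encoding="utf-8")
    return destination
```

For example, combine_text_files(["file_A.txt", "file_B.txt"], "combined.txt") writes one file that can be uploaded in place of the two originals, at the cost of losing the ability to attach either file on its own.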