[GH-ISSUE #17872] issue: Each document added to a knowledge base creates a new, redundant ChromaDB collection, causing excessive disk usage. #33953

Closed
opened 2026-04-25 07:50:12 -05:00 by GiteaMirror · 2 comments

Originally created by @bshebl on GitHub (Sep 28, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/17872

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

0.6.31

Ollama Version (if applicable)

0.12.3

Operating System

Ubuntu 22.04.5 LTS

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When adding multiple documents to a single knowledge base, the system should add the new vectorized data to the existing ChromaDB collection. The on-disk size of the vector database should grow incrementally, in proportion to the size of the new document's data.

Actual Behavior

When a new document is added to a knowledge base, the system incorrectly creates a brand-new, separate ChromaDB collection in the vector_db directory instead of adding the data to the existing collection.

Each of these new collections has a massive fixed overhead of approximately 100 MB, regardless of the document's size. This leads to rapid and unsustainable disk space consumption. For example, adding a second small text file consumes an additional ~100 MB of disk space.

Note on an observed anomaly: In one initial test, the first document upload inexplicably created two 100 MB collections. While subsequent tests have not replicated this duplication, I am noting it in case it points to a deeper race condition or initialization issue. The primary bug, however, is the creation of a new collection for each subsequent file.

Steps to Reproduce

  1. Start with a clean Open Web UI installation using Docker. Ensure the vector_db volume is empty.
  2. Log into the Open Web UI instance.
  3. Configure Ollama with an embedding model. For this test, the following was used:
    • Embedding Model: qwen3-embedding:4b-q8_0
    • Chunk Size: 850
    • Chunk Overlap: 150
  4. Navigate to the Documents section in the left-hand menu.
  5. Click New document and upload your first file (e.g., file_A.txt, a simple text file).
  6. Wait for the embedding process to complete.
  7. Check the disk usage of the vector database on the host machine. SSH into the server and run:
    du -ah /var/lib/docker/volumes/open-webui/_data/vector_db | sort -rh | head -n 15
    • Observation 1: The total size will be around ~167 MB, and you will see one large collection directory (with a UUID as its name).
  8. Return to the Open Web UI Documents section.
  9. Click New document again and upload a second, different file (e.g., file_B.txt).
  10. Wait for the embedding process to complete.
  11. Check the disk usage of the vector database again on the host machine:
    du -ah /var/lib/docker/volumes/open-webui/_data/vector_db | sort -rh | head -n 15
    • Observation 2: The total size will have grown by another ~100 MB to ~267 MB. The command output will now show two separate 100 MB collection directories, confirming that a new, redundant collection was created instead of adding to the existing one.
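The size check in steps 7 and 11 can be wrapped in a small helper so the before/after delta is easy to capture. This is a sketch: the `VECTOR_DB` path below is the Docker volume path from the steps above and may differ on your host.

```shell
#!/bin/sh
# Print the total size (in MB) of the vector_db directory.
# VECTOR_DB defaults to the Docker volume path used in the steps above;
# override it if your volume lives elsewhere.
VECTOR_DB="${VECTOR_DB:-/var/lib/docker/volumes/open-webui/_data/vector_db}"

vecdb_size_mb() {
    # du -sm prints "SIZE<TAB>PATH"; keep only the size column.
    du -sm "$1" 2>/dev/null | cut -f1
}

size=$(vecdb_size_mb "$VECTOR_DB")
echo "vector_db size: ${size:-0} MB"
# Re-run after each upload; if the bug is present, the total jumps by
# ~100 MB per file instead of growing with the file's actual data.
```

Running this once after each upload makes the fixed ~100 MB per-file jump obvious without scanning the full `du -ah` listing.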

Logs & Screenshots

Here are the logs from the du command with masked UUIDs, demonstrating the issue.

After uploading the FIRST file:

167M    /var/lib/docker/volumes/open-webui/_data/vector_db
100M    /var/lib/docker/volumes/open-webui/_data/vector_db/[collection-uuid-1]
99M     /var/lib/docker/volumes/open-webui/_data/vector_db/[collection-uuid-1]/data_level0.bin
67M     /var/lib/docker/volumes/open-webui/_data/vector_db/chroma.sqlite3

After uploading the SECOND file:

267M    /var/lib/docker/volumes/open-webui/_data/vector_db
100M    /var/lib/docker/volumes/open-webui/_data/vector_db/[collection-uuid-2]
100M    /var/lib/docker/volumes/open-webui/_data/vector_db/[collection-uuid-1]
99M     /var/lib/docker/volumes/open-webui/_data/vector_db/[collection-uuid-2]/data_level0.bin
99M     /var/lib/docker/volumes/open-webui/_data/vector_db/[collection-uuid-1]/data_level0.bin
67M     /var/lib/docker/volumes/open-webui/_data/vector_db/chroma.sqlite3

Additional Information

  • Root Cause Hypothesis: The application logic appears to be calling a function equivalent to chroma.create_collection() for each new document upload, rather than using chroma.get_or_create_collection(name="...") and then collection.add() to append documents to the correct, existing collection.
  • Impact: This bug makes the knowledge base feature impractical for any use case involving more than a handful of documents, as it can quickly fill server storage.
  • Environment:
    • Open Web UI Version: 0.6.31
    • Deployment: Docker
    • Ollama Version: 0.12.3
    • Embedding Model: qwen3-embedding:4b-q8_0
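The hypothesis above can be illustrated with a minimal in-memory sketch. `FakeChroma` is an invented stand-in that mirrors the semantics of ChromaDB's `create_collection` / `get_or_create_collection`; it is not Open WebUI's actual code and carries none of the real ~100 MB per-collection overhead.

```python
import uuid

# FakeChroma is an invented in-memory mock of ChromaDB's collection API,
# used only to contrast the two call patterns.
class FakeChroma:
    def __init__(self):
        self.collections = {}  # collection name -> list of stored chunks

    def create_collection(self, name):
        # Real ChromaDB errors if the name already exists; naming each
        # collection with a fresh UUID sidesteps that check, so collections
        # silently accumulate one per upload.
        if name in self.collections:
            raise ValueError(f"collection {name!r} already exists")
        self.collections[name] = []
        return self.collections[name]

    def get_or_create_collection(self, name):
        # Reuses the existing collection, so fixed overhead is paid once.
        return self.collections.setdefault(name, [])

# Hypothesised buggy pattern: a fresh UUID-named collection per file.
buggy = FakeChroma()
for doc in ["file_A.txt", "file_B.txt"]:
    buggy.create_collection(str(uuid.uuid4())).append(doc)

# Suggested pattern: one shared collection per knowledge base.
fixed = FakeChroma()
for doc in ["file_A.txt", "file_B.txt"]:
    fixed.get_or_create_collection("knowledge-base-1").append(doc)

print(len(buggy.collections))  # 2 collections for 2 files
print(len(fixed.collections))  # 1 collection for 2 files
```

Under this reading, the two UUID-named directories observed on disk correspond to the two entries in `buggy.collections`.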
GiteaMirror added the bug label 2026-04-25 07:50:12 -05:00

@tjbck commented on GitHub (Sep 29, 2025):

PLEASE look for existing issues/discussions. This is an intended behaviour to enable individual file attachment.


@bshebl commented on GitHub (Sep 29, 2025):

Thank you for the clarification! I understand now that the goal is to enable isolated queries for individual files.

My feedback, then, is that the current implementation of creating a new ChromaDB collection for each file has an extremely high disk-space overhead (~150 MB per file). This makes the feature unfeasible for users who need to attach more than a handful of documents.

Would the team be open to exploring a more efficient design? The standard industry approach is to use a single collection and apply metadata filtering. For example, all chunks from file_A.pdf would be stored with a metadata tag like {'source_file': 'file_A.pdf'}. A query could then be filtered to only look at chunks with that specific tag. This would achieve the same goal of file isolation without the massive storage cost.

Looks like the most direct workaround is to manually combine the documents into a single file before uploading.

Since the system is designed to create a new, inefficient "binder" for every file it receives, the solution is to staple all the pages together first and hand it a single, thick document. This forces Open Web UI to create only one collection, achieving the goal of a unified knowledge base.
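The single-collection-with-metadata design suggested above can be sketched in plain Python. Chunks here are plain dicts and the `query` helper is invented for illustration; in ChromaDB the equivalent per-file isolation would come from `collection.query(..., where={"source_file": ...})` on one shared collection.

```python
# Sketch of metadata filtering within a single shared collection.
# File names and chunk texts are illustrative.
chunks = [
    {"text": "alpha", "metadata": {"source_file": "file_A.pdf"}},
    {"text": "beta",  "metadata": {"source_file": "file_B.pdf"}},
    {"text": "gamma", "metadata": {"source_file": "file_A.pdf"}},
]

def query(chunks, where=None):
    """Return chunks matching every key/value pair in `where` (None = all)."""
    if not where:
        return chunks
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in where.items())]

# Isolated per-file query, as single-file attachment would need:
only_a = query(chunks, where={"source_file": "file_A.pdf"})
print([c["text"] for c in only_a])  # ['alpha', 'gamma']

# Unified knowledge-base query across all files:
print(len(query(chunks)))  # 3
```

This keeps per-file isolation as a cheap filter on one collection, so the fixed per-collection storage overhead is paid once rather than once per file.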


Reference: github-starred/open-webui#33953