Duplicate Rows in document_chunk table with pgvector #3062

Closed
opened 2025-11-11 15:20:47 -06:00 by GiteaMirror · 11 comments

Originally created by @bgeneto on GitHub (Dec 22, 2024).

Bug Report

Installation Method

Docker compose

Environment

  • Open WebUI Version: v0.4.8

  • Ollama (if applicable): n/a

  • Operating System: Debian 12

  • Browser (if applicable): n/a

Confirmation:

  • [x] I have read and followed all the instructions provided in the README.md.
  • [x] I am on the latest version of both Open WebUI and Ollama.
  • [x] I have included the browser console logs.
  • [x] I have included the Docker container logs.
  • [x] I have provided the exact steps to reproduce the bug in the "Steps to Reproduce" section below.

Issue Description:

I've identified a bug where, when using VECTOR_DB="pgvector", the document_chunk table accumulates duplicate rows. This appears to stem from the system storing vector embeddings not only for the individual file but also for the associated Knowledge. This duplication leads to unnecessary storage consumption, which can become significant with large datasets and extensive RAG (Retrieval-Augmented Generation) usage.

Bug Summary:

The document_chunk table stores duplicate rows when using VECTOR_DB="pgvector". This is because the system stores the same vector embeddings under both the file UUID and the Knowledge UUID, which leads to inefficient storage use.

Expected Behavior:

When using VECTOR_DB="pgvector", I expect the document_chunk table to store only one set of vector embeddings per document/file. The system should not store redundant embeddings for the same file. I think that the relationship between chunks and their source file should be maintained through metadata or a separate reference, not by duplicating embeddings.
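
For illustration only, the metadata-based reference mentioned above could look something like this (a hypothetical sketch: collection_ids is an invented vmetadata key, not an existing Open WebUI field, and vmetadata is assumed to be a jsonb column; add casts if it is plain json):

```python
# Hypothetical sketch of a metadata-based reference: instead of duplicating
# rows, tag each file chunk's vmetadata with the knowledge collections it
# belongs to, and filter on that key at query time. "collection_ids" is an
# invented key, not an existing Open WebUI field.
import psycopg2

def tag_chunks_with_knowledge(dsn: str, file_id: str, knowledge_id: str) -> None:
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE document_chunk
            SET vmetadata = jsonb_set(
                coalesce(vmetadata, '{}'::jsonb),
                '{collection_ids}',
                coalesce(vmetadata -> 'collection_ids', '[]'::jsonb)
                    || to_jsonb(%s::text)
            )
            WHERE collection_name = %s;
            """,
            (knowledge_id, f"file-{file_id}"),
        )
    conn.close()
```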

Actual Behavior:

When using VECTOR_DB="pgvector", the document_chunk table contains duplicate rows. Specifically, for each document added to a knowledge source, one set of embeddings is stored for the individual file and a second, identical set is stored for the Knowledge itself. This results in multiple rows in the document_chunk table representing the same document, leading to inflated storage usage.

Reproduction Details

Steps to Reproduce:

  1. Set the VECTOR_DB environment variable to "pgvector".
  2. Start the OpenWebUI application.
  3. Create a new Knowledge source.
  4. Upload a document (e.g., a PDF or text file) to the newly created Knowledge source.
  5. Observe the document_chunk table in the PostgreSQL database (a query sketch for this check follows these steps).
  6. You will find duplicated rows for the same document.
  7. Repeat steps 3-6 with different documents and observe the same duplication pattern.
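
For step 5, a quick duplicate check could look like the following (a sketch: the connection string is a placeholder, and the text/collection_name column names are assumed from Open WebUI's pgvector backend):

```python
# Sketch for step 5: list chunk texts that appear more than once, and the
# collections they appear in. The connection string is a placeholder, and
# the document_chunk column names (text, collection_name) are assumed
# from Open WebUI's pgvector backend.
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/openwebui")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT md5(text) AS chunk_hash,
               count(*) AS copies,
               array_agg(DISTINCT collection_name) AS collections
        FROM document_chunk
        GROUP BY md5(text)
        HAVING count(*) > 1
        ORDER BY copies DESC
        LIMIT 20;
        """
    )
    for chunk_hash, copies, collections in cur.fetchall():
        print(copies, collections, chunk_hash)
conn.close()
```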

Relevant logs:

In the logs, we can see the intentional duplication:

```
Inserted 557 items into collection 'file-ae9fff53-e062-47ef-970a-01e0b49381fc'.
Inserted 557 items into collection 'a31e1107-cc16-405e-a8fa-4025afd4b529'.
```

Additional Information:

The issue was specifically observed with VECTOR_DB="pgvector" configured. I haven't inspected other vector storage engines, but this issue may be present in them as well.

@tjbck commented on GitHub (Dec 22, 2024):

pgvector isn't officially supported, PR welcome here!

@bgeneto commented on GitHub (Dec 22, 2024):

> pgvector isn't officially supported, PR welcome here!

Wow, but it's still functioning better (and faster) than the built-in ChromaDB for me. It seems promising... Additionally, we can utilize PostgreSQL as the backend database for the entire OpenWebUI appliance, reducing dependencies/requirements.

@beastech commented on GitHub (Dec 25, 2024):

I'm very interested in this functionality, what still needs to be done for pgvector to be officially supported?

@tjbck commented on GitHub (Dec 25, 2024):

Community contributions are always welcome.

@bgeneto commented on GitHub (Dec 25, 2024):

It is already working fine for me with Jina Embeddings v3, particularly when used with a (max) chunk size of 1536. One important note is to ensure that the vector extension is installed on your database. I encountered significant issues with ChromaDB using SQLite, as it frequently stopped working when the collection size increased, requiring me to reset it and resulting in data loss after some time.

@beastech commented on GitHub (Dec 25, 2024):

I had the same problem with ChromaDB when a lot of users started uploading a lot of documents. Milvus has worked much better.

I'm going to test it with pgvector soon.

@jk-f5 commented on GitHub (Dec 31, 2024):

This isn't actually a bug in the pgvector implementation. It is true that duplicate entries are being created for a knowledge collection in document_chunk, but VECTOR_DB_CLIENT.insert is being called twice from backend/open_webui/routers/retrieval.py.

In other words, I think this issue exists for all vectordb implementations.

I'll see if I can figure out what's going on.

@jk-f5 commented on GitHub (Jan 1, 2025):

@tjbck - This is a bug in the UI when uploading files for a document collection. When a file is uploaded for a collection:

  • /api/v1/files is called first; it invokes process_file, which creates embeddings for the uploaded file under the collection name file-{file.id}. This is exactly what happens when uploading a file through the chat prompt.
  • Next, /api/v1/knowledge/<uuid>/file/add is called, which calls process_file again, this time with a collection_id set to an id generated for the uploaded file in the knowledge collection.

I believe the frontend just needs to be altered not to POST to /api/v1/files when adding files to knowledge collections, which would prevent this from happening. (The sketch below illustrates the resulting double insert.)
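
A minimal sketch of that flow, with simplified names (process_file below is a stand-in for the real function in backend/open_webui/routers/retrieval.py, not the actual code):

```python
# Illustration of the double-insert flow described above. process_file
# here is a simplified stand-in for the real function in
# backend/open_webui/routers/retrieval.py, which chunks, embeds, and
# inserts the file's content into the vector DB.

def process_file(filename: str, collection_name: str) -> None:
    print(f"embedding {filename!r} into collection {collection_name!r}")

def upload_to_knowledge(filename: str, file_id: str, knowledge_id: str) -> None:
    # 1. POST /api/v1/files -> process_file() embeds the file into its
    #    own collection, named "file-{file.id}".
    process_file(filename, collection_name=f"file-{file_id}")

    # 2. POST /api/v1/knowledge/<uuid>/file/add -> process_file() runs
    #    again for the knowledge collection, so the same chunks are
    #    embedded and inserted a second time.
    process_file(filename, collection_name=knowledge_id)

upload_to_knowledge("report.pdf", "ae9fff53", "a31e1107")
```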

@tjbck commented on GitHub (Jan 1, 2025):

@jk-f5 No, both are required to work. The first call ensures embeddings are created for the file itself, which is essential for the file to be processed and represented properly. The second call to process_file is necessary to associate those embeddings with the knowledge collection in the vector DB. Removing the first call would break isolated file processing, and skipping the second call would mean the file won't integrate into the relevant knowledge collection. Both steps are integral to maintaining proper functionality.

@jk-f5 commented on GitHub (Jan 1, 2025):

Hmmm, how would you suggest we keep from creating the embeddings twice? Prevent the second call from creating them again and associate the first set of embeddings?

@tjbck commented on GitHub (Jan 1, 2025):

I'm sure we could explore optimization techniques to prevent creating embeddings twice, but the bottom line is that both calls are essential and duplicated embeddings exist by design. The intent is to allow users to attach individual files from the knowledge collection in isolation, which requires retaining embeddings for both the standalone file and the collection.

That said, I agree there’s room to optimize this and avoid spending computational resources on embedding the same content twice. For example, we could implement a mechanism to reuse the first set of embeddings when associating the file with the knowledge collection, rather than reprocessing it. However, this would require careful handling to maintain the functionality and flexibility we currently have.
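
For example, such reuse could be sketched roughly as follows (an illustration, not a tested patch: the document_chunk columns are assumed from Open WebUI's pgvector schema, and id conflicts plus cleanup on file removal are not handled):

```python
# Sketch of the reuse idea: copy the rows already embedded for
# "file-{file_id}" into the knowledge collection instead of re-embedding.
# The document_chunk columns (id, collection_name, text, vector, vmetadata)
# are assumed from Open WebUI's pgvector schema.
import psycopg2

def attach_file_to_knowledge(dsn: str, file_id: str, knowledge_id: str) -> None:
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO document_chunk (id, collection_name, text, vector, vmetadata)
            SELECT gen_random_uuid()::text,  -- fresh ids for the copies (PostgreSQL 13+)
                   %s, text, vector, vmetadata
            FROM document_chunk
            WHERE collection_name = %s;
            """,
            (knowledge_id, f"file-{file_id}"),
        )
    conn.close()
```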


Reference: github-starred/open-webui#3062