Duplicate Rows in document_chunk table with pgvector #3062
Originally created by @bgeneto on GitHub (Dec 22, 2024).
Bug Report
Installation Method
Docker compose
Environment
Open WebUI Version: v0.4.8
Ollama (if applicable): n/a
Operating System: Debian 12
Browser (if applicable): n/a
Confirmation:
Issue Description:
I've identified a bug where, when using `VECTOR_DB="pgvector"`, the `document_chunk` table experiences duplicate row entries. This appears to stem from the system storing vector embeddings not only for individual files but also for the associated Knowledge. This duplication leads to unnecessary storage consumption, which can become significant when dealing with large datasets and extensive RAG (Retrieval Augmented Generation) usage.

Bug Summary:
The `document_chunk` table stores duplicate rows when using `VECTOR_DB="pgvector"`. This is due to the system storing the same vector embeddings for both the file UUID and the Knowledge UUID, leading to inefficient storage usage.

Expected Behavior:
When using `VECTOR_DB="pgvector"`, I expect the `document_chunk` table to store only one set of vector embeddings per document/file. The system should not store redundant embeddings for the same file. I think that the relationship between chunks and their source file should be maintained through metadata or a separate reference, not by duplicating embeddings.

Actual Behavior:
When using `VECTOR_DB="pgvector"`, the `document_chunk` table contains duplicate rows. Specifically, for each document added to a knowledge source, there are embeddings stored for the individual file and also an additional set of embeddings stored for the entire Knowledge itself. This results in multiple rows in the `document_chunk` table that represent the same document, leading to inflated storage usage.

Reproduction Details
Steps to Reproduce:
1. Set the `VECTOR_DB` environment variable to `"pgvector"`.
2. Add a document/file to a Knowledge collection.
3. Inspect the `document_chunk` table in the PostgreSQL database (a query sketch follows below).
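For step 3, a small ad-hoc query can make the duplication visible. This is a hypothetical sketch, assuming the pgvector backend's `document_chunk` table has `collection_name` and `text` columns, that `psycopg2` is installed, and that the placeholder connection string is replaced with your own:

```python
# Hypothetical duplicate check for the document_chunk table (pgvector backend).
# Column names (collection_name, text) and the DSN are assumptions; adjust to
# match your deployment.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@localhost:5432/openwebui")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT text,
               COUNT(*)                        AS copies,
               COUNT(DISTINCT collection_name) AS collections
        FROM document_chunk
        GROUP BY text
        HAVING COUNT(*) > 1
        ORDER BY copies DESC
        LIMIT 20
        """
    )
    for text, copies, collections in cur.fetchall():
        # identical chunk text stored more than once, typically under both the
        # file-<uuid> collection and the knowledge collection
        print(f"{copies} copies across {collections} collections: {text[:60]!r}")
conn.close()
```

If the bug is present, every chunk of a file that belongs to a Knowledge shows up with at least two copies, one per collection.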
Relevant logs:

In the logs we can see the intentional duplication:
Additional Information:
The issue is specifically observed when `VECTOR_DB="pgvector"` is configured. I didn't inspect other vector storage engines. But maybe this "issue" is present in other engines also.

@tjbck commented on GitHub (Dec 22, 2024):
`pgvector` isn't officially supported, PR welcome here!

@bgeneto commented on GitHub (Dec 22, 2024):
Wow, but it's still functioning better (and faster) than the built-in ChromaDB for me. It seems promising... Additionally, we can utilize PostgreSQL as the backend database for the entire OpenWebUI appliance, reducing dependencies/requirements.
@beastech commented on GitHub (Dec 25, 2024):
I'm very interested in this functionality, what still needs to be done for pgvector to be officially supported?
@tjbck commented on GitHub (Dec 25, 2024):
Community contributions are always welcome.
@bgeneto commented on GitHub (Dec 25, 2024):
It is already working fine for me with Jina Embeddings v3, particularly when used with a (max) chunk size of 1536. One important note is to ensure that the vector extension is installed on your database (see the sketch below). I encountered significant issues with ChromaDB using SQLite, as it frequently stopped working when the collection size increased, requiring me to reset it and resulting in data loss after some time.
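For reference, the extension check mentioned above is a one-time, per-database step. A minimal sketch, assuming a psycopg2 connection with sufficient privileges, a placeholder DSN, and the pgvector package already installed on the PostgreSQL host:

```python
# One-time setup sketch: create the pgvector extension if it is missing.
# The DSN is a placeholder; CREATE EXTENSION needs appropriate privileges.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/openwebui")
conn.autocommit = True  # avoid having to call conn.commit() explicitly
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print(cur.fetchone())  # shows the installed pgvector version
conn.close()
```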
@beastech commented on GitHub (Dec 25, 2024):
I had the same problem with ChromaDB when a lot of users started uploading a lot of documents. Milvus has worked much better.
I'm going to test it with pgvector soon.
@jk-f5 commented on GitHub (Dec 31, 2024):
This isn't actually a bug in the pgvector implementation. It is true that duplicate entries are being created for a knowledge collection in `document_chunk`, but the upstream call to `VECTOR_DB_CLIENT.insert` is being called twice from `backend/open_webui/routers/retrieval.py`. In other words, I think this issue exists for all vector DB implementations.
I'll see if I can figure out what's going on.
@jk-f5 commented on GitHub (Jan 1, 2025):
@tjbck - This is a bug in the UI when uploading files for a document collection. When a file is uploaded for a collection:
1. `/api/v1/files` is called, then `process_file` is called, which creates embeddings for the uploaded file with a collection name of `file-{file.id}`; this is exactly what happens when uploading a file through the chat prompt.
2. `/api/v1/knowledge/<uuid>/file/add` is called, which calls `process_file` again, this time with a collection_id set to an id generated for the uploaded file in the knowledge collection.

I believe the frontend just needs to be altered to not POST to `/api/v1/files` when adding files to knowledge collections to prevent this from happening.
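A minimal, self-contained sketch of the two-call flow described in this comment. None of this is actual Open WebUI code; the endpoint names and the `process_file`/`VECTOR_DB_CLIENT.insert` names are borrowed from the discussion, everything else is illustrative:

```python
# Illustrative stand-in for the double-insert flow; not Open WebUI code.
inserted = []  # pretend this is the document_chunk table


class FakeVectorDB:
    def insert(self, collection_name, items):
        # every call stores another full copy of the chunks
        inserted.extend((collection_name, chunk) for chunk in items)


VECTOR_DB_CLIENT = FakeVectorDB()


def process_file(file_id, collection_name=None):
    chunks = ["chunk 1 of the document", "chunk 2 of the document"]  # pretend chunking + embedding
    target = collection_name or f"file-{file_id}"
    VECTOR_DB_CLIENT.insert(target, chunks)


# Step 1: POST /api/v1/files -> embeddings stored under "file-<id>"
process_file("abc123")

# Step 2: POST /api/v1/knowledge/<uuid>/file/add -> the same content is
# embedded and stored again under the knowledge collection's id
process_file("abc123", collection_name="knowledge-uuid")

for collection, chunk in inserted:
    print(collection, "->", chunk)
# Every chunk appears twice: once under file-abc123 and once under knowledge-uuid.
```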
@tjbck commented on GitHub (Jan 1, 2025):

@jk-f5 No, both are required to work. The first call ensures embeddings are created for the file itself, which is essential for the file to be processed and represented properly. The second call to `process_file` is necessary to associate those embeddings with the knowledge collection in the vector DB. Removing the first call would break isolated file processing, and skipping the second call would mean the file won't integrate into the relevant knowledge collection. Both steps are integral to maintaining proper functionality.

@jk-f5 commented on GitHub (Jan 1, 2025):
Hmmm, how would you suggest we keep from creating the embeddings twice? Prevent the second call from creating them again and associate the first set of embeddings?
@tjbck commented on GitHub (Jan 1, 2025):
I'm sure we could explore optimization techniques to prevent creating embeddings twice, but the bottom line is that both calls are essential and duplicated embeddings exist by design. The intent is to allow users to attach individual files from the knowledge collection in isolation, which requires retaining embeddings for both the standalone file and the collection.
That said, I agree there’s room to optimize this and avoid spending computational resources on embedding the same content twice. For example, we could implement a mechanism to reuse the first set of embeddings when associating the file with the knowledge collection, rather than reprocessing it. However, this would require careful handling to maintain the functionality and flexibility we currently have.
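One possible shape of that optimization, as a hedged sketch rather than a proposal for the actual implementation: when a file is attached to a knowledge collection, copy the rows already stored under its `file-{id}` collection instead of re-running the embedding model. The column names (`collection_name`, `text`, `vmetadata`, `vector`) are assumptions about the pgvector backend's `document_chunk` table, and `attach_file_to_knowledge` is a hypothetical helper:

```python
# Hypothetical sketch: reuse already-stored embeddings by copying document_chunk
# rows from the file's collection into the knowledge collection (pgvector only).
import psycopg2


def attach_file_to_knowledge(conn, file_id, knowledge_id):
    """Copy the file's chunks into the knowledge collection without re-embedding.

    Assumes PostgreSQL 13+ for gen_random_uuid(); fresh primary keys keep the
    copied rows from colliding with the originals.
    """
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO document_chunk (id, collection_name, text, vmetadata, vector)
            SELECT gen_random_uuid()::text, %s, text, vmetadata, vector
            FROM document_chunk
            WHERE collection_name = %s
            """,
            (knowledge_id, f"file-{file_id}"),
        )
    conn.commit()


# Usage (placeholder DSN and ids):
# conn = psycopg2.connect("postgresql://user:password@localhost:5432/openwebui")
# attach_file_to_knowledge(conn, "abc123", "knowledge-uuid")
```

This still keeps two copies of the rows, matching the current by-design behavior of attaching files in isolation, but it skips the second round of chunking and embedding.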