[GH-ISSUE #12655] issue: RAG for an entire knowledge collection cites the first source #55338

Closed
opened 2026-05-05 17:27:34 -05:00 by GiteaMirror · 7 comments

Originally created by @Elmolesto on GitHub (Apr 9, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/12655

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Git Clone

Open WebUI Version

v0.6.2

Ollama Version (if applicable)

No response

Operating System

macOS Sequoia

Browser (if applicable)

Chrome

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have listed steps to reproduce the bug in detail.

Expected Behavior

When using RAG with an entire knowledge collection, the system should accurately cite all relevant sources contributing to the generated answer.

Actual Behavior

When querying using RAG with an entire knowledge collection, the generated response consistently cites only the first source in the collection, regardless of the actual documents retrieved or referenced.

However, when the same prompt is run with individual files selected (instead of the entire collection), the citations are accurate and reflect the true sources of the information. This suggests the issue is specific to how citations are handled when using full collections.

Steps to Reproduce

  1. Create a knowledge collection with two or more files.
  2. Go to the chat interface.
  3. Choose the knowledge collection.
  4. Enter a prompt that generates statements about a topic from the collection. Example: "You are tasked with generating a concise, five factual statement about research studies attached related to AI's impact"
  5. Observe the generated response and note the cited source(s).
  6. Open a new chat.
  7. Repeat the same prompt, but this time manually select the individual documents from the same collection instead of using the full collection.
  8. Compare the citations in both responses.

Logs & Screenshots

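RAG on collection: NOT WORKING
Image: https://github.com/user-attachments/assets/59a55015-3a16-496f-b697-559ebe1fb936

RAG on same files: WORKING
Image: https://github.com/user-attachments/assets/ca5e7c16-cfb7-4b49-9ca9-75d11b8706ea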

Additional Information

After debugging the issue, I found that it might have originated in the get_sources_from_files method in retrieval/utils.py.

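Here are the attached JSON dumps of the variable relevant_contexts in both scenarios:
relevant_contexts_on_collection.json: https://github.com/user-attachments/files/19669258/relevant_contexts_on_collection.json
relevant_contexts_on_files.json: https://github.com/user-attachments/files/19669259/relevant_contexts_on_files.json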

Specifically, the generation of the sources list for collections leads to incorrect citation behaviour because it only references documents[0], thereby using only the first document.
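
The suspected pattern can be illustrated with a minimal sketch. This is not the actual code from retrieval/utils.py: the data shape (parallel `documents` and `metadatas` lists with one inner list per query), the `source` metadata key, and both function names are assumptions made purely for illustration.

```python
def first_source_only(context):
    """Hypothetical buggy pattern: credit every chunk to the first chunk's file."""
    documents = context["documents"][0]
    metadatas = context["metadatas"][0]
    # Only the first metadata entry is consulted, so one file gets all the credit.
    return [{"source": metadatas[0]["source"], "documents": list(documents)}]


def group_by_source(context):
    """Hypothetical alternative: one source entry per file that contributed chunks."""
    documents = context["documents"][0]
    metadatas = context["metadatas"][0]
    grouped = {}  # dicts keep insertion order, so sources appear in retrieval order
    for text, meta in zip(documents, metadatas):
        grouped.setdefault(meta["source"], []).append(text)
    return [{"source": name, "documents": docs} for name, docs in grouped.items()]


# Made-up retrieval result for a collection containing two files.
context = {
    "documents": [["chunk a1", "chunk b1", "chunk a2"]],
    "metadatas": [[{"source": "a.pdf"}, {"source": "b.pdf"}, {"source": "a.pdf"}]],
}

print(first_source_only(context))  # a single entry citing a.pdf for all three chunks
print(group_by_source(context))    # separate entries for a.pdf and b.pdf
```

Grouping by per-chunk metadata is only one possible way to attribute citations; the fix that was eventually merged may work differently.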

Image: https://github.com/user-attachments/assets/f7cb51f9-1878-460b-8bd5-570f46daf68a

RELATED: https://github.com/open-webui/open-webui/discussions/10595#discussioncomment-12484708

GiteaMirror added the bug label 2026-05-05 17:27:34 -05:00

@almajo commented on GitHub (Apr 9, 2025):

This is being worked on: https://github.com/open-webui/open-webui/pull/12562


@athoik commented on GitHub (Apr 10, 2025):

@almajo thank you! 🥇 It's really a great improvement!

@tjbck please consider taking a look at this improvement.


@Elmolesto commented on GitHub (Apr 10, 2025):

> This is being worked on: #12562

Thanks! I've tested the fix, and it's working.

However, this may need to go to a discussion: The context generated for RAG on a collection is shorter than the context generated when you RAG on the same files, loaded as files. This is because of the topK, but here's the question: is this a desired behaviour?
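
A small sketch of why topK could produce this difference. It assumes, without confirmation from the codebase, that a collection is searched once with a single top-K limit while individually attached files are each searched separately with their own top-K limit. The file names, scores, and the `top_k` helper are all made up for illustration.

```python
def top_k(scored_chunks, k):
    """Return the k highest-scoring (chunk, score) pairs."""
    return sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up relevance scores for chunks from two files.
files = {
    "a.pdf": [("a1", 0.9), ("a2", 0.8), ("a3", 0.4)],
    "b.pdf": [("b1", 0.7), ("b2", 0.6), ("b3", 0.5)],
}
K = 2

# Assumed collection behaviour: one search over every chunk, capped at K in total.
all_chunks = [pair for chunks in files.values() for pair in chunks]
collection_hits = top_k(all_chunks, K)

# Assumed per-file behaviour: one search per file, each capped at K.
per_file_hits = [pair for chunks in files.values() for pair in top_k(chunks, K)]

print(len(collection_hits), [name for name, _ in collection_hits])
print(len(per_file_hits), [name for name, _ in per_file_hits])
```

Under these assumptions the collection query yields K chunks in total, while N attached files yield up to N × K chunks, which would account for the shorter context.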


@tjbck commented on GitHub (Apr 10, 2025):

https://github.com/open-webui/open-webui/pull/12562 Merged!


@Elmolesto commented on GitHub (Apr 10, 2025):

> #12562 Merged!
>
> This is being worked on: #12562
>
> Thanks! I've tested the fix, and it's working.
>
> However, this may need to go to a discussion: The context generated for RAG on a collection is shorter than the context generated when you RAG on the same files, loaded as files. This is because of the topK, but here's the question: is this a desired behaviour?

@almajo could we expand on this? Should I open a discussion?

CC @tjbck


@tjbck commented on GitHub (Apr 10, 2025):

Intended behaviour.


@controldev commented on GitHub (May 14, 2025):

This is still broken (at least for web search) in 0.6.9.

Reference: github-starred/open-webui#55338