[GH-ISSUE #23583] feat: Improve document loading without RAG for prompt caching #35548

Closed
opened 2026-04-25 09:44:54 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @arty-hlr on GitHub (Apr 10, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23583

Check Existing Issues

  • I have searched for all existing open AND closed issues and discussions for similar requests. I have found none that is comparable to my request.

Verify Feature Scope

  • I have read through and understood the scope definition for feature requests in the Issues section. I believe my feature request meets the definition and belongs in the Issues section instead of the Discussions.

Problem Description

With "Bypass Embedding and Retrieval" enabled, the RAG template still appears to be used, and when a system prompt is also set (for example from a folder), the resulting message order is very inefficient for prompt caching:

Expected prompt/document order:

System prompt
RAG template
Document
User query
Assistant answer
User query
Assistant answer

Actual prompt/document order:

System prompt
User query
Assistant answer
RAG template
Document
User query
Assistant answer

Problem

This behavior unfortunately invalidates prompt caching repeatedly as you'd have:

System prompt
RAG template
Document
User query
Assistant answer

but then at the next turn:

System prompt
User query
Assistant answer
RAG template
Document
User query
Assistant answer

This is especially problematic when the document is large (50-100k tokens), as moving it around invalidates the whole cache.
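The cache impact of the two orderings can be sketched in a few lines (illustrative Python, not Open WebUI code; prompt caching providers typically reuse only the longest prefix that is identical to the previous request):

```python
# Illustrative sketch: why the injection point matters for prefix-based
# prompt caching. Only the leading run of identical messages between two
# consecutive requests can be served from cache.

def common_prefix_len(a, b):
    """Number of leading messages shared by two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM, RAG, DOC = "system prompt", "RAG template", "document (50-100k tokens)"

# Expected ordering: the document stays pinned right after the system prompt.
turn1_expected = [SYSTEM, RAG, DOC, "user q1"]
turn2_expected = [SYSTEM, RAG, DOC, "user q1", "assistant a1", "user q2"]

# Actual ordering: the RAG template + document are re-inserted before the
# latest user query, so they move on every turn.
turn1_actual = [SYSTEM, RAG, DOC, "user q1"]
turn2_actual = [SYSTEM, "user q1", "assistant a1", RAG, DOC, "user q2"]

print(common_prefix_len(turn1_expected, turn2_expected))  # 4: doc stays cached
print(common_prefix_len(turn1_actual, turn2_actual))      # 1: cache busted after the system prompt
```

With the actual ordering, everything after the system prompt, including the large document, is re-sent and re-processed on every turn.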

It is also clear in the UI in the way the document is retrieved:

(Screenshot: https://github.com/user-attachments/assets/b5a8baa6-1ce4-40b0-83ef-b89ba1b9e85c)

It shows "Retrieved 1 source" before each assistant response. This makes sense when actually using RAG and dynamically getting chunks based on the user's query, but with "Bypass Embedding and Retrieval" mode the whole document is supposed to be included statically, so it shouldn't be added before each new user query.

Desired Solution you'd like

Have "Bypass Embedding and Retrieval" put the RAG template and the document at the place where the user asks for it. For example if the user adds the document in the first message, put the RAG template and document after the system prompt and before the user query. If the user adds a document in a later turn, put the RAG template and the document before that query. The main point would be not to add the document in "Bypass Embedding and Retrieval" before each user query, but leave it at the point where it was added.
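The request above can be sketched with a hypothetical `build_prompt` helper (not Open WebUI's actual pipeline): the document is injected exactly once, at the turn where it was attached, so every later turn only extends a stable prefix.

```python
# Hypothetical sketch of the requested behavior (not Open WebUI code):
# pin the RAG template + document to the turn where the file was attached,
# instead of re-inserting them before every new user query.

def build_prompt(system_prompt, turns):
    """turns: list of dicts like
    {"user": str, "assistant": str | None, "attached_doc": str | None}."""
    messages = [{"role": "system", "content": system_prompt}]
    for turn in turns:
        # If a document was attached on this turn, inject it once, here.
        # Earlier turns (and the document itself, if attached early) then
        # form a stable prefix that prompt caching can reuse.
        if turn.get("attached_doc"):
            messages.append({
                "role": "user",
                "content": f"<context>\n{turn['attached_doc']}\n</context>",
            })
        messages.append({"role": "user", "content": turn["user"]})
        if turn.get("assistant"):
            messages.append({"role": "assistant", "content": turn["assistant"]})
    return messages
```

Rebuilding the prompt after a new turn produces a message list whose beginning is identical to the previous one, which is exactly what prefix caching needs.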

Alternatives Considered

No alternative possible, as far as I know.

Additional Context

No response


@arty-hlr commented on GitHub (Apr 11, 2026):

Based on https://docs.openwebui.com/troubleshooting/performance/, setting RAG_SYSTEM_PROMPT fixes the issue for prompt caching. This should definitely be the default when "Bypass Embedding and Retrieval" is enabled, though!
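For deployments that want to try this workaround, the variable can be set like any other Open WebUI environment variable. A hypothetical docker-compose fragment (the template text is purely illustrative; see the linked performance docs for the variable's documented behavior):

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # Illustrative value only: per the comment above, setting this
      # variable avoids the cache-busting reordering with
      # "Bypass Embedding and Retrieval".
      RAG_SYSTEM_PROMPT: >-
        Use the provided context to answer the user's questions.
```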
