Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-07 03:18:23 -05:00)
[GH-ISSUE #17123] feat: Add "Prompt Cache Mode" Toggle in Admin Panel for Tasks #18176
Originally created by @StevePierce on GitHub (Sep 1, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/17123
Check Existing Issues
Problem Description
I am currently working on using MLX-LM as an inference engine and building a wrapper around it to provide Ollama-like functionality. One of the best features of MLX-LM, which I've just gotten working in this new tool, is prompt caching. Importantly, prompt caching relies heavily on the model having seen the exact same text before, so that we can get a hash match on the message trail and pick out the relevant prompt cache entry.
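To make the prefix-matching requirement concrete, here is a minimal sketch of a prompt cache keyed by a hash of the serialized message trail. This is illustrative only, not MLX-LM's actual API; `prompt_cache`, `prefix_key`, and `lookup` are hypothetical names.

```python
import hashlib

# Hypothetical prompt cache: maps a hash of a message-trail prefix to the
# precomputed KV state for that prefix. A hit requires the model to have
# seen the exact same leading text before.
prompt_cache: dict[str, object] = {}

def prefix_key(messages: list[dict]) -> str:
    """Hash the message trail so identical prefixes map to one cache entry."""
    serialized = "\x1e".join(f"{m['role']}:{m['content']}" for m in messages)
    return hashlib.sha256(serialized.encode()).hexdigest()

def lookup(messages: list[dict]):
    """Return cached state for the longest matching message prefix, if any."""
    for end in range(len(messages), 0, -1):
        key = prefix_key(messages[:end])
        if key in prompt_cache:
            return prompt_cache[key], end  # cached state, messages covered
    return None, 0
```

The key point is that any reformatting of earlier messages changes the hash and forfeits the cached work.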
The Task API calls unfortunately do not follow that format (an append-only log of messages). Instead, they are typically formatted as:
(and the same for the title and follow-ups tasks)
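As a rough illustration of the shape described above (field names and prompt wording here are assumptions, not Open WebUI's exact payload), the task request flattens the whole conversation into one fresh user message:

```python
# Illustrative conversation history (append-only message trail).
conversation = [
    {"role": "user", "content": "What is prompt caching?"},
    {"role": "assistant", "content": "Prompt caching reuses computed KV state."},
]

# The task request embeds the transcript inside a SINGLE new user message,
# rather than reusing the append-only trail.
transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in conversation)
task_request = {
    "messages": [
        {
            "role": "user",
            "content": f"Generate a concise title for this chat:\n\n{transcript}",
        }
    ]
}
# Because the transcript lives inside one freshly formatted message, its text
# never matches the prefix of the original conversation, so a prefix-based
# prompt cache cannot reuse any of the earlier processing.
```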
Because there is no pre-processed conversational trail, this poses a problem for a prompt-cache-reliant system, resulting in redundant prompt processing at every step as the three tasks (title, tags, and follow-ups) are called after every completion.
This leads to significant performance degradation, especially at higher context lengths, which impedes Open WebUI's fluidity and means the user can end up waiting for multiple tasks to complete.
Additional side-note: I am a bit confused about the title regeneration task being called after every chat completion, since the titles do not get continually updated. Is that a bug?
Desired Solution
I think it may be beneficial to provide a toggle for "Prompt Cache Mode" in the Admin Panel > Settings > Interface menu area, under the controls for Tasks, which lets the admin alter the mode the task tooling uses. So instead of jamming the entire conversation context into a single message, which then requires raw front-to-back processing all over again (three times in a row for three separate tasks), you could do:
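A minimal sketch of the cache-friendly alternative described above (names like `build_task_request` are hypothetical): keep the original append-only message trail intact and append the task instruction as one new message, so all three tasks share the conversation as an exact prefix.

```python
# Same illustrative conversation as before.
conversation = [
    {"role": "user", "content": "What is prompt caching?"},
    {"role": "assistant", "content": "Prompt caching reuses computed KV state."},
]

def build_task_request(conversation: list[dict], instruction: str) -> dict:
    """Keep the trail append-only: conversation + one trailing task message."""
    return {"messages": conversation + [{"role": "user", "content": instruction}]}

title_req = build_task_request(conversation, "Generate a concise title for this chat.")
tags_req = build_task_request(conversation, "Generate 1-3 tags for this chat.")
# Both requests share `conversation` as an exact prefix, so on a cache hit
# only the short trailing instruction needs fresh processing.
```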
This would enable the entire conversation up to the task to be pre-processed and cached, resulting in a vast performance improvement for anyone using prompt caching. For a longer conversation at a 20k context length, an action which previously required processing 20k tokens × 3 chat-completion API calls = 60k tokens of processing could be reduced to just a few hundred tokens of processing done 3 times. This is the difference between near-immediate results no matter where you are in the conversation, and potentially waiting minutes for a follow-up to complete.
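The arithmetic above can be checked directly; the per-task instruction size here (200 tokens) is an assumed figure for illustration.

```python
# Token processing with and without prompt caching, using the numbers
# from the paragraph above.
context_tokens = 20_000    # conversation length in tokens
tasks = 3                  # title, tags, follow-ups
instruction_tokens = 200   # assumed size of each appended task instruction

without_cache = context_tokens * tasks        # full reprocessing per task
with_cache = instruction_tokens * tasks       # only the trailing instruction
savings = 1 - with_cache / without_cache      # fraction of work avoided
```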
Alternatives Considered
I've considered just solving the problem on my side, in the project I am working on, by transforming the Open WebUI-style task request; however, I think a toggle for this would benefit all users of Open WebUI. Additionally, putting that onus on downstream users may result in ecosystem fragility (if, say, Open WebUI's maintainers wanted to modify a task).
Additional Context
No response
@tjbck commented on GitHub (Sep 1, 2025):
You can implement this as a Function, keep us updated!