[GH-ISSUE #17123] feat: Add "Prompt Cache Mode" Toggle in Admin Panel for Tasks #18176

Closed
opened 2026-04-20 00:23:34 -05:00 by GiteaMirror · 1 comment

Originally created by @StevePierce on GitHub (Sep 1, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/17123

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

I am currently working on utilizing MLX-LM as an inference engine and building a wrapper around it to provide Ollama-like functionality. One of the best features of MLX-LM, which I've just gotten working in this new tool, is prompt caching. Importantly, prompt caching relies heavily upon the model having seen the exact text before, so we can match a hash of the message trail and pick out the relevant prompt cache.
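For illustration, here is a minimal sketch in Python of the kind of lookup I mean. The cache store and function names are hypothetical (this is not MLX-LM's actual API); the point is only that reuse depends on an exact, byte-identical prefix of the message trail:

import hashlib
import json

# Hypothetical in-memory store: hash of a serialized message prefix -> saved KV state.
prompt_cache: dict[str, object] = {}

def prefix_key(messages: list[dict]) -> str:
    # Any byte-level difference in the trail produces a different key.
    blob = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def lookup(messages: list[dict]):
    # Walk back from the full trail to the longest cached prefix.
    for end in range(len(messages), 0, -1):
        state = prompt_cache.get(prefix_key(messages[:end]))
        if state is not None:
            return state, messages[end:]  # reuse the state, process only the tail
    return None, messages  # cold start: process everything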

The Task API calls unfortunately do not follow this format (an append-only log of messages). Instead, they are typically formatted as:

[
    {
        "role": "system",
        "content": "A conversation between a curious user and an AI assistant."
    },
    {
        "role": "user",
        "content": "I am the user and my content goes here."
    },
    {
        "role": "assistant",
        "content": "I am the assistant and my response goes here."
    },
    {
        "role": "user",
        "content": "### Task:\nGenerate 1-3 broad tags categorizing the main themes of the chat history, 
        along with 1-3 more specific subtopic tags.\n\n### Guidelines:\n- Start with high-level domains (e.g. Science, 
        Technology, Philosophy, Arts, Politics, Business, Health, Sports, Entertainment, Education)\n- Consider including relevant subfields/subdomains if they are strongly represented throughout the conversation\n
        - If content is too short (less than 3 messages) or too diverse, use only [\"General\"]\n- Use the chat's primary language; default to English if multilingual\n-
        - Prioritize accuracy over specificity\n\n### Output:\n
        - JSON format: { \"tags\": [\"tag1\", \"tag2\", \"tag3\"] }\n\n
        - ### Chat History:\n<chat_history>\n
        - USER: I am the user and my content goes here.\n
        - ASSISTANT: I am the assistant and my response goes here.\n</chat_history>"
    }
]

(and the same for the title and follow-up tasks)

Because there is no pre-processed conversational trail, this poses a problem for a prompt-cache-reliant system, resulting in redundant prompt processing at every step, as the 3 tasks (title, tags, and follow-ups) get called after every completion.

This leads to significant performance degradation, especially at higher context lengths, which impedes Open WebUI's fluidity and means the user may end up waiting for multiple tasks to complete.
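To make the mismatch concrete, here is a toy comparison (purely illustrative, with a simplified chat template): the chat completion sends the history as message objects, while the task call re-renders the same history as plain text inside one user message, so a cache keyed on the exact prior text never matches.

chat_messages = [
    {"role": "user", "content": "I am the user and my content goes here."},
    {"role": "assistant", "content": "I am the assistant and my response goes here."},
]

# The task call embeds the same history in a different serialization:
task_rendering = (
    "### Chat History:\n<chat_history>\n"
    "USER: I am the user and my content goes here.\n"
    "ASSISTANT: I am the assistant and my response goes here.\n</chat_history>"
)

# The bytes the engine processed during the chat never appear as a prefix
# of the task text, so none of the cached state applies.
chat_prefix = "".join(f"{m['role'].upper()}: {m['content']}\n" for m in chat_messages)
print(task_rendering.startswith(chat_prefix))  # False -> full reprocessing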

Additional side note: I am a bit confused about the title-generation task being called after every chat completion, since the title does not get continually updated. Is that a bug?

Desired Solution you'd like

I think it may be beneficial to provide a toggle for "Prompt Cache Mode" in the Admin Panel > Settings > Interface menu area, which holds the controls for Tasks, allowing the admin to alter the format the task tooling uses. So instead of jamming the entire conversation context into a single message, which then requires raw front-to-back processing all over again (three times in a row for 3 separate contexts), you could do:

[
  {
    "role": "system",
    "content": "A conversation between a curious user and an AI assistant."
  },
  {
    "role": "user",
    "content": "I am the user and my content goes here."
  },
  {
    "role": "assistant",
    "content": "I am the assistant and my response goes here."
  },
  // this goes on for a few hundred times maybe
  {
    "role": "user", // this is the task emulating the user
    "content": "### Task:\nGenerate 1-3 broad tags categorizing the main themes of the chat history, along with 1-3 more specific subtopic tags.\n\n### Guidelines:\n
        - Start with high-level domains (e.g. Science, Technology, Philosophy, Arts, Politics, Business, Health, Sports, Entertainment, Education)\n
        - Consider including relevant subfields/subdomains if they are strongly represented throughout the conversation\n
        - If content is too short (less than 3 messages) or too diverse, use only [\"General\"]\n
        - Use the chat's primary language; default to English if multilingual\n
        - Prioritize accuracy over specificity\n\n
        - ### Output:\nJSON format: { \"tags\": [\"tag1\", \"tag2\", \"tag3\"] }"
  }
]

This would enable the entire conversation up until the task to be pre-processed and cached, resulting in a vast performance improvement for anyone using prompt caching. For a longer conversation at 20k context length, an action which previously required processing 20k tokens * 3 chat completion API calls = 60k tokens of processing could be reduced to just a few hundred tokens of processing done 3 times. This is the difference between near-immediate results no matter where you are in the conversation, and potentially waiting minutes for a follow-up to complete.
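As a concrete sketch of what the toggle could do, here is a minimal, purely illustrative transformation (the function name and shapes are my assumptions, not Open WebUI's actual internals): keep the already-processed trail byte-identical and append only the short task instruction.

def to_cache_friendly_payload(history: list[dict], task_instruction: str) -> list[dict]:
    # `history` is the system/user/assistant trail exactly as the last chat
    # completion sent it; `task_instruction` is the short task prompt without
    # the embedded <chat_history> block.
    return [*history, {"role": "user", "content": task_instruction}]

history = [
    {"role": "system", "content": "A conversation between a curious user and an AI assistant."},
    {"role": "user", "content": "I am the user and my content goes here."},
    {"role": "assistant", "content": "I am the assistant and my response goes here."},
]
tags_task = "### Task:\nGenerate 1-3 broad tags categorizing the main themes of the chat history ..."
payload = to_cache_friendly_payload(history, tags_task)
# The first len(history) messages match the cached prefix exactly, so only the
# few hundred tokens of the task instruction need fresh processing.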

Alternatives Considered

I've considered just solving the problem on my side, in the project I am working on, by transforming the Open WebUI-style task request; however, I think it may be beneficial for all users of Open WebUI to have a toggle for this. Additionally, putting that onus on downstream users may result in ecosystem fragility (if, say, OWUI's maintainers wanted to modify a task).
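In case it's useful, here is a rough sketch of what that downstream transformation could look like, written in the shape of an Open WebUI Filter function. The inlet hook is the standard Filter entry point, but the task-detection heuristic and the embedded-history line format are assumptions on my part, and whether task-model calls actually pass through filter inlets may depend on the Open WebUI version, so treat this as a starting point rather than a working solution:

from typing import Optional

class Filter:
    def inlet(self, body: dict, __user__: Optional[dict] = None) -> dict:
        messages = body.get("messages", [])
        content = messages[-1].get("content") if messages else None
        # Heuristic (assumed): a task request ends with one user message that
        # embeds the whole history in a <chat_history> block.
        if isinstance(content, str) and "<chat_history>" in content:
            instruction, _, blob = content.partition("### Chat History:")
            rebuilt = []
            for line in blob.splitlines():
                # Assumed embedded format: "USER: ..." / "ASSISTANT: ..." lines.
                if line.startswith("USER: "):
                    rebuilt.append({"role": "user", "content": line[len("USER: "):]})
                elif line.startswith("ASSISTANT: "):
                    rebuilt.append({"role": "assistant", "content": line[len("ASSISTANT: "):]})
            # Replace the monolithic task message with the rebuilt trail plus
            # the short instruction appended as the final user message.
            body["messages"] = messages[:-1] + rebuilt + [
                {"role": "user", "content": instruction.strip()}
            ]
        return body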

Additional Context

No response


@tjbck commented on GitHub (Sep 1, 2025):

You can implement this as a Function, keep us updated!

Reference: github-starred/open-webui#18176