Postfix title/tag query instead of prefix to avoid trashing KV cache #2548

Closed
opened 2025-11-11 15:09:30 -06:00 by GiteaMirror · 0 comments

Originally created by @robertvazan on GitHub (Nov 3, 2024).

# Feature Request

**Is your feature request related to a problem? Please describe.**

Given some example chat:

> User: Hello.
> AI: Hi.

Title is generated in a new chat that looks like this:

> User: Generate a title for this conversation:
>
> User: Hello.
>
> AI: Hi.
>
> AI: Just Greetings

The same is done for tags. This forces language models to process the conversation three times, even though the conversation can be very long, for example when the model was asked to summarize an article. Since the original content of the KV cache is destroyed in the process, any follow-up question triggers a fourth reprocessing of the conversation.

This badly slows things down with local models, which maintain a dedicated KV cache for the user specifically to speed up prompt processing. Even with cloud models, it can increase cost, at least with OpenAI, which charges less for recycled context (cached input tokens).
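To make the cache behavior concrete, here is a minimal sketch (plain Python; the token lists and the `shared_prefix_len` helper are stand-ins for illustration, not Open WebUI or inference-engine code) of why a KV cache or prompt cache only helps while the new prompt shares leading tokens with the cached one:

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens that can be reused from the KV cache."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

conversation = ["User:", "Hello.", "AI:", "Hi."]  # stand-in "tokens"

# Prefix-style title query: the instruction comes FIRST, so the very
# first tokens already differ from the cached conversation and nothing
# is reused; the whole conversation is reprocessed.
prefix_query = ["User:", "Generate", "a", "title:"] + conversation
print(shared_prefix_len(conversation, prefix_query))   # 0 -> full reprocess

# Postfix-style title query: the conversation comes first, so the entire
# cached conversation is reused and only the short instruction is new.
postfix_query = conversation + ["User:", "Generate", "a", "title."]
print(shared_prefix_len(conversation, postfix_query))  # 4 -> cache hit
```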

**Describe the solution you'd like**

Just reformat the title query as follows:

> User: Hello.
> AI: Hi.
> User: Generate a title for the above conversation.
> AI: Just Greetings

Then discard the last two messages and repeat with tags, then discard the last two messages again and resume the normal conversation.
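A minimal sketch of this flow, assuming the official `openai` Python client (`client.chat.completions.create` is its real API; the model name, the `postfix_query` helper, and the instruction wording are illustrative, not Open WebUI's actual templates):

```python
from openai import OpenAI

client = OpenAI()

def postfix_query(client, model, conversation, instruction):
    """Append a one-off instruction after the conversation, ask the model,
    and return the answer without mutating the original message list."""
    messages = conversation + [{"role": "user", "content": instruction}]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

conversation = [
    {"role": "user", "content": "Hello."},
    {"role": "assistant", "content": "Hi."},
]

# Both queries share the conversation as a token prefix, so the second
# call (and the next regular user turn) can reuse cached input tokens.
title = postfix_query(client, "gpt-4o-mini", conversation,
                      "Generate a title for the above conversation.")
tags = postfix_query(client, "gpt-4o-mini", conversation,
                     "Generate tags for the above conversation.")
# conversation is unchanged; normal chat resumes with the same prefix.
```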

**Describe alternatives you've considered**

Workarounds:

  1. You can use a separate model for titles/tags, but that's extra setup and a strain on RAM/VRAM. Even if you do have a separate model, reusing context between the title and tag queries still saves time and possibly money.
  2. You can disable titles and tags entirely, but that often results in useless titles and a hard-to-search history.
  3. You can change the title/tag templates to use only a short excerpt, which keeps title/tag queries fast, but it still kills the KV cache and slows down the response to the first follow-up query.
Reference: github-starred/open-webui#2548