[GH-ISSUE #638] Separate ollama instance for titles #12157

Closed
opened 2026-04-19 18:58:55 -05:00 by GiteaMirror · 1 comment

Originally created by @robertvazan on GitHub (Feb 3, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/638

**Is your feature request related to a problem? Please describe.**
Auto-generated titles cause serious performance problems. They trash the KV cache in llama.cpp, so the whole prompt and the first response have to be processed again after a follow-up question, which is slow on CPU. Titles also take time to generate, sometimes a lot of time if the model gets stuck in a loop. If a separate model is used for titles, time is wasted loading it and then reloading the main model.

**Describe the solution you'd like**
Make it possible to configure a second ollama instance dedicated to titles. Query this second instance concurrently, so that the user can continue chatting while the title is generated.
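
For illustration, a minimal sketch of what the concurrent title call could look like on the backend, using the standard Ollama `/api/generate` endpoint; the `TITLE_OLLAMA_BASE_URL` setting, the title prompt, and the helper names are hypothetical, not existing Open WebUI code:

```python
import asyncio
import httpx

# Hypothetical setting: a second Ollama instance reserved for title generation.
TITLE_OLLAMA_BASE_URL = "http://localhost:11435"
TITLE_MODEL = "mistral"  # illustrative; would come from the proposed settings tab

async def generate_title(first_message: str) -> str:
    """Ask the dedicated title instance for a short chat title."""
    async with httpx.AsyncClient(base_url=TITLE_OLLAMA_BASE_URL) as client:
        resp = await client.post(
            "/api/generate",
            json={
                "model": TITLE_MODEL,
                "prompt": f"Write a 3-5 word title for this chat:\n{first_message}",
                "stream": False,
                "options": {"num_predict": 16},  # cap tokens so a looping model can't stall
            },
            timeout=30.0,
        )
        resp.raise_for_status()
        return resp.json()["response"].strip()

async def handle_chat(first_message: str) -> str:
    # Fire-and-forget: the title request runs against the second instance
    # while the main instance keeps serving the conversation.
    title_task = asyncio.create_task(generate_title(first_message))
    # ... stream the main model's reply to the user here ...
    return await title_task  # collect the title whenever it is ready
```

Capping `num_predict` would also bound the worst case described above, where the title model gets stuck in a loop.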

Things get a bit messy with models, especially custom ones. A lazy way to do this is to offer only models available in the second ollama instance. A more complete solution is to add a new settings tab for title generation, which would allow model download and customization (prompt, max tokens).

**Describe alternatives you've considered**
Llama.cpp supports multiple concurrent sessions (or "slots"), which would fix the KV cache trashing, but ollama does not support them yet. Neither llama.cpp nor ollama has support for loading and running multiple models concurrently.
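
For reference, this is roughly how the llama.cpp server exposes those slots today; flag names are from early-2024 builds and may differ in your version, and `model.gguf` is a placeholder:

```sh
# Illustrative: llama.cpp's HTTP server can keep multiple KV cache slots in a
# single model instance; the context size (-c) is divided across the slots,
# so with --parallel 2 each slot here gets 4096 tokens.
./server -m model.gguf -c 8192 --parallel 2 --cont-batching
```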


@tjbck commented on GitHub (Feb 3, 2024):

Duplicate #278


Reference: github-starred/open-webui#12157