[GH-ISSUE #638] Separate ollama instance for titles #12157

Closed
opened 2026-04-19 18:58:55 -05:00 by GiteaMirror · 1 comment

Originally created by @robertvazan on GitHub (Feb 3, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/638

**Is your feature request related to a problem? Please describe.**
Auto-generated titles cause serious performance problems. They trash the KV cache in llama.cpp, so the whole prompt and the first response have to be processed again after a follow-up question, which is slow on CPU. Titles also take time to generate, sometimes a lot of time if the model gets stuck in a loop. If a separate model is used for titles, time is wasted loading it and then reloading the main model.

**Describe the solution you'd like**
Make it possible to configure a second ollama instance dedicated to titles. Query this second instance concurrently, so that the user can continue chatting while the title is generated.
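
For illustration, a minimal sketch of what the concurrent title call could look like on the backend, using the standard Ollama `/api/generate` endpoint; the `TITLE_OLLAMA_BASE_URL` setting, the title prompt, and the helper names are hypothetical, not existing Open WebUI code:

```python
import asyncio
import httpx

# Hypothetical setting: a second Ollama instance reserved for title generation.
TITLE_OLLAMA_BASE_URL = "http://localhost:11435"
TITLE_MODEL = "mistral"  # illustrative; would come from the proposed settings tab

async def generate_title(first_message: str) -> str:
    """Ask the dedicated title instance for a short chat title."""
    async with httpx.AsyncClient(base_url=TITLE_OLLAMA_BASE_URL) as client:
        resp = await client.post(
            "/api/generate",
            json={
                "model": TITLE_MODEL,
                "prompt": f"Write a 3-5 word title for this chat:\n{first_message}",
                "stream": False,
                "options": {"num_predict": 16},  # cap tokens so a looping model can't stall
            },
            timeout=30.0,
        )
        resp.raise_for_status()
        return resp.json()["response"].strip()

async def handle_chat(first_message: str) -> str:
    # Fire-and-forget: the title request runs against the second instance
    # while the main instance keeps serving the conversation.
    title_task = asyncio.create_task(generate_title(first_message))
    # ... stream the main model's reply to the user here ...
    return await title_task  # collect the title whenever it is ready
```

Capping `num_predict` would also bound the worst case described above, where the title model gets stuck in a loop.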

Things get a bit messy with models, especially custom ones. A lazy way to do this is to offer only models available in the second ollama instance. A more complete solution is to add a new settings tab for title generation, which would allow model download and customization (prompt, max tokens).

**Describe alternatives you've considered**
Llama.cpp supports multiple concurrent sessions (or "slots"), which would fix the KV cache trashing, but ollama does not support them yet. Neither llama.cpp nor ollama has support for loading and running multiple models concurrently.
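
For reference, this is roughly how the llama.cpp server exposes those slots today; flag names are from early-2024 builds and may differ in your version, and `model.gguf` is a placeholder:

```sh
# Illustrative: llama.cpp's HTTP server can keep multiple KV cache slots in a
# single model instance; the context size (-c) is divided across the slots,
# so with --parallel 2 each slot here gets 4096 tokens.
./server -m model.gguf -c 8192 --parallel 2 --cont-batching
```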


@tjbck commented on GitHub (Feb 3, 2024):

Duplicate #278


Reference: github-starred/open-webui#12157