feat: Allow parameter control for task model #6176

Open
opened 2025-11-11 16:47:03 -06:00 by GiteaMirror · 0 comments

Originally created by @Gjarllarhorn on GitHub (Aug 22, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

When using a large model (model size > VRAM, so it spills into system memory) and selecting a separate task model, once the chat completes, the large model is ejected to load the task model (since both do not fit in memory), adding extra delays.

Desired Solution you'd like

Add advanced parameter control for the task model. The specific use case is keeping the task model loaded indefinitely and running it on CPU rather than GPU.

This would help with larger models that take up all available VRAM. When such a model is loaded, all VRAM is used and the remainder spills into system memory. If a separate task model is then selected, the large model is ejected and the task model is loaded into the released VRAM. This introduces extra delay, because the larger model has to be loaded into memory again afterwards.

By allowing control over the task model's parameters, it could be tuned to keep a very small, efficient model loaded in system memory and run on CPU, for example by setting:
• num_gpu: 0
• keep_alive: -1
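For illustration, the two parameters above map onto Ollama's generate API: `keep_alive` is a top-level request field and `num_gpu` lives under `options`. The sketch below only builds and prints the request payload a CPU-pinned, always-loaded task model would need; the model name and prompt are placeholders, and the actual HTTP call to a local Ollama server is left commented out.

```python
import json

# Hypothetical payload for Ollama's /api/generate endpoint.
# "my-task-model" and the prompt are placeholder values.
payload = {
    "model": "my-task-model",
    "prompt": "Generate a short title for this chat.",
    "stream": False,
    "keep_alive": -1,      # keep the task model loaded indefinitely
    "options": {
        "num_gpu": 0,      # offload zero layers to GPU: run on CPU only
    },
}

# To actually send it to a local Ollama server (not executed here):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# resp = urllib.request.urlopen(req)

print(json.dumps(payload))
```

With these options, the task model stays resident in system RAM and never competes with the chat model for VRAM, so the large model does not need to be ejected between turns.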

Alternatives Considered

No response

Additional Context

No response


Reference: github-starred/open-webui#6176