[GH-ISSUE #3507] Switching dynamically between multiple LLM models on VRAM #27920

Closed
opened 2026-04-22 05:34:33 -05:00 by GiteaMirror · 3 comments

Originally created by @Q-point on GitHub (Apr 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3507

Originally assigned to: @dhiltgen on GitHub.

What are you trying to do?

At the moment, Ollama loads LLMs one at a time. It should be possible to have multiple LLMs resident in VRAM and switch dynamically between them.

How should we solve this?

  1. Check whether the requested models fit within the current hardware VRAM budget.
  2. Load multiple models into VRAM.
  3. Augment the API to switch dynamically between the loaded models (see the sketch after this list).
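
A minimal sketch of what the switching in step 3 could look like from a client's side, assuming a server that can keep more than one model resident (Ollama later added an `OLLAMA_MAX_LOADED_MODELS` server setting for this). The model names are placeholders; `keep_alive` is an existing parameter of the `/api/generate` request that controls how long a model stays loaded:

```python
import requests

OLLAMA_GENERATE = "http://localhost:11434/api/generate"

def generate(model: str, prompt: str) -> str:
    """Non-streaming generate request that pins the model in memory."""
    resp = requests.post(OLLAMA_GENERATE, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        # keep_alive=-1 asks the server to keep this model loaded indefinitely,
        # so a later switch back to it does not pay the load cost again.
        "keep_alive": -1,
    })
    resp.raise_for_status()
    return resp.json()["response"]

# Alternate between two models. With multi-model support both stay in VRAM,
# so only the first call to each model triggers a load.
print(generate("llama3", "Summarize the VRAM budget idea."))
print(generate("mistral", "Summarize the VRAM budget idea."))
print(generate("llama3", "Now expand on it."))
```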

What is the impact of not solving this?

The latency we currently incur from loading and unloading models every time a pipeline switches between them.

Anything else?

No response

GiteaMirror added the gpu and feature request labels 2026-04-22 05:34:34 -05:00

@BradKML commented on GitHub (Apr 9, 2024):

Seconding this for the use case of multi-agent workflows

@davidearlyoung commented on GitHub (Apr 9, 2024):

I second this as well, because I want to have both an embedding model and an instruction-tuned model loaded at the same time.
I see this being useful for tasks such as dynamic/live RAG systems (sketched below).
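
A rough sketch of that kind of pipeline, assuming both models can stay resident at once. The model names and the brute-force cosine retrieval are illustrative; `/api/embeddings` and `/api/generate` are existing Ollama endpoints:

```python
import requests

BASE = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # The embedding model would stay loaded alongside the chat model.
    r = requests.post(f"{BASE}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def answer(context: str, question: str) -> str:
    r = requests.post(f"{BASE}/api/generate", json={
        "model": "llama3",
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,
    })
    r.raise_for_status()
    return r.json()["response"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

docs = ["Ollama keeps one model loaded at a time.",
        "keep_alive controls how long a model stays in memory."]
question = "How do I keep a model in memory?"
qv = embed(question)
best = max(docs, key=lambda d: cosine(embed(d), qv))
print(answer(best, question))
```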

Also, I needed this feature a few days ago when I wanted to load and compare two different quantized versions of the same model.
I wanted to see how much the next-token predictions deviated between the two quantizations from the same starting prompt, aiming for the same seed, a high temperature, and greedy-like sampling options for each test (see the sketch below).
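
A hedged sketch of that comparison, assuming two quantization tags of the same base model are pulled locally (the tags below are hypothetical examples). `seed`, `temperature`, `top_k`, and `num_predict` are existing Ollama sampling options passed via the `options` field:

```python
import requests

def sample(model: str, prompt: str) -> str:
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "seed": 42,          # same seed for both runs
            "temperature": 1.5,  # high temperature, as described above
            "top_k": 1,          # greedy-like: always take the top token
            "num_predict": 64,
        },
    })
    r.raise_for_status()
    return r.json()["response"]

prompt = "Once upon a time"
# Hypothetical quantization tags of the same base model.
for tag in ("llama3:8b-instruct-q4_0", "llama3:8b-instruct-q8_0"):
    print(tag, "->", sample(tag, prompt))
```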

Most embedding models are small compared to regular models, so this seems reasonable; I know it would be for my setup.

As Ollama currently stands, rapidly switching between two different models in a pipeline seems excessive: it wastes power, adds delay, and likely causes more wear on the hardware.

@pdevine commented on GitHub (Apr 12, 2024):

This is coming soon (along with concurrency).
