[GH-ISSUE #1389] Request: The ability to load multiple models into the same GPUs and running them concurrently. #62771

Closed
opened 2026-05-03 10:16:51 -05:00 by GiteaMirror · 4 comments

Originally created by @phalexo on GitHub (Dec 5, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1389

Originally assigned to: @dhiltgen on GitHub.

Currently, ollama UNLOADs the previously loaded model and loads the last model you try to use. The load is reasonably fast if you are manually entering text, but if you want to use it with AutoGen or similar, the repeated loads and unloads add latency to the system, when token generation can already be pretty slow.
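
For illustration, a minimal sketch of that churn, assuming a single Ollama server at the default `http://localhost:11434` and two already-pulled models (the model names below are placeholders):

```python
import time

import requests  # third-party: pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def timed_generate(model: str, prompt: str) -> float:
    """One non-streaming generation request against the Ollama REST API."""
    t0 = time.time()
    r = requests.post(OLLAMA_URL,
                      json={"model": model, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return time.time() - t0

# Alternating between two models makes the server swap models on every
# request, so each call pays the load/unload cost on top of generation.
for _ in range(3):
    for model in ("llama2", "mistral"):  # placeholder model names
        print(f"{model}: {timed_generate(model, 'Say hi in one word.'):.1f}s")
```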

I am going to try to separate the GPUs into different groups and run different models within each group, BUT that does not really solve the problem of resource utilization.

Thanks.


@phalexo commented on GitHub (Dec 5, 2023):

I was able to load different models into different sets of GPUs, but changing the ports manually using OLLAMA_HOST was a pain. It should fail over to the next numbered port from the default 11434 (11435, 11436, ...) if the port is already occupied.
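
A rough sketch of that workaround, plus the suggested port failover, assuming NVIDIA GPUs (`CUDA_VISIBLE_DEVICES`), `ollama` on the PATH, and placeholder GPU groupings; the failover here is done by the launcher, since ollama itself does not do it:

```python
import os
import socket
import subprocess

def next_free_port(start: int) -> int:
    """Fail over to the next numbered port, as suggested above."""
    port = start
    while True:
        with socket.socket() as s:
            if s.connect_ex(("127.0.0.1", port)) != 0:  # nothing listening here
                return port
        port += 1

port = 11434  # Ollama's default port
for gpu_group in ("0,1", "2,3"):  # placeholder GPU id groups
    port = next_free_port(port)
    env = dict(os.environ,
               CUDA_VISIBLE_DEVICES=gpu_group,   # pin this server to its GPUs
               OLLAMA_HOST=f"127.0.0.1:{port}")  # bind address for `ollama serve`
    subprocess.Popen(["ollama", "serve"], env=env)
    print(f"GPUs {gpu_group} -> ollama on port {port}")
    port += 1  # don't probe the port we just handed out
```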


@shroominic commented on GitHub (Dec 24, 2023):

Would be sick to have this, because it would enable 2 different models to have a conversation without the delay of offloading one and loading the other.


@phalexo commented on GitHub (Dec 24, 2023):

You can do it now. I have been doing it. Just use multiple Ollama servers.
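
For example, a minimal client-side sketch of that setup, assuming two Ollama servers are already running on ports 11434 and 11435 (model names and ports are placeholders; see the launcher sketch above):

```python
import requests  # third-party: pip install requests

# Placeholder model -> server mapping; one model resident per server.
SERVERS = {
    "llama2": "http://127.0.0.1:11434",
    "mistral": "http://127.0.0.1:11435",
}

def ask(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to the model's server."""
    r = requests.post(f"{SERVERS[model]}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

# The two models trade messages with no reload between turns, since
# each one stays loaded on its own server.
msg = "Introduce yourself in one sentence."
for _ in range(3):
    msg = ask("llama2", msg)
    print("llama2: ", msg)
    msg = ask("mistral", msg)
    print("mistral:", msg)
```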


@dhiltgen commented on GitHub (Mar 12, 2024):

Consolidating as a dup of #2109


Reference: github-starred/ollama#62771