[GH-ISSUE #9637] Assign Different Large Models to Each GPU #68344

Closed
opened 2026-05-04 13:16:24 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @ZimaBlueee on GitHub (Mar 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9637

Hello,
I have a server with two GPUs, and I would like to run different large models on each GPU. For instance, I want the first GPU to run the Qwen model and the second GPU to run the Llama model. Could you please provide guidance on how to specify which GPU runs which model?
Thank you!

@rick-github commented on GitHub (Mar 11, 2025):

Just load the models. ollama will try to fit a model on a single GPU, so if qwen takes 60% of a GPU and llama takes 80%, ollama will load one model per GPU. If you are finding that both models fit on a single GPU and you want to spread them out, increase `num_ctx` on one of the models until the models cannot co-reside.
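One way to raise `num_ctx` is to create a model variant with a larger context window via a Modelfile. A minimal sketch, with illustrative model names (the base model and the chosen context size are assumptions, not from the thread):

```shell
# Create a variant of a model with a larger context window so it
# consumes more VRAM and can no longer share a GPU with another model.
# Model name "qwen2.5" and num_ctx value are illustrative.
cat > Modelfile <<'EOF'
FROM qwen2.5
PARAMETER num_ctx 32768
EOF
ollama create qwen-bigctx -f Modelfile
```

After this, running `qwen-bigctx` instead of the base model claims a larger slice of GPU memory.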

The other way to do this is to run multiple ollama servers and use `CUDA_VISIBLE_DEVICES` to bind a server to a GPU.
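The multi-server approach can be sketched as follows; the port numbers are arbitrary choices, and each client then targets a server by setting `OLLAMA_HOST`:

```shell
# Pin one ollama server to each GPU (ports are arbitrary examples).
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Route each model to the server (and therefore GPU) you want:
OLLAMA_HOST=127.0.0.1:11434 ollama run qwen2.5
OLLAMA_HOST=127.0.0.1:11435 ollama run llama3
```

Model names here are placeholders; substitute whatever models you actually pull.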

See #3902 for a placeholder ticket for future work on instance management.

@ZimaBlueee commented on GitHub (Mar 11, 2025):

@rick-github thanks!!


Reference: github-starred/ollama#68344