[GH-ISSUE #8874] How to dynamically load a model onto specified GPUs without setting 'CUDA_VISIBLE_DEVICES'? #5752

Closed
opened 2026-04-12 17:04:05 -05:00 by GiteaMirror · 4 comments

Originally created by @17Reset on GitHub (Feb 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8874

System Configuration:
I have 4 GPUs, each with 50GB of VRAM.
I need to load a model that requires 64GB of VRAM.
While installing ollama, I set CUDA_VISIBLE_DEVICES=0,1,2,3 to use all 4 GPUs.

Problem: When I load the model onto the GPUs, the 64GB of VRAM is evenly distributed across all 4 GPUs, which results in approximately 16GB of VRAM being used per GPU. However, in reality, only 2 GPUs are necessary to load the model fully. I want to know if there is a way to specify that only 2 GPUs should be used without modifying the CUDA_VISIBLE_DEVICES=0,1,2,3 setting.

Additional Context: Based on my understanding, using just 2 GPUs should provide better efficiency for inference compared to using all 4 GPUs. Is there any way to enforce this without reconfiguring CUDA_VISIBLE_DEVICES?
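
For reference, the even split described above can be confirmed by querying per-GPU memory usage. A minimal sketch using nvidia-smi's CSV query interface (the 4-GPU / ~16GB-each figures come from the setup above):

```python
import subprocess

# Ask nvidia-smi for per-GPU memory usage in machine-readable CSV form.
out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    text=True,
)

for line in out.strip().splitlines():
    idx, used, total = (field.strip() for field in line.split(","))
    print(f"GPU {idx}: {used} MiB / {total} MiB used")
```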

GiteaMirror added the question label 2026-04-12 17:04:05 -05:00

@rick-github commented on GitHub (Feb 6, 2025):

> Based on my understanding, using just 2 GPUs should provide better efficiency for inference compared to using all 4 GPUs.

This is true, but the difference will hardly be noticeable.

Currently, the only way of achieving this is to run [multiple ollama servers](https://github.com/ollama/ollama/issues/8430#issuecomment-2596293633).
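
For anyone landing here, a minimal sketch of that workaround: launch two ollama servers, each pinned to a pair of GPUs via CUDA_VISIBLE_DEVICES and listening on its own OLLAMA_HOST address. The ports and GPU pairings below are illustrative, not taken from the linked comment:

```python
import os
import subprocess

# Each server only sees the GPUs in its CUDA_VISIBLE_DEVICES, so a model
# loaded through it is spread across those GPUs alone.
servers = [
    {"gpus": "0,1", "host": "127.0.0.1:11435"},
    {"gpus": "2,3", "host": "127.0.0.1:11436"},
]

procs = []
for s in servers:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = s["gpus"]
    env["OLLAMA_HOST"] = s["host"]  # address this instance binds to
    procs.append(subprocess.Popen(["ollama", "serve"], env=env))

# Clients then pick a GPU pair by picking a server, e.g.:
#   OLLAMA_HOST=127.0.0.1:11435 ollama run <model>
for p in procs:
    p.wait()
```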


@LeisureLinux commented on GitHub (Feb 6, 2025):

Before you put on the payload, you can check which GPUs are currently idle and place your load there. It only takes a few additional lines of code.
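
A minimal sketch of that idea, using NVML via the pynvml package (the library choice is an assumption; the comment doesn't name one) to pick the GPUs with the most free VRAM:

```python
import pynvml

def idlest_gpus(n: int) -> list[int]:
    """Return the indices of the n GPUs with the most free memory."""
    pynvml.nvmlInit()
    try:
        free = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            free.append((mem.free, i))
        # Sort by free memory, most first, and take the top n indices.
        return [i for _, i in sorted(free, reverse=True)[:n]]
    finally:
        pynvml.nvmlShutdown()

# e.g. build a CUDA_VISIBLE_DEVICES value pinning a new server instance
# to the two idlest GPUs:
visible = ",".join(map(str, idlest_gpus(2)))
print(visible)
```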


@17Reset commented on GitHub (Feb 7, 2025):

> Before you put on the payload, you can check which GPUs are currently idle and place your load there. It only takes a few additional lines of code.

Does Ollama allow setting which GPU to run on before running? Other than the 'CUDA_VISIBLE_DEVICES' setting in the documentation, I didn't see a way to set where a model is loaded before making the call. My requirement is that once the ollama server is deployed, the front-end calls stay concise.


@rick-github commented on GitHub (Feb 7, 2025):

> Does Ollama allow setting which GPU to run on before running?

No.
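
Since the server offers no per-request GPU placement, the multi-server workaround above is the way to keep front-end calls concise: hide the server choice behind a thin dispatch helper. A minimal sketch, assuming the two servers from the earlier example and the standard /api/ps and /api/generate endpoints (the model name and choose-by-loaded-model logic are illustrative):

```python
import requests

SERVERS = ["http://127.0.0.1:11435", "http://127.0.0.1:11436"]

def generate(model: str, prompt: str) -> str:
    # Prefer a server that already has the model loaded; /api/ps lists
    # the models currently running on an instance.
    target = SERVERS[0]
    for base in SERVERS:
        ps = requests.get(f"{base}/api/ps").json()
        if any(m["name"] == model for m in ps.get("models", [])):
            target = base
            break
    r = requests.post(
        f"{target}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    return r.json()["response"]

print(generate("llama3", "Hello"))
```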
