[GH-ISSUE #7351] Can the device used for inference be specified via environment variables or other simple methods (in Docker and server deployments)? #51183

Closed
opened 2026-04-28 18:53:27 -05:00 by GiteaMirror · 5 comments

Originally created by @somnifex on GitHub (Oct 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7351

I am working with a setup that has multiple GPU compute cards. I want to utilize the video memory of all the cards, but use only one specific card for the inference computation, so that the other cards can be kept for separate purposes and conflicts are avoided.

GiteaMirror added the feature request label 2026-04-28 18:53:27 -05:00

@rick-github commented on GitHub (Oct 25, 2024):

https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection
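
For the Docker and server deployment case the issue title asks about, a minimal sketch of how that GPU-selection doc applies (image, port, and volume per the standard Ollama Docker instructions; the device IDs are just an example):

```sh
# Expose only GPU 0 to the Ollama container: both compute and VRAM
# allocation are then limited to that device.
docker run -d --gpus '"device=0"' \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# Equivalent with all GPUs exposed to the container but masked inside it
# via the environment variable from the linked doc:
docker run -d --gpus all -e CUDA_VISIBLE_DEVICES=0 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama
```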


@somnifex commented on GitHub (Oct 25, 2024):

> https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection

I am not sure whether my idea can be achieved natively through NVIDIA's tooling. For example, I have GPUs 0, 1, 2 and 3, and I can load the entire model across the memory of all four cards to ensure that a large model loads successfully. However, I want to use only a specific GPU (for instance, id=0) for the inference computation. If I implement this using --gpus=device or CUDA_VISIBLE_DEVICES, it restricts both the compute and the memory to the allowed devices (if I remember correctly).
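
That matches how the visibility mask behaves: devices hidden by CUDA_VISIBLE_DEVICES simply do not exist for that process, for compute or for memory allocation. A quick way to confirm it, assuming any CUDA-aware program is handy (here a PyTorch one-liner):

```sh
# On a 4-GPU machine, only device 0 is visible to this process; the other
# three are unavailable for kernels *and* for allocations.
CUDA_VISIBLE_DEVICES=0 python3 -c "import torch; print(torch.cuda.device_count())"
# prints: 1
```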


@somnifex commented on GitHub (Oct 25, 2024):

> https://github.com/ollama/ollama/blob/main/docs/gpu.md#gpu-selection

The purpose of doing this is that I want to use the computing power of GPUs 1, 2 and 3 for other tasks that need compute but do not consume much video memory. Their spare VRAM could then be used by the Ollama task on GPU 0. Separating the tasks this way would make more efficient use of the GPU resources.
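
For comparison, the separation that the environment variable supports today is per-process, roughly as sketched below (the second command is a hypothetical stand-in for the other workloads). What is being asked for here goes a step further: letting the GPU 0 process also borrow the spare VRAM of GPUs 1-3.

```sh
# Conventional per-process split: Ollama pinned to GPU 0, other jobs to GPUs 1-3.
# Each process can only compute on -- and allocate from -- the devices it sees.
CUDA_VISIBLE_DEVICES=0 ollama serve &
CUDA_VISIBLE_DEVICES=1,2,3 python3 other_task.py   # hypothetical other workload
```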


@rick-github commented on GitHub (Oct 25, 2024):

Unless the cards are connected with NVLink, there is no way for GPU0 to efficiently use the VRAM in GPU{1,2,3}.
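
As an aside, whether the cards actually have NVLink between them can be checked from the topology matrix; NV# entries indicate NVLink, while PIX/PHB/SYS indicate PCIe or host paths:

```sh
# Show the GPU-to-GPU interconnect topology (NVLink vs. PCIe).
nvidia-smi topo -m
```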


@somnifex commented on GitHub (Oct 25, 2024):

> Unless the cards are connected with NVLink, there is no way for GPU0 to efficiently use the VRAM in GPU{1,2,3}.

From a speed perspective, that's true. But honestly, I want to get it working at all first and consider optimization later. After all, whether it can run is a very different question from how fast it runs. lol
