[GH-ISSUE #10301] Control multi-GPU distribution for models regardless of size #32525

Closed
opened 2026-04-22 13:53:11 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @FourierMourier on GitHub (Apr 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10301

Hi!
I've noticed that when running Ollama, large models that don't fit on a single GPU are automatically distributed across both GPUs. However, models that barely fit on one GPU are not distributed across both, even when I explicitly request 2 GPUs in my deployment:

```yaml
env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0,1"
resources:
  requests:
    nvidia.com/gpu: 2
  limits:
    nvidia.com/gpu: 2
```

Is there a way to force Ollama to distribute any model across both GPUs, even when it technically fits on one? The use case is simple: if the context size is quite large, the pod restarts, and even with a smaller context it might end up processing a model that barely fits on that single GPU. Both GPUs are visible from the pod (confirmed via nvidia-smi), but I'm missing a configuration to ensure balanced distribution.

I've read in the [FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-load-models-on-multiple-gpus) that Ollama loads a model on a single GPU if it fits entirely, and only distributes across GPUs when necessary. Is there a way to override this behavior to force distribution even when it technically fits on one GPU? This would be helpful for handling larger contexts without running out of memory.


As a side question: in environments with GPUs of different VRAM capacities, is there any way to influence how ollama distributes the workload? For example, can GPUs with different VRAM capacities (e.g., 20GB and 16GB) be configured to handle proportionally different parts of the model?


Thanks


@rick-github commented on GitHub (Apr 16, 2025):

> Is there a way to force Ollama to distribute any model across both GPUs, even when it technically fits on one?

[`OLLAMA_SCHED_SPREAD`](https://github.com/ollama/ollama/blob/1e7f62cb429e5a962dd9c448e7b1b3371879e48b/envconfig/config.go#L256)

> is there any way to influence how ollama distributes the workload?

Not currently. #3902 is focused on instance management that would likely include this use case but there's been no significant work on it so far.
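For reference, a minimal sketch of how this might look in the Deployment spec from the question, assuming the same container env format; the `OLLAMA_SCHED_SPREAD` variable name comes from the linked config, the rest is illustrative:

```yaml
env:
  # Keep both GPUs visible to the Ollama container (as in the original spec)
  - name: CUDA_VISIBLE_DEVICES
    value: "0,1"
  # Ask the scheduler to spread the model across all visible GPUs,
  # even when it would fit entirely on a single one
  - name: OLLAMA_SCHED_SPREAD
    value: "true"
```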


@FourierMourier commented on GitHub (Apr 16, 2025):

Thanks for the quick response!
Just checked that `OLLAMA_SCHED_SPREAD: "true"` specifier works. I think it would be nice to highlight that in the FAQ btw

Reference: github-starred/ollama#32525