[GH-ISSUE #7531] Poor acceleration choices with mixed GPUs #4790

Closed
opened 2026-04-12 15:45:30 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @cobrafast on GitHub (Nov 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7531

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I've noticed that Ollama makes poor decisions about acceleration in setups with heterogeneous GPUs. For example, I have a 16 GB VRAM dGPU and a 3 GB VRAM dGPU in my desktop PC, and Ollama seems to only consider the smaller GPU, even if I set `CUDA_VISIBLE_DEVICES=0` to let it compute only on the bigger one.
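For anyone reproducing this, here is a minimal standalone check (hypothetical file name `list_devices.cu`, not Ollama code) that asks the CUDA runtime which devices it can see. Running it with and without `CUDA_VISIBLE_DEVICES=0` set should show whether the device masking itself works on this machine:

```
// Hypothetical diagnostic: compile with `nvcc list_devices.cu -o list_devices`
// and run with/without CUDA_VISIBLE_DEVICES=0 to see what the runtime exposes.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("visible CUDA devices: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // totalGlobalMem is in bytes; print GiB to match the log output below.
        std::printf("  device %d: %s, %.1f GiB total, compute %d.%d\n",
                    i, prop.name,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
                    prop.major, prop.minor);
    }
    return 0;
}
```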

```
time=2024-11-03T22:28:58.806+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-03T22:28:58.806+01:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-11-03T22:28:58.806+01:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2024-11-03T22:28:59.090+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-3392b891-9899-c4e1-5fff-f56fe0c463c5 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4080" total="16.0 GiB" available="14.7 GiB"
...
time=2024-11-06T21:16:40.820+01:00 level=INFO source=server.go:105 msg="system memory" total="127.7 GiB" free="85.7 GiB" free_swap="96.8 GiB"
time=2024-11-06T21:16:40.821+01:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=2 layers.split="" memory.available="[914.5 MiB]" memory.gpu_overhead="0 B" memory.required.full="3.4 GiB" memory.required.partial="839.3 MiB" memory.required.kv="768.0 MiB" memory.required.allocations="[839.3 MiB]" memory.weights.total="2.6 GiB" memory.weights.repeating="2.6 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="128.0 MiB" memory.graph.partial="128.0 MiB"
...
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
```

If I'm reading this right, Ollama thinks there are 839 MiB of available VRAM, which seems correct for the smaller GPU, but the bigger one should have some ~15 GiB available that don't seem to be considered at all.
This seems to make Ollama split the model between CPU and GPU, or run it on the CPU entirely.
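As a sanity check on those available-VRAM numbers, a small diagnostic along these lines (hypothetical, not Ollama's actual scheduler) can query per-device free memory with `cudaMemGetInfo` and show which GPU a "most free VRAM first" policy would pick, i.e. the device a sane heterogeneous-GPU scheduler should offload to first:

```
// Hypothetical diagnostic: compile with `nvcc free_vram.cu -o free_vram`.
// Reports free/total VRAM per visible device and the "best" device under a
// largest-free-memory policy. Not Ollama code; standalone CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    int best = -1;
    size_t bestFree = 0;
    for (int i = 0; i < count; ++i) {
        cudaSetDevice(i);  // cudaMemGetInfo queries the current device
        size_t freeB = 0, totalB = 0;
        if (cudaMemGetInfo(&freeB, &totalB) != cudaSuccess)
            continue;
        std::printf("device %d: %.1f GiB free of %.1f GiB\n",
                    i, freeB / 1073741824.0, totalB / 1073741824.0);
        if (freeB > bestFree) { bestFree = freeB; best = i; }
    }
    std::printf("device with most free VRAM: %d\n", best);
    return 0;
}
```

On the setup described above, this should report ~14.7 GiB free on device 0, which is hard to square with the 914.5 MiB that the offload log shows Ollama working from.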

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.14

GiteaMirror added the nvidia and bug labels 2026-04-12 15:45:30 -05:00
Author
Owner

@cobrafast commented on GitHub (Nov 6, 2024):

Just updated to 0.4.0 and this seems fixed already. I'll keep an eye on it and report back or close accordingly.
