[GH-ISSUE #1288] Ollama with multiple GPUs #47175

Closed
opened 2026-04-28 03:24:33 -05:00 by GiteaMirror · 4 comments

Originally created by @technovangelist on GitHub (Nov 27, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1288

Originally assigned to: @dhiltgen on GitHub.

If you are running Ollama on a machine with multiple GPUs, inference will be slower than on the same machine with a single GPU, but still faster than on the same machine with no GPU at all. The benefit of multiple GPUs is access to more video memory, which allows for larger models, or for more of a model to be processed on the GPU.

BUT if the first GPU has enough video memory for the whole model, we should use only that one GPU, to ensure that performance is as fast as possible. Otherwise inference is slower for no good reason.

And if possible, it would be great to identify the fastest GPU and use that one first.
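
For illustration, here is a minimal sketch of the heuristic being requested, using hypothetical `gpu` and `pickGPUs` names rather than Ollama's actual scheduler: prefer the single GPU with the most free VRAM that can hold the whole model, and split across GPUs only when none fits it alone.

```go
package main

import "fmt"

// gpu and pickGPUs are hypothetical names for illustration; this is not
// Ollama's actual scheduler.
type gpu struct {
	id       int
	freeVRAM uint64 // free video memory in bytes
}

// pickGPUs returns the IDs of the GPUs a model of the given size should
// run on: the single GPU with the most free VRAM that fits the whole
// model, or, if no single GPU fits it, all GPUs so the model can be split.
func pickGPUs(gpus []gpu, modelSize uint64) []int {
	best := -1
	for i, g := range gpus {
		if g.freeVRAM >= modelSize && (best == -1 || g.freeVRAM > gpus[best].freeVRAM) {
			best = i
		}
	}
	if best >= 0 {
		// The model fits on one GPU: use it alone and avoid any
		// cross-GPU synchronisation cost.
		return []int{gpus[best].id}
	}
	// No single GPU fits the model: fall back to splitting across all GPUs.
	ids := make([]int, len(gpus))
	for i, g := range gpus {
		ids[i] = g.id
	}
	return ids
}

func main() {
	gpus := []gpu{{id: 0, freeVRAM: 16 << 30}, {id: 1, freeVRAM: 24 << 30}}
	fmt.Println(pickGPUs(gpus, 13<<30)) // [1]: fits on GPU 1 alone
	fmt.Println(pickGPUs(gpus, 30<<30)) // [0 1]: must split across both
}
```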


@peteygao commented on GitHub (Nov 28, 2023):

This is because model checkpoint synchronisation depends on the slowest GPU running in the cluster.

So you're correct: you can utilise the increased VRAM distributed across all the GPUs, but the inference speed will be bottlenecked by the speed of the _slowest_ GPU.

This should be a separate feature request: Specifying which GPUs to use when there are multiple GPUs attached to the machine.
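
One workaround available in the meantime is to mask devices at the driver level before launching the server. A minimal sketch, assuming CUDA_VISIBLE_DEVICES (NVIDIA's standard variable for hiding GPUs from a CUDA process) also constrains Ollama; the launcher itself is illustrative, not part of Ollama:

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Expose only GPU 1 to the server; GPU 0 stays free for other work.
	// CUDA_VISIBLE_DEVICES masks which GPUs any CUDA process can see.
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```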


@technovangelist commented on GitHub (Nov 28, 2023):

I hadn't really thought through the case of a machine with different kinds of GPUs, though that is what my last sentence is about. Mostly I was thinking about when I have 2-4 T4s, or a bunch of A100s, attached that are all identical. On the Discord, however, someone suggested they wanted to dedicate one GPU to work projects and another to Ollama.


@peteygao commented on GitHub (Dec 6, 2023):

I was assuming you were using a heterogeneous GPU setup, given your last sentence about identifying the faster GPU.

If your model can fit inside a single GPU, that will yield the maximum performance since there's zero latency penalty from synchronising with other GPUs.

A multi-GPU setup is only useful in two scenarios:

  1. Increasing throughput by running parallel inferences, one inference per GPU (assuming the model fits entirely into each GPU's VRAM); see the sketch after this list.
  2. Running larger-parameter models by splitting the tensors across the GPUs. You'll get less throughput than with a single "large" GPU, but at least you can run larger models. You lose less throughput if the GPUs are connected via NVLink rather than over the PCIe bus, but I guess you are using a cloud provider and have no way of controlling the physical hardware.
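
For scenario 1, a minimal launcher sketch, assuming two identical GPUs and that the server honours the standard CUDA_VISIBLE_DEVICES device mask and Ollama's OLLAMA_HOST bind-address variable; it starts one `ollama serve` per GPU on its own port, leaving request distribution to the client:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	const numGPUs = 2 // assumed GPU count; adjust for your machine
	for i := 0; i < numGPUs; i++ {
		cmd := exec.Command("ollama", "serve")
		cmd.Env = append(os.Environ(),
			// Pin each server to a single GPU...
			fmt.Sprintf("CUDA_VISIBLE_DEVICES=%d", i),
			// ...and give it its own port (11434 is Ollama's default).
			fmt.Sprintf("OLLAMA_HOST=127.0.0.1:%d", 11434+i),
		)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		// Start, not Run: the servers must run concurrently.
		if err := cmd.Start(); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	select {} // keep the launcher alive while the servers run
}
```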

@dhiltgen commented on GitHub (Mar 12, 2024):

Let's track this optimization with #1656

