[PR #10678] server/sched.go: Added support for grouping GPUs (significant VRAM reduction, power-consumption and low-PCIe-speed performance gains) #13323

Opened 2026-04-13 00:23:48 -05:00 by GiteaMirror

Original Pull Request: https://github.com/ollama/ollama/pull/10678

State: closed
Merged: Yes


This patch modifies Ollama to allow grouping GPUs so that the group's combined memory fits the requested model, instead of the former algorithm of either using one GPU or distributing over all available GPUs.
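For illustration, here is a minimal Go sketch of the grouping idea: pick the smallest set of GPUs whose combined free VRAM covers the model's memory estimate. Types and names are hypothetical stand-ins, not the actual identifiers used in server/sched.go.

```go
package sched

import "sort"

// gpuInfo is a hypothetical stand-in for Ollama's internal GPU descriptor.
type gpuInfo struct {
	id   int
	free uint64 // free VRAM in bytes
}

// pickGPUGroup returns the smallest group of GPUs (preferring those with the
// most free VRAM) whose combined free memory covers the model's estimated
// requirement, or nil if the model does not fit even across all GPUs.
func pickGPUGroup(gpus []gpuInfo, required uint64) []gpuInfo {
	sorted := append([]gpuInfo(nil), gpus...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].free > sorted[j].free })

	var group []gpuInfo
	var total uint64
	for _, g := range sorted {
		group = append(group, g)
		total += g.free
		if total >= required {
			return group // the first prefix that fits is the minimal group
		}
	}
	return nil
}
```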

Benefits:

  • Less (PCIe-)bus communication between GPUs, especially when the links are not very fast.
  • Unallocated GPUs can enter power-saving mode.
  • Significantly reduced VRAM allocation when using more than 2 GPUs in a system (see below).
  • Because of the reduced memory allocation, more models can run simultaneously.

How to use:

  • OLLAMA_SCHED_SPREAD=0 (or unset): use only one GPU if the model fits; otherwise evenly distribute the load across all GPUs (old default behavior).
  • OLLAMA_SCHED_SPREAD=1 (or any other value): always evenly distribute the load across all GPUs (old behavior).
  • OLLAMA_SCHED_SPREAD=2: keep the number of GPUs to a minimum and use only as many as needed to run the model (see the sketch below).
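As a rough Go sketch of how the three modes could be distinguished (the PR's actual parsing may differ; the names here are illustrative):

```go
package sched

import "os"

// placement is a hypothetical enum for the three scheduling modes.
type placement int

const (
	placementOneThenSpread placement = iota // unset/0: one GPU if the model fits, else spread over all
	placementSpreadAll                      // 1 (or any other value): always spread over all GPUs
	placementMinimalGroup                   // 2: smallest group of GPUs that fits the model
)

// placementFromEnv maps OLLAMA_SCHED_SPREAD onto a placement strategy.
func placementFromEnv() placement {
	switch os.Getenv("OLLAMA_SCHED_SPREAD") {
	case "", "0":
		return placementOneThenSpread
	case "2":
		return placementMinimalGroup
	default:
		return placementSpreadAll
	}
}
```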

Tests and Environment:

Like so many builds in the home-lab community, my Ollama server is built on a budget: a low-power-optimized machine with 7 GPUs (Nvidia RTX 3060, 12 GB each) for a total of 84 GB of VRAM.

When using models which require around 30-40 GB, all GPUs have to run, which results in higher power consumption and a huge overhead in VRAM allocation. With this patch, only a few GPUs are in use, and all the others power down to around 3 watts each.
Since the server (Ryzen 5500G / Asus Prime B450) is not enterprise grade, the PCIe bus is far slower, which also limits utilization when a single model is spread across too many GPUs.

Results of memory-allocation overhead when spreading a model over multiple GPUs:

In this example, Ollama is configured with num_parallel=2 and num_ctx=8192.
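For reproducibility, here is one way such a configuration could be applied (a hedged sketch, not part of the PR): num_ctx is passed per request through Ollama's HTTP API, while num_parallel corresponds to the server-side OLLAMA_NUM_PARALLEL environment variable.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// num_ctx is a per-request option in Ollama's HTTP API; num_parallel is
	// controlled on the server via the OLLAMA_NUM_PARALLEL environment variable.
	body := []byte(`{"model": "gemma3:4b", "prompt": "hello", "stream": false,
		"options": {"num_ctx": 8192}}`)
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```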

Model: gemma3:4b - a2af6cc3eb7f - 3.3 GB
Overall required VRAM when spreading over:
1x GPU: 5.7 GiB
2x GPU: 7.8 GiB
3x GPU: 9.3 GiB
4x GPU: 10.8 GiB
5x GPU: 12.3 GiB
6x GPU: 13.9 GiB
7x GPU: 15.4 GiB

A more realistic case is qwen3:30b - 0b28110b7a33 - 18 GB, which exceeds any single home-lab GPU. Overall required VRAM when spreading over:
3x GPU: 27.2 GiB
4x GPU: 30.1 GiB
5x GPU: 32.9 GiB
6x GPU: 35.8 GiB
7x GPU: 38.7 GiB
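From these two tables, the per-GPU spreading overhead can be estimated: for gemma3:4b, (15.4 − 5.7) GiB / 6 extra GPUs ≈ 1.6 GiB per additional GPU; for qwen3:30b, (38.7 − 27.2) GiB / 4 ≈ 2.9 GiB per additional GPU. The near-linear growth suggests a roughly constant per-device allocation (e.g., per-GPU compute buffers), which is exactly what OLLAMA_SCHED_SPREAD=2 avoids by keeping the group minimal.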

Results of power consumption measured "at the wall" and eval rates:

Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD unset: 450 W at 27 tokens/s
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=1: 450 W at 27 tokens/s
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=2: 310 W at 81 tokens/s

Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD unset: 393 W at 44 tokens/s
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=1: 393 W at 44 tokens/s
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=2: 320 W at 51 tokens/s

(Yes, Qwen3:14b is far less efficient on my system than the slightly newer 30b variant.)

gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD unset: 250 W at 70 tokens/s
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=1: 389 W at 55 tokens/s
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=2: 250 W at 70 tokens/s

(With num_ctx=8192, gemma3:4b used just one GPU under OLLAMA_SCHED_SPREAD=2, matching the behavior with OLLAMA_SCHED_SPREAD unset, since a single GPU is enough to fit the whole model.)
