[PR #10678] [MERGED] server/sched.go: Added support for grouping GPUs (significant VRAM reduction, Power-Consumption + Low-PCIe-speed performance gains) #75617

Closed
opened 2026-05-05 08:02:43 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/10678
Author: @dan-and
Created: 5/12/2025
Status: Merged
Merged: 8/11/2025
Merged by: @jessegross

Base: main ← Head: SCHED_SPREAD_compact


📝 Commits (10+)

  • 68f195c Added support for Grouping GPUs for modes based on the models RAM requirements.
  • eda0adc Merge branch 'ollama:main' into SCHED_SPREAD_compact
  • 0470916 Merge branch 'ollama:main' into SCHED_SPREAD_compact
  • 0647aff Merge branch 'ollama:main' into SCHED_SPREAD_compact
  • 8a16264 Merge branch 'ollama:main' into SCHED_SPREAD_compact
  • 9489c11 Refactored the minimal GPU subset concept as the original algorithm always calculated the overall memory requirement for all GPUs, which led to not fully utilized
  • 997f9ba Merge branch 'ollama:main' into SCHED_SPREAD_compact
  • 8f4fd18 Merge branch 'ollama:main' into SCHED_SPREAD_compact
  • 3c6cca5 Merge branch 'ollama:main' into SCHED_SPREAD_compact
  • 9c2fd94 OLLAMA_SCHED_SPREAD setting back to original true / false boolean.

📊 Changes

1 file changed (+35 additions, -25 deletions)


📝 server/sched.go (+35 -25)

📄 Description

This patch modifies Ollama to allow grouping GPUs so that their combined memory fits the requested model, instead of the former algorithm of either using one GPU or distributing over all available GPUs.

Benefits:

  • Less (PCIe-)bus communication between GPUs, which matters especially when the links are not very fast.
  • Allows unallocated GPUs to enter power-saving mode.
  • Significantly reduces VRAM allocation when using more than 2 GPUs in a system (see below).
  • Due to the reduced memory allocation, you can run more models simultaneously.

How to use:

  • OLLAMA_SCHED_SPREAD=0 (or keep it unset) to use a single GPU when the model fits on one, otherwise distribute the load evenly across all GPUs. (old default behavior)
  • OLLAMA_SCHED_SPREAD=1 (or any other value) to always distribute the load evenly across all GPUs. (old OLLAMA_SCHED_SPREAD behavior)
  • OLLAMA_SCHED_SPREAD=2 to keep the number of GPUs to a minimum and use only as many as needed to run the model (see the sketch below).
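
For illustration, here is a minimal sketch in Go of the OLLAMA_SCHED_SPREAD=2 grouping idea. This is not the actual server/sched.go change: the names gpuInfo and pickMinimalSubset are hypothetical, and a real implementation would re-estimate the memory requirement for each subset size, since per-GPU overhead grows with the number of GPUs (see the tables below).

```go
package main

import (
	"fmt"
	"sort"
)

// gpuInfo is a hypothetical stand-in for Ollama's GPU discovery info.
type gpuInfo struct {
	ID       string
	FreeVRAM uint64 // bytes
}

// pickMinimalSubset returns the smallest set of GPUs whose combined free
// VRAM covers the estimated requirement, trying the largest GPUs first so
// the GPU count stays minimal. It returns nil if the model does not fit
// even when spread across every GPU.
func pickMinimalSubset(gpus []gpuInfo, required uint64) []gpuInfo {
	sorted := append([]gpuInfo(nil), gpus...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].FreeVRAM > sorted[j].FreeVRAM
	})
	var subset []gpuInfo
	var total uint64
	for _, g := range sorted {
		subset = append(subset, g)
		total += g.FreeVRAM
		if total >= required {
			return subset
		}
	}
	return nil
}

func main() {
	const gib = uint64(1) << 30
	// Seven 12 GiB cards, as in the test rig described below.
	var gpus []gpuInfo
	for i := 0; i < 7; i++ {
		gpus = append(gpus, gpuInfo{ID: fmt.Sprintf("GPU-%d", i), FreeVRAM: 12 * gib})
	}
	// An 18 GB model plus overhead needs two cards, not all seven.
	fmt.Printf("using %d of %d GPUs\n", len(pickMinimalSubset(gpus, 18*gib)), len(gpus))
}
```

Per commit 9489c11, the key refinement was exactly that per-subset re-estimation: the original algorithm always calculated the overall memory requirement as if all GPUs were used.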

Tests and Environment:

Like so many in the home-lab community, I built my Ollama server on a budget, which resulted in a low-power-optimized server with 7 GPUs (Nvidia RTX 3060, 12GB each) and a total of 84GB VRAM.

When using models which require around 30-40 GB, all GPUs have to run, which results in higher power consumption and a huge overhead in VRAM allocation. With this patch, just a few GPUs are in use, and all the other GPUs power down to around 3 watts.
Since the server (Ryzen 5500G / Asus Prime B450) is not enterprise grade, the PCIe bus is far slower, which also slows down utilization when using too many GPUs for one single model.

Results of Memory-Allocation Overhead when spreading a model over more than 2 GPUs:

In this example, Ollama is configured with num_parallel=2 and num_ctx=8192.

Model: gemma3:4b - a2af6cc3eb7f - 3.3 GB
Overall required VRAM when spreading over:
1x GPU: 5.7 GiB
2x GPU: 7.8 GiB
3x GPU: 9.3 GiB
4x GPU: 10.8 GiB
5x GPU: 12.3 GiB
6x GPU: 13.9 GiB
7x GPU: 15.4 GiB
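
In other words, each additional GPU adds roughly (15.4 - 5.7) / 6 ≈ 1.6 GiB of per-GPU overhead on top of the single-GPU footprint.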

A more realistic situation could be qwen3:30b - 0b28110b7a33 - 18 GB, which exceeds a single home-lab GPU:
3x GPU: 27.2 GiB
4x GPU: 30.1 GiB
5x GPU: 32.9 GiB
6x GPU: 35.8 GiB
7x GPU: 38.7 GiB
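
Here the per-GPU overhead is even larger, (38.7 - 27.2) / 4 ≈ 2.9 GiB per additional GPU, so grouping the model onto 3 GPUs instead of spreading it over all 7 saves about 11.5 GiB of VRAM.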

Results of Power consumption measured "at the wall" and eval rates

Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD unset: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=1: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=2: 310 Watt with 81 fps

Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD unset: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=1: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=2: 320 Watt with 51 fps

(Yes, Qwen3:14b is far more inefficient on my system than the slightly newer 30b variant.)
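
Measured as eval rate per watt, Qwen3:14b goes from 27/450 ≈ 0.06 to 81/310 ≈ 0.26 with OLLAMA_SCHED_SPREAD=2, roughly a 4x efficiency gain, while Qwen3:30b goes from 44/393 ≈ 0.11 to 51/320 ≈ 0.16.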

gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD unset: 250 Watt with 70 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=1 : 389 Watt with 55 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=2 : 250 Watt with 70 fps

(gemma3:4b with num_ctx of 8192 also resulted in utilizing just 1 GPU with OLLAMA_SCHED_SPREAD=2, same as with OLLAMA_SCHED_SPREAD unset, since one GPU is enough to fit the whole model.)


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/ollama#75617