[PR #14954] iGPU: reduce memory overhead, RAM pressure guard, cap concurrent models, clarify OLLAMA_VULKAN #46188

Open
opened 2026-04-25 01:42:31 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14954
Author: @RajeshKumar11
Created: 3/19/2026
Status: 🔄 Open

Base: main ← Head: igpu-memory-improvements


📝 Commits (2)

  • c17183f iGPU: reduce overhead, add RAM pressure guard, cap concurrent models
  • 265412a iGPU: address PR review comments from guicybercode

📊 Changes

5 files changed (+99 additions, -8 deletions)

View changed files

📝 discover/runner.go (+1 -1)
📝 discover/types.go (+20 -0)
📝 envconfig/config.go (+12 -3)
📝 ml/device.go (+14 -0)
📝 server/sched.go (+52 -4)

📄 Description

Summary

Integrated GPUs (Intel Iris Xe, AMD APUs, etc.) share physical RAM with the CPU, so several scheduler decisions that make sense for discrete GPUs cause memory pressure or waste headroom on iGPU-only systems. This PR addresses four issues through the changes below; illustrative sketches of the key code paths follow the list:

  • ml/device.go: MinimumMemory() now returns 256 MiB for integrated GPUs instead of 457 MiB. iGPUs have no separate VRAM management structures, so the smaller overhead is correct (first sketch below).
  • server/sched.go: After updateFreeSpace, cap iGPU FreeMemory at 80% of current system free RAM. iGPU "VRAM" is shared physical memory; over-allocating it starves the OS and causes OOM under CPU load (second sketch below).
  • server/sched.go: When all detected GPUs are integrated and no user override is set, auto-cap maxRunners at 1 (instead of defaultModelsPerGPU=3). Multiple concurrently loaded models on an iGPU compete for the same RAM the CPU uses (third sketch below).
  • envconfig/config.go: Add the OLLAMA_IGPU_MAX_MODELS env var so users can explicitly control the concurrent-model cap on iGPU systems (0 = use the standard logic; also covered in the third sketch below).
  • envconfig/config.go + discover/runner.go: Clarify OLLAMA_VULKAN: the flag forces Vulkan over higher-priority backends rather than enabling it, since Vulkan is already auto-detected on iGPUs. OLLAMA_VULKAN=0 disables it entirely.
  • discover/types.go: Emit a selected backend log line after GPU discovery so users can confirm which compute path is active without having to parse the inference compute line (fourth sketch below).
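
A minimal sketch of the overhead split, assuming a DeviceInfo type with an Integrated flag; the real ml/device.go structures may differ:

```go
// Hypothetical sketch of the MinimumMemory change; DeviceInfo and its
// Integrated field are stand-ins, not ollama's actual ml/device.go types.
package main

import "fmt"

const (
	discreteMinimumMemory   uint64 = 457 * 1024 * 1024 // dGPU driver/VRAM bookkeeping
	integratedMinimumMemory uint64 = 256 * 1024 * 1024 // iGPUs keep no separate VRAM structures
)

type DeviceInfo struct {
	Integrated bool // true when the GPU shares physical RAM with the CPU
}

// MinimumMemory is the per-device overhead reserved before any model
// weights are scheduled onto the GPU.
func (d DeviceInfo) MinimumMemory() uint64 {
	if d.Integrated {
		return integratedMinimumMemory
	}
	return discreteMinimumMemory
}

func main() {
	fmt.Println(DeviceInfo{Integrated: true}.MinimumMemory()/(1024*1024), "MiB") // 256 MiB
}
```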
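
The RAM pressure guard could look roughly like the following; capIntegratedFreeMemory and its arguments are placeholders for the actual sched.go internals, which this mirror does not show:

```go
// Hedged sketch of the post-updateFreeSpace RAM pressure guard.
package main

import "fmt"

// capIntegratedFreeMemory limits an iGPU's reported free "VRAM" to 80% of
// the system's currently free RAM, since both draw on one physical pool.
func capIntegratedFreeMemory(gpuFree, systemFree uint64, integrated bool) uint64 {
	if !integrated {
		return gpuFree // discrete VRAM is independent of system RAM
	}
	if limit := systemFree * 80 / 100; gpuFree > limit {
		return limit // keep 20% headroom so CPU load cannot trigger OOM
	}
	return gpuFree
}

func main() {
	// 12 GiB reported free on the iGPU, but only 8 GiB of system RAM free:
	fmt.Println(capIntegratedFreeMemory(12<<30, 8<<30, true)>>30, "GiB") // 6 GiB
}
```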
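
A hedged sketch of the combined concurrency cap; resolveMaxRunners, userSet, and userMax are illustrative names, not ollama's real scheduler API, though OLLAMA_IGPU_MAX_MODELS and defaultModelsPerGPU come from the PR itself:

```go
// Illustrative sketch of the iGPU concurrent-model cap described above.
package main

import (
	"fmt"
	"os"
	"strconv"
)

const defaultModelsPerGPU = 3

// igpuMaxModels reads OLLAMA_IGPU_MAX_MODELS; 0, unset, or invalid all
// mean "fall through to the standard logic", per the description above.
func igpuMaxModels() int {
	n, err := strconv.Atoi(os.Getenv("OLLAMA_IGPU_MAX_MODELS"))
	if err != nil || n < 0 {
		return 0
	}
	return n
}

// resolveMaxRunners picks how many models may stay loaded concurrently.
// userSet reports whether the user explicitly configured a maximum.
func resolveMaxRunners(allIntegrated, userSet bool, gpuCount, userMax int) int {
	if userSet {
		return userMax
	}
	if allIntegrated {
		if n := igpuMaxModels(); n > 0 {
			return n
		}
		return 1 // loaded models on an iGPU compete with the CPU for RAM
	}
	return defaultModelsPerGPU * gpuCount
}

func main() {
	fmt.Println(resolveMaxRunners(true, false, 1, 0))  // 1 on an iGPU-only box (env unset)
	fmt.Println(resolveMaxRunners(false, false, 2, 0)) // 6 with two discrete GPUs
}
```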
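
Finally, a sketch of the discovery log line; whether ollama emits it via log/slog with exactly these keys is an assumption based on the test-plan output, not confirmed by the diff shown here:

```go
// Sketch of the "selected backend" line users can grep for instead of
// parsing the denser "inference compute" line.
package main

import "log/slog"

func logSelectedBackend(backend string) {
	slog.Info("selected backend", "backends", backend)
}

func main() {
	logSelectedBackend("Vulkan") // msg="selected backend" backends=Vulkan
}
```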

Test plan

  • Build with Vulkan on Intel Iris Xe (Windows 11, Vulkan 1.4.341)
  • Verify that selected backend backends=Vulkan appears in the server log on an iGPU-only system
  • Verify that OLLAMA_IGPU_MAX_MODELS appears in the server config log
  • Verify that iGPU free memory is capped in the debug log when system RAM is under pressure
  • Existing scheduler tests pass (go test ./server/...)

Closes #14953


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:42:31 -05:00
