[GH-ISSUE #14953] iGPU: reduce memory overhead, add RAM pressure guard, cap concurrent models, clarify OLLAMA_VULKAN #35376

Open
opened 2026-04-22 19:51:23 -05:00 by GiteaMirror · 2 comments

Originally created by @RajeshKumar11 on GitHub (Mar 19, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14953

Problem

Integrated GPUs (iGPU — Intel Iris Xe, AMD APU, etc.) share physical RAM with the CPU. The current Ollama scheduler treats iGPU the same as a discrete GPU in several places, causing:

  1. Over-reserved memory overhead — `MinimumMemory()` reserves 457 MiB on all non-Metal backends, including iGPU. Since iGPU has no separate VRAM management structures, this wastes headroom that could be used to offload more model layers.

  2. No system RAM pressure guard — iGPU `FreeMemory` is reported as available VRAM, but this is shared with the OS and CPU processes. Loading a large model can exhaust physical RAM and cause OOM crashes under CPU load (a worked example follows this list).

  3. Too many concurrent models by default — `defaultModelsPerGPU = 3` allows 3 models loaded simultaneously. On iGPU-only systems this multiplies RAM pressure since all loaded models share the same physical memory pool.

  4. Misleading `OLLAMA_VULKAN` log — On iGPU-only systems Vulkan is auto-selected, but the server config log shows `OLLAMA_VULKAN:false`, making users think Vulkan is not active. There is also no log line explicitly stating which backend was chosen.
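
To make the RAM-pressure concern concrete, here is a minimal sketch of the arithmetic (not Ollama code; all sizes and the variable names are illustrative assumptions for a 16 GiB iGPU-only laptop):

```go
package main

import "fmt"

const (
	MiB = 1 << 20
	GiB = 1 << 30
)

func main() {
	// Illustrative numbers only: the iGPU reports a large share of physical RAM
	// as "free VRAM", but that memory is the same pool the OS and CPU processes
	// are already using.
	systemFreeRAM := uint64(10 * GiB) // physical RAM currently free
	igpuFreeMemory := uint64(8 * GiB) // FreeMemory reported for the iGPU

	modelSize := uint64(4 * GiB)  // weights + KV cache per loaded model
	overhead := uint64(457 * MiB) // flat per-backend reservation
	modelsPerGPU := uint64(3)     // defaultModelsPerGPU

	// Scheduler's view: each model fits comfortably inside the reported VRAM.
	fmt.Printf("per-model need %d MiB vs reported VRAM %d MiB\n",
		(modelSize+overhead)/MiB, igpuFreeMemory/MiB)

	// Physical view: three resident models all draw from the shared pool.
	needed := modelsPerGPU * (modelSize + overhead)
	fmt.Printf("total needed %d MiB vs free RAM %d MiB -> over-commit: %v\n",
		needed/MiB, systemFreeRAM/MiB, needed > systemFreeRAM)
}
```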

Related issues

  • #13023 — Intel Iris Xe not detected (0 VRAM)
  • #13029 — Vulkan fails to allocate memory buffer
  • #12223 — OLLAMA_GPU_OVERHEAD not respected
  • #13212 — OLLAMA_VULKAN=0 has no effect
  • #11748 — No shared-memory offload when VRAM full

Proposed fix

  • `ml/device.go`: Return 256 MiB overhead for integrated GPUs (vs 457 MiB for discrete)
  • `server/sched.go`: After `updateFreeSpace`, cap iGPU `FreeMemory` at 80% of current system free RAM (see the sketch after this list)
  • `server/sched.go`: When all GPUs are integrated and no user override is set, auto-cap `maxRunners` at 1
  • `envconfig/config.go`: Add `OLLAMA_IGPU_MAX_MODELS` env var for user override of the concurrent model cap
  • `envconfig/config.go` + `discover/runner.go`: Clarify `OLLAMA_VULKAN` docs — it forces Vulkan, not enables it
  • `discover/types.go`: Emit `selected backend` log line after GPU discovery
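
A minimal Go sketch of the two scheduler-side guards. The `gpu` struct, its fields, and the helper names below are hypothetical and only illustrate the proposed logic, not Ollama's actual scheduler types; `OLLAMA_IGPU_MAX_MODELS` is the env var this issue proposes and does not exist yet:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// gpu is a stand-in for the scheduler's per-device record (hypothetical type).
type gpu struct {
	Name       string
	Integrated bool
	FreeMemory uint64 // bytes the backend reports as free
}

// capIGPUFreeMemory clamps an iGPU's reported free memory to 80% of the
// system's currently free RAM, since both draw from the same physical pool.
func capIGPUFreeMemory(gpus []gpu, systemFreeRAM uint64) {
	limit := systemFreeRAM * 8 / 10
	for i := range gpus {
		if gpus[i].Integrated && gpus[i].FreeMemory > limit {
			gpus[i].FreeMemory = limit
		}
	}
}

// maxRunners returns the concurrent-model cap: 1 when every GPU is integrated,
// unless the proposed OLLAMA_IGPU_MAX_MODELS override is set.
func maxRunners(gpus []gpu, defaultPerGPU int) int {
	allIntegrated := len(gpus) > 0
	for _, g := range gpus {
		if !g.Integrated {
			allIntegrated = false
			break
		}
	}
	if !allIntegrated {
		return defaultPerGPU * len(gpus)
	}
	if v, err := strconv.Atoi(os.Getenv("OLLAMA_IGPU_MAX_MODELS")); err == nil && v > 0 {
		return v
	}
	return 1
}

func main() {
	gpus := []gpu{{Name: "Intel Iris Xe", Integrated: true, FreeMemory: 8 << 30}}
	capIGPUFreeMemory(gpus, 6<<30) // pretend 6 GiB of RAM is currently free
	fmt.Println(gpus[0].FreeMemory>>20, "MiB usable,", maxRunners(gpus, 3), "runner(s)")
}
```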

Test environment

  • Intel Core Ultra 7 155H, Intel Iris Xe Graphics (iGPU)
  • Windows 11, Vulkan 1.4.341, Ollama built from source

@rick-github commented on GitHub (Mar 19, 2026):

  2. No system RAM pressure guard — iGPU `FreeMemory` is reported as available VRAM, but this is shared with the OS and CPU processes. Loading a large model can exhaust physical RAM and cause OOM crashes under CPU load.

If a large model would have exhausted physical RAM when loaded to be processed by the iGPU, forcing it into physical RAM to be processed by the CPU is not going to help.

  3. Too many concurrent models by default — `defaultModelsPerGPU = 3` allows 3 models loaded simultaneously. On iGPU-only systems this multiplies RAM pressure since all loaded models share the same physical memory pool.

Set OLLAMA_MAX_LOADED_MODELS.

  4. Misleading `OLLAMA_VULKAN` log — On iGPU-only systems Vulkan is auto-selected, but the server config log shows `OLLAMA_VULKAN:false`, making users think Vulkan is not active.

This line expressly removes the Vulkan library from consideration if `OLLAMA_VULKAN` is not set to true:

https://github.com/ollama/ollama/blob/96e36c0d90b1da23304658a2ba90784b4a1c822d/discover/runner.go#L105

If ROCm or Nvidia backends are not found, a machine with an iGPU will fall back to CPU if Vulkan is not enabled.
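
To illustrate the gate described above, the pattern is roughly the following. This is a simplified sketch, not the actual code in discover/runner.go; the function name and the env-var parsing here are assumptions for illustration only:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// selectBackends drops the Vulkan backend unless OLLAMA_VULKAN parses as true,
// i.e. the flag opts Vulkan in rather than being enabled by default.
func selectBackends(available []string) []string {
	vulkanEnabled, _ := strconv.ParseBool(os.Getenv("OLLAMA_VULKAN"))
	var selected []string
	for _, b := range available {
		if b == "vulkan" && !vulkanEnabled {
			continue
		}
		selected = append(selected, b)
	}
	return selected
}

func main() {
	// With only an iGPU, removing "vulkan" leaves nothing but the CPU path.
	fmt.Println(selectBackends([]string{"vulkan", "cpu"}))
}
```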

  There is also no log line explicitly stating which backend was chosen.

Chosen backends are listed in the `inference compute` line.

  Related issues
  • #13023 — Intel Iris Xe not detected (0 VRAM)

Not related to the issues above.

  • #13029 — Vulkan fails to allocate memory buffer

Possibly resolved in 0.13.2+.

  • #12223 — OLLAMA_GPU_OVERHEAD not respected

User error.

  • #13212 — OLLAMA_VULKAN=0 has no effect

User error.

  • #11748 — No shared-memory offload when VRAM full

Presumably resolved by upgrading.


@rick-github commented on GitHub (Mar 21, 2026):

@CastelDazur Please don't post AI slop.

Reference: github-starred/ollama#35376