[PR #13042] [MERGED] Prefer dedicated GPUs over iGPUs when offloading #39918

Closed
opened 2026-04-23 00:56:14 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13042
Author: @jessegross
Created: 11/11/2025
Status: Merged
Merged: 11/11/2025
Merged by: @jessegross

Base: main ← Head: jessegross/igpu


📝 Commits (4)

  • c0b1f79 llamarunner: Respect device ordering for offloaded layers
  • a51cff6 llm: Use Ollama engine memory layouts for both old and new engines
  • 84ff4dc llm: Separate llamaServer and ollamaServer code paths
  • fe5220b llm: Prefer dedicated GPUs over iGPUs when allocating memory

📊 Changes

9 files changed (+437 additions, -1007 deletions)

View changed files

📝 fs/ggml/ggml.go (+0 -67)
📝 llama/llama.go (+29 -5)
➖ llm/memory.go (+0 -516)
➖ llm/memory_test.go (+0 -130)
📝 llm/server.go (+269 -254)
📝 llm/server_test.go (+60 -29)
📝 ml/device.go (+50 -0)
📝 runner/llamarunner/runner.go (+12 -6)
📝 server/sched.go (+17 -0)

📄 Description

Currently, when we split a model across multiple GPUs, layers are distributed in proportion to each GPU's free VRAM. This works well when the GPUs have roughly equal performance, but it breaks down with a mix of dedicated GPUs and iGPUs: iGPUs often have large amounts of memory (shared system RAM) but much lower performance. Instead, we should prefer the dedicated GPUs and use iGPUs only for layers that would otherwise go to the CPU.
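
A minimal sketch of that allocation preference (the types and function names here are illustrative, not the PR's actual code): fill the dedicated GPUs first, then let iGPUs absorb only the layers that would otherwise fall back to the CPU.

```go
package main

import "fmt"

// device is a hypothetical descriptor for a GPU; field names are
// illustrative only and do not mirror Ollama's internal types.
type device struct {
	name       string
	integrated bool   // true for an iGPU sharing system RAM
	freeVRAM   uint64 // bytes available for model layers
}

// assignLayers places numLayers layers of layerSize bytes onto devices,
// preferring dedicated GPUs and using iGPUs only for the overflow that
// would otherwise be offloaded to the CPU.
func assignLayers(devices []device, numLayers int, layerSize uint64) map[string]int {
	assigned := make(map[string]int)
	remaining := numLayers

	// First pass: dedicated GPUs only.
	for _, d := range devices {
		if d.integrated {
			continue
		}
		n := int(d.freeVRAM / layerSize)
		if n > remaining {
			n = remaining
		}
		assigned[d.name] = n
		remaining -= n
	}

	// Second pass: iGPUs take whatever would otherwise land on the CPU.
	for _, d := range devices {
		if !d.integrated || remaining == 0 {
			continue
		}
		n := int(d.freeVRAM / layerSize)
		if n > remaining {
			n = remaining
		}
		assigned[d.name] += n
		remaining -= n
	}

	assigned["cpu"] = remaining // anything left stays on the CPU
	return assigned
}

func main() {
	devs := []device{
		{name: "dGPU0", integrated: false, freeVRAM: 8 << 30},  // 8 GiB dedicated
		{name: "iGPU0", integrated: true, freeVRAM: 16 << 30},  // 16 GiB shared
	}
	// With 40 layers of 512 MiB each, the dGPU fills first (16 layers)
	// and the iGPU takes only the overflow, instead of a proportional split.
	fmt.Println(assignLayers(devs, 40, 512<<20))
}
```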

In support of this, the PR also updates the llama engine to use the same layout code as the Ollama engine. The llama engine is still subject to inaccurate inputs because it continues to use estimates rather than real allocation data, but the resulting layout is more accurate, especially in multi-GPU setups. Sharing the layout code also means the llama engine picks up the iGPU logic automatically.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/ollama#39918