[PR #12811] [MERGED] Enable op_offload to improve partial offload performance #19236

Closed
opened 2026-04-16 07:01:40 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12811
Author: @jessegross
Created: 10/29/2025
Status: Merged
Merged: 10/30/2025
Merged by: @jessegross

Base: main ← Head: jessegross/op_offload


📝 Commits (2)

  • dbb461c ollamarunner: Worst case batch for token generation
  • 921ba0b ggml: Enable op_offload to improve partial offload performance

📊 Changes

15 files changed (+422 additions, -135 deletions)

View changed files

➖ llama/patches/0019-Enable-CUDA-Graphs-for-gemma3n.patch (+0 -58)
➕ llama/patches/0019-ggml-Add-batch-size-hint.patch (+300 -0)
📝 llama/patches/0022-ggml-No-alloc-mode.patch (+20 -19)
📝 ml/backend.go (+5 -0)
📝 ml/backend/ggml/ggml.go (+17 -1)
📝 ml/backend/ggml/ggml/include/ggml-backend.h (+4 -1)
📝 ml/backend/ggml/ggml/src/ggml-backend-impl.h (+2 -2)
📝 ml/backend/ggml/ggml/src/ggml-backend.cpp (+13 -6)
📝 ml/backend/ggml/ggml/src/ggml-blas/ggml-blas.cpp (+2 -1)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.cpp (+3 -1)
📝 ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu (+29 -37)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal.cpp (+3 -1)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp (+2 -1)
📝 runner/ollamarunner/multimodal.go (+3 -0)
📝 runner/ollamarunner/runner.go (+19 -7)

📄 Description

When a model is partially offloaded to system RAM, we can either do the calculations on the CPU or we can temporarily transfer the data to the GPU to do the calculations there. Small batches tend to be better on the CPU, large batches on the GPU.

The llamarunner used the GPU in most cases and the ollamarunner used the CPU. While the ollamarunner's choice improved token generation performance, it came with a large (3-10x) performance hit in prompt processing.

There is an existing heuristic to dynamically switch between these two modes, but in practice it doesn't have enough information to make that decision accurately. This change supplies the check with authoritative batch-size data so it can choose correctly, getting the best of both worlds.
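
As a rough illustration of the decision this enables (a hedged sketch only, not the PR's actual API; the function name and threshold are hypothetical), a runner that knows the batch size up front can decide directly whether copying CPU-resident weights to the GPU will pay off:

```go
// Hypothetical sketch of the CPU-vs-GPU decision for partially offloaded ops.
// None of these identifiers come from the PR; the threshold is illustrative.
package main

import "fmt"

// offloadToGPU reports whether an op whose weights live in system RAM should
// be run on the GPU. batchSize is the authoritative hint from the runner;
// threshold is the point where the transfer cost is expected to pay off.
func offloadToGPU(batchSize, threshold int) bool {
	// Small batches: the copy to the GPU dominates, so compute on the CPU.
	// Large batches (prompt processing): the GPU wins despite the copy.
	return batchSize >= threshold
}

func main() {
	const threshold = 32 // illustrative value only

	fmt.Println(offloadToGPU(1, threshold))   // token generation -> false (CPU)
	fmt.Println(offloadToGPU(512, threshold)) // prompt processing -> true (GPU)
}
```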


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 07:01:40 -05:00

Reference: github-starred/ollama#19236