[PR #9378] [MERGED] ml: split model across backends #18205

Closed · opened 2026-04-16 06:28:15 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9378
Author: @mxyng
Created: 2/26/2025
Status: Merged
Merged: 3/7/2025
Merged by: @mxyng

Base: `main` ← Head: `mxyng/gpu-split`


📝 Commits (10+)

  • 69e5454 ml/backend/ggml: update model loading for hybrid/multi backends
  • 5094161 model: load non-repeated tensors into multiple backends
  • f1ee773 kvcache: create cache ctx per layer
  • 90fe069 ml/backend/ggml: create tensor on specific backend
  • 83ea2f0 kvcache: update tests
  • e3593c9 ml/backend/ggml: set cpu n_threads
  • de03ba8 ml/backend/ggml: handle user specified cpu offloading
  • 948443b ml/backend/ggml: handle tensor split
  • ef791bf ml/backend/ggml: offload vision to cpu
  • 77add8d ml/backend/ggml: clean up

📊 Changes

8 files changed (+513 additions, -229 deletions)


📝 kvcache/causal.go (+34 -22)
📝 kvcache/causal_test.go (+9 -1)
📝 kvcache/encoder.go (+22 -14)
📝 ml/backend.go (+11 -1)
📝 ml/backend/ggml/ggml.go (+408 -164)
📝 model/models/llama/model.go (+11 -11)
📝 model/models/mllama/model.go (+6 -6)
📝 model/models/mllama/model_text.go (+12 -10)

📄 Description

Assign model weights to backends using a strategy similar to llama.cpp's, and assign inputs, outputs, and cache tensors to their respective backends.
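
As a rough illustration of the proportional scheme llama.cpp uses, the sketch below walks a cumulative tensor split to pick a backend for each layer. Everything here (`backendFor`, the split values) is illustrative, not the PR's actual code.

```go
package main

import "fmt"

// backendFor picks a backend index for a layer by walking a cumulative
// tensor split, mirroring the proportional scheme llama.cpp uses.
func backendFor(layer, numLayers int, split []float32) int {
	var total float32
	for _, s := range split {
		total += s
	}
	pos := float32(layer) / float32(numLayers) // layer's position in [0, 1)
	var cum float32
	for i, s := range split {
		cum += s / total
		if pos < cum {
			return i
		}
	}
	return len(split) - 1
}

func main() {
	split := []float32{6, 4} // two GPUs, 60/40 split
	for _, l := range []int{0, 10, 19, 20, 31} {
		fmt.Printf("layer %2d -> backend %d\n", l, backendFor(l, 32, split))
	}
}
```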

In order to assign tensors accurately, two additional changes have been made to the cache and the models:

  1. The cache now creates a separate context for each layer. This allows the key and value tensors to be created next to their layer's weights (see the sketch after this list).
  2. A tensor previously held at the model level is moved to the layer level. This ensures it uses the backend assigned to its layer.
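
A minimal sketch of the per-layer context idea, assuming stand-in `Context` and `Backend` interfaces; the names `NewContextForLayer` and `ctxForLayer` are hypothetical, not ollama's real ml API.

```go
package kvcache

// Context and Backend are stand-ins for ollama's ml interfaces; the
// NewContextForLayer constructor is an assumed name, not the real API.
type Context interface{ Close() }

type Backend interface {
	NewContextForLayer(layer int) Context
}

// Causal holds one context per layer so key/value tensors allocate on
// the same backend as that layer's weights.
type Causal struct {
	backend Backend
	ctxs    map[int]Context
}

// ctxForLayer returns the layer's context, creating it on first use.
func (c *Causal) ctxForLayer(layer int) Context {
	if c.ctxs == nil {
		c.ctxs = make(map[int]Context)
	}
	ctx, ok := c.ctxs[layer]
	if !ok {
		ctx = c.backend.NewContextForLayer(layer)
		c.ctxs[layer] = ctx
	}
	return ctx
}
```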

TODO (all completed before merge):

  • [x] Accept ml.BackendParams to configure GPU split (sketched below)
  • [x] Check buffer type supports tensor type before assigning
  • [x] Fix multimodal models
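
The first TODO item refers to `ml.BackendParams`. The sketch below shows what such a parameter struct could look like, with the field set inferred from the commit messages (CPU threads, tensor split, user-specified CPU offloading) rather than copied from the actual ml/backend.go.

```go
package main

import "fmt"

// BackendParams approximates the shape of ollama's ml.BackendParams as
// touched by this PR; the exact field names are an assumption based on
// the commit list, not the real definition.
type BackendParams struct {
	NumThreads   int       // CPU thread count (commit e3593c9)
	MainGPU      int       // primary GPU index
	NumGPULayers int       // layers to offload; 0 keeps everything on CPU (commit de03ba8)
	TensorSplit  []float32 // per-GPU fraction of the model (commit 948443b)
}

func main() {
	// Example: split a model 2:1 across two GPUs, keeping 8 CPU threads
	// for anything that stays on the CPU backend.
	p := BackendParams{
		NumThreads:   8,
		MainGPU:      0,
		NumGPULayers: 99,
		TensorSplit:  []float32{2, 1},
	}
	fmt.Printf("%+v\n", p)
}
```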

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 06:28:15 -05:00

Reference: github-starred/ollama#18205