[PR #9235] [MERGED] New engine performance improvements #12896

Closed
opened 2026-04-13 00:12:15 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9235
Author: @jessegross
Created: 2/19/2025
Status: Merged
Merged: 2/20/2025
Merged by: @jessegross

Base: main ← Head: jessegross/new_engine_perf


📝 Commits (2)

  • ec6afc7 ggml-backend: Don't recreate the scheduler for each context
  • 08fdd02 models: Prune unused outputs earlier in the forward pass

📊 Changes

4 files changed (+64 additions, -33 deletions)

📝 ml/backend/ggml/ggml.go (+21 -13)
📝 model/models/llama/model.go (+21 -9)
📝 model/models/mllama/model.go (+2 -4)
📝 model/models/mllama/model_text.go (+20 -7)

📄 Description

This series improves performance by bringing the new engine more in line with how llama.cpp uses GGML. A few notes:

  • The scheduler changes were kept relatively minimal because improvements to split backends (multi-GPU/GPU-CPU hybrid) are expected soon. This series therefore keeps the same structure as before with respect to backends and simply moves where the scheduler is allocated, so that it is no longer recreated for each context.
  • The Rows change to the models mirrors what llama.cpp does in its model implementations. The original location where Rows was called is simpler and more similar to what Transformers/vLLM do (I believe they call it from the general runner code, which removes it from the model completely). Perhaps they are able to automatically optimize the graph to achieve the same result without changing the model definition. In any case, the performance gains are too large to pass up.
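
To illustrate why pruning earlier can pay off, here is a minimal, self-contained sketch. It is not the code from this PR and does not use the ollama or GGML APIs: `rows`, `mix`, `perRow`, and `forward` are hypothetical stand-ins, with `mix` playing the role of a token-mixing step (such as attention) and `perRow` playing the role of position-wise work (such as an MLP). The assumption is that only the rows listed in `outputs` are needed at the end, so the last layer's per-row work can skip every other row once token mixing is done.

```go
package main

import "fmt"

// rows is a hypothetical stand-in for a tensor row-selection op:
// it keeps only the rows of x whose indices are listed in idx.
func rows(x [][]float64, idx []int) [][]float64 {
	out := make([][]float64, len(idx))
	for i, j := range idx {
		out[i] = x[j]
	}
	return out
}

// mix stands in for a step that combines information across tokens
// (a causal running sum here). It needs every row as input.
func mix(x [][]float64) [][]float64 {
	if len(x) == 0 {
		return nil
	}
	out := make([][]float64, len(x))
	acc := make([]float64, len(x[0]))
	for i, r := range x {
		o := make([]float64, len(r))
		for k, v := range r {
			acc[k] += v
			o[k] = acc[k]
		}
		out[i] = o
	}
	return out
}

// perRow stands in for position-wise work. Each row is processed
// independently, and *work counts how many rows were processed.
func perRow(x [][]float64, work *int) [][]float64 {
	out := make([][]float64, len(x))
	for i, r := range x {
		o := make([]float64, len(r))
		for k, v := range r {
			o[k] = 0.5*v + 1
		}
		out[i] = o
		*work++
	}
	return out
}

// forward runs nLayers of mix followed by perRow. With pruneEarly set,
// unused rows are dropped in the last layer right after mixing;
// otherwise they are dropped only after the whole stack has run.
func forward(x [][]float64, outputs []int, nLayers int, pruneEarly bool) ([][]float64, int) {
	work := 0
	pruned := false
	h := x
	for l := 0; l < nLayers; l++ {
		h = mix(h)
		if pruneEarly && l == nLayers-1 {
			h = rows(h, outputs)
			pruned = true
		}
		h = perRow(h, &work)
	}
	if !pruned {
		h = rows(h, outputs)
	}
	return h, work
}

func main() {
	x := [][]float64{{1, 2}, {3, 4}, {5, 6}, {7, 8}}
	outputs := []int{3} // only the last token's output is needed

	late, lateWork := forward(x, outputs, 3, false)
	early, earlyWork := forward(x, outputs, 3, true)

	fmt.Println(lateWork, earlyWork) // 12 9
	fmt.Println(late[0][0] == early[0][0] && late[0][1] == early[0][1]) // true
}
```

Both variants return the same values for the requested rows, but the early-pruning variant does less per-row work in the final layer. In a real model the saving would also extend to whatever follows the last layer, which this toy does not model.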

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:12:15 -05:00
Reference: github-starred/ollama#12896