[PR #9204] [MERGED] ml: Abstract attention out of model definitions #12889

Closed
opened 2026-04-13 00:12:01 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9204
Author: @jessegross
Created: 2/18/2025
Status: Merged
Merged: 2/21/2025
Merged by: @jessegross

Base: main ← Head: jessegross/attention


📝 Commits (1)

  • d428db8 ml: Abstract attention out of model definitions

📊 Changes

5 files changed (+102 additions, -22 deletions)

📝 ml/backend.go (+20 -0)
📝 ml/backend/ggml/ggml.go (+15 -0)
➕ ml/nn/attention.go (+59 -0)
📝 model/models/llama/model.go (+2 -7)
📝 model/models/mllama/model_text.go (+6 -15)

📄 Description

There are two benefits to doing this:

  • Gives models a library function to call, reducing the code in each model implementation
  • Creates a single place to drop in optimized implementations of attention, selected by backend or other factors. One is provided for GGML.

On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:12:01 -05:00
Reference: github-starred/ollama#12889