[PR #9383] [MERGED] attention: Remove unnecessary contiguous operations & Flash attention #12938

Closed
opened 2026-04-13 00:13:07 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9383
Author: @jessegross
Created: 2/27/2025
Status: Merged
Merged: 3/2/2025
Merged by: @jessegross

Base: main ← Head: jessegross/contiguous


📝 Commits (4)

  • ca006b0 attention: Remove unnecessary contiguous operations
  • 264606e ggml-backend: Store parent backend as part of tensor
  • f809110 ml: Empty tensor constructor for tensors
  • e012faa ml: Enable support for flash attention

📊 Changes

12 files changed (+396 additions, -117 deletions)

View changed files

📝 kvcache/cache.go (+11 -0)
📝 kvcache/causal.go (+163 -37)
📝 kvcache/causal_test.go (+8 -4)
📝 kvcache/encoder.go (+31 -2)
📝 kvcache/wrapper.go (+6 -0)
📝 ml/backend.go (+38 -1)
📝 ml/backend/ggml/ggml.go (+82 -22)
📝 ml/nn/attention.go (+31 -20)
📝 model/models/llama/model.go (+1 -8)
📝 model/models/mllama/model.go (+3 -1)
📝 model/models/mllama/model_text.go (+16 -16)
📝 runner/ollamarunner/runner.go (+6 -6)

📄 Description

This PR contains two significant performance improvements related to attention on the new engine:

  • Removing the extra contiguous calls that followed the permutations required before attention. This improves token generation performance by nearly 25%.
  • Implementing support for flash attention, which improves performance by roughly another 10% (see the sketch below).

Both changes place special requirements on how the cache lays out the tensors it returns, so this PR also builds out the infrastructure to support that.
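To make the permute/contiguous point and the flash-attention branch concrete, here is a minimal Go sketch. The interface and method names (`Tensor`, `Permute`, `Contiguous`, `SupportsFlashAttention`, `FlashAttention`, and so on) are hypothetical stand-ins, not the actual ollama `ml` API; the sketch only illustrates where the removed copies used to sit and where a fused kernel would branch off.

```go
// Hypothetical sketch (not the actual ollama ml API) of scaled dot-product
// attention, showing the dropped Contiguous() copies and an optional fused
// flash-attention path.
package attention

// Tensor is a stand-in for a backend tensor handle; method names are illustrative.
type Tensor interface {
	Permute(order ...int) Tensor // lazily reorders dimensions (a view, no copy)
	Contiguous() Tensor          // materializes the permuted layout (an extra copy)
	MatMul(other Tensor) Tensor
	Scale(s float64) Tensor
	Softmax() Tensor
}

// Context is a stand-in for the backend/compute context.
type Context interface {
	SupportsFlashAttention() bool
	FlashAttention(q, k, v Tensor, scale float64) Tensor // fused kernel, if available
}

// Attention computes attention over already-permuted q, k, v tensors.
func Attention(ctx Context, q, k, v Tensor, scale float64) Tensor {
	// Fused path: one kernel computes softmax(q·kᵀ)·v without materializing
	// the full score matrix.
	if ctx.SupportsFlashAttention() {
		return ctx.FlashAttention(q, k, v, scale)
	}

	// Unfused path. Before this change the callers did roughly:
	//   q = q.Permute(0, 2, 1, 3).Contiguous()
	//   k = k.Permute(0, 2, 1, 3).Contiguous()
	// The Contiguous() calls forced a copy after every permute; if the matmul
	// kernels accept permuted (non-contiguous) views, the copies can simply be
	// dropped, which is where the ~25% token-generation win comes from.
	// (Transposition of k is assumed to be handled by the permuted view.)
	scores := q.MatMul(k).Scale(scale).Softmax()
	return scores.MatMul(v)
}
```

The fused path is where the additional ~10% comes from, and per the description both paths depend on the KV cache handing back keys and values in a layout they can consume directly, which is what the kvcache changes in this PR provide.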


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:13:07 -05:00
Reference: github-starred/ollama#12938