[PR #11967] [MERGED] kvcache: Use Cast instead of Copy for flash attention masks #39543

Closed
opened 2026-04-23 00:26:00 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/11967
Author: @jessegross
Created: 8/19/2025
Status: Merged
Merged: 8/19/2025
Merged by: @jessegross

Base: main ← Head: jessegross/cast


📝 Commits (1)

  • aaf3965 kvcache: Use Cast instead of Copy for flash attention masks

📊 Changes

3 files changed (+29 additions, -20 deletions)

View changed files

📝 kvcache/causal.go (+1 -3)
📝 ml/backend.go (+1 -0)
📝 ml/backend/ggml/ggml.go (+27 -17)

📄 Description

Flash attention kernels require the mask of the KV cache to be F16 rather than F32. We can use the GGML operation ggml_cast to do this conversion rather than doing it ourselves, which allows reuse of a preallocated buffer in the graph rather than allocating a new one for each batch. This improves token generation performance with flash attention by 10-30% (with gpt-oss). It also makes performance with flash attention better than without it, as expected.
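
As a rough illustration only (the types and function names below are hypothetical stand-ins, not the actual ollama ml or GGML API), the change can be thought of as replacing a per-batch allocation-plus-copy of the F16 mask with an in-graph cast that writes into a buffer planned once for the whole graph:

```go
package main

import "fmt"

// DType is a hypothetical element-type tag for this sketch.
type DType int

const (
	F32 DType = iota
	F16
)

// Tensor is a stand-in for a graph tensor; it is not the ollama ml.Tensor interface.
type Tensor struct {
	dtype DType
	data  []float32 // backing values kept as float32 for simplicity
}

// castToF16 models the new approach: the result reuses a buffer the graph
// planner preallocated once, so no per-batch allocation is needed.
func castToF16(mask *Tensor, prealloc []float32) *Tensor {
	out := &Tensor{dtype: F16, data: prealloc[:len(mask.data)]}
	copy(out.data, mask.data) // a real backend would also narrow the values to 16-bit
	return out
}

// copyToF16 models the old approach: allocate a fresh tensor for every batch.
func copyToF16(mask *Tensor) *Tensor {
	out := &Tensor{dtype: F16, data: make([]float32, len(mask.data))}
	copy(out.data, mask.data)
	return out
}

func main() {
	mask := &Tensor{dtype: F32, data: []float32{0, -1e9, 0, -1e9}}
	buf := make([]float32, 4) // preallocated once, outside the per-batch path

	fmt.Println(castToF16(mask, buf).dtype == F16) // true; reuses buf
	fmt.Println(copyToF16(mask).dtype == F16)      // true; allocates every batch
}
```

The performance claim in the description comes from avoiding the per-batch allocation, not from the cast itself; the sketch above only shows where that allocation moves.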


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-23 00:26:01 -05:00

Reference: github-starred/ollama#39543