[PR #13448] ollamarunner: Automatically enable flash attention #60910

Open
opened 2026-04-29 16:01:31 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13448
Author: @jessegross
Created: 12/12/2025
Status: 🔄 Open

Base: main ← Head: jessegross/flash


📝 Commits (1)

  • 1200e42 ollamarunner: Automatically enable flash attention

📊 Changes

8 files changed (+246 additions, -153 deletions)


📝 fs/ggml/ggml.go (+1 -31)
📝 kvcache/causal_test.go (+1 -1)
📝 llm/server.go (+19 -58)
📝 ml/backend.go (+1 -1)
📝 ml/backend/ggml/ggml.go (+144 -3)
📝 runner/llamarunner/runner.go (+2 -2)
📝 runner/ollamarunner/multimodal.go (+4 -1)
📝 runner/ollamarunner/runner.go (+74 -56)

📄 Description

If the user hasn't explicitly enabled or disabled flash attention, enable it automatically when the model supports it and doing so would not trigger a fallback to CPU.
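
A rough sketch of that decision, assuming hypothetical names (`resolveFlashAttention` and its parameters are illustrative, not the PR's actual API):

```go
package main

import "fmt"

// resolveFlashAttention decides whether flash attention should be enabled.
// An explicit user setting always wins; otherwise it is enabled only when
// the model supports the fused attention op and using it would not force
// the computation back onto the CPU.
func resolveFlashAttention(userSetting *bool, modelSupported, wouldFallBackToCPU bool) bool {
	if userSetting != nil {
		return *userSetting // user asked for a specific behavior
	}
	return modelSupported && !wouldFallBackToCPU
}

func main() {
	on := true
	fmt.Println(resolveFlashAttention(nil, true, false)) // auto-enabled: true
	fmt.Println(resolveFlashAttention(&on, false, true)) // explicit setting wins: true
}
```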

This covers text, vision, and embedding models, and also handles KV cache quantization automatically (which requires flash attention). If a model never calls the fast fused attention operation, this is detected and any operations that depend on it are disabled, as sketched below.
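
A comparable sketch of the fallback behavior, assuming a hypothetical `attentionTracker` hook that records whether the fused attention op was ever invoked (for example during a warmup pass); the KV cache helper below is illustrative only, not the code in this PR:

```go
package main

import "fmt"

// attentionTracker records whether the fused (flash) attention operation
// was ever invoked, e.g. during a warmup forward pass.
type attentionTracker struct {
	fusedUsed bool
}

// FusedAttention stands in for the backend's fused attention op; models
// that call it mark the tracker.
func (t *attentionTracker) FusedAttention() {
	t.fusedUsed = true
}

// effectiveKVCacheType downgrades a quantized KV cache type to f16 when
// the model never used the fused attention op, since quantized KV caches
// depend on flash attention.
func effectiveKVCacheType(requested string, t *attentionTracker) string {
	if !t.fusedUsed && requested != "f16" {
		fmt.Println("model does not use fused attention; using f16 KV cache instead of", requested)
		return "f16"
	}
	return requested
}

func main() {
	t := &attentionTracker{}
	// Suppose the warmup pass completed and the model never called FusedAttention.
	fmt.Println(effectiveKVCacheType("q8_0", t)) // prints the notice, then "f16"
}
```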


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:01:31 -05:00

Reference: github-starred/ollama#60910