[PR #13333] [MERGED] Enable flash attention for vision encoders #19439

Closed · opened 2026-04-16 07:07:28 -05:00 by GiteaMirror (Owner) · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13333
Author: @jessegross
Created: 12/4/2025
Status: Merged
Merged: 12/4/2025
Merged by: @jessegross

Base: main ← Head: jessegross/flash_vision


📝 Commits (3)

  • d06c607 ggml: Always set cache padding to 256
  • aa118e6 ggml: Enable flash attention for vision encoders
  • 6631245 llm: Enable flash attention for mistral3 by default

📊 Changes

4 files changed (+31 additions, -7 deletions)


📝 fs/ggml/ggml.go (+1 -0)
📝 ml/backend.go (+3 -1)
📝 ml/backend/ggml/ggml.go (+24 -2)
📝 ml/nn/attention.go (+3 -4)

📄 Description

Currently, the vision encoder components of vision models do not use flash attention, even when the text portions do. This is because of the way vision models construct their tensors, which did not meet the requirements of our fast attention path for backends. By relaxing those requirements (along with recently relaxed requirements in the underlying GGML kernels), we can use flash attention for vision models as well.
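To make the requirement relaxation concrete, here is a minimal, hypothetical Go sketch of the kind of gating an attention helper can apply: use the fused flash-attention kernel when it is enabled and the backend accepts the tensor layout, and otherwise fall back to the unfused matmul/softmax path. The `Backend`, `Tensor`, and function names below are illustrative assumptions, not ollama's actual API.

```go
package main

import "fmt"

// Tensor is a stand-in for a backend tensor handle (hypothetical).
type Tensor struct {
	shape []int
	name  string
}

// Backend is a stand-in for a compute backend (hypothetical).
type Backend struct {
	flashAttentionEnabled bool
}

// supportsFlashAttention reports whether the fused kernel can handle the
// given query/key layout. Relaxing checks like this one is what lets
// vision-encoder tensors take the fast path.
func (b *Backend) supportsFlashAttention(q, k Tensor) bool {
	if !b.flashAttentionEnabled {
		return false
	}
	// Illustrative layout check: require only a matching head dimension
	// instead of a full set of stride/padding constraints.
	return len(q.shape) > 0 && len(k.shape) > 0 && q.shape[0] == k.shape[0]
}

// attention dispatches to the fused kernel when possible and otherwise
// falls back to the unfused matmul -> softmax -> matmul path.
func attention(b *Backend, q, k, v Tensor) string {
	if b.supportsFlashAttention(q, k) {
		return "fused flash attention kernel"
	}
	return "unfused matmul + softmax fallback"
}

func main() {
	b := &Backend{flashAttentionEnabled: true}
	q := Tensor{shape: []int{64, 1024}, name: "vision.q"}
	k := Tensor{shape: []int{64, 1024}, name: "vision.k"}
	v := Tensor{shape: []int{64, 1024}, name: "vision.v"}
	fmt.Println("dispatch:", attention(b, q, k, v))
}
```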

This can significantly reduce the size of the compute graph and improve processing speed. For example, with flash attention turned on, the compute graph for ministral-3 at default settings shrinks from 9.1 GB to 882 MB. Flash attention for vision encoders is controlled by the existing flash attention settings for the overall model.
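For reference, the existing model-wide setting is the OLLAMA_FLASH_ATTENTION environment variable (e.g. `OLLAMA_FLASH_ATTENTION=1 ollama serve`). Below is a small, self-contained Go sketch of how such an on/off setting can be read from the environment with a per-architecture default; it is illustrative and not a copy of ollama's envconfig code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// flashAttentionEnabled reads a boolean flag from the environment.
// An unset or unparsable value falls back to the provided default,
// which a model architecture may choose to turn on.
func flashAttentionEnabled(defaultOn bool) bool {
	v := os.Getenv("OLLAMA_FLASH_ATTENTION")
	if v == "" {
		return defaultOn
	}
	on, err := strconv.ParseBool(v)
	if err != nil {
		return defaultOn
	}
	return on
}

func main() {
	// With OLLAMA_FLASH_ATTENTION=1 in the environment this prints true;
	// otherwise it prints the architecture default passed in.
	fmt.Println("flash attention:", flashAttentionEnabled(false))
}
```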

This also enables flash attention by default for the mistral3 architecture to take advantage of the improvements.
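As a hypothetical sketch of what a per-architecture default could look like, the lookup below turns flash attention on by default for mistral3 while leaving other architectures at the global default. The function name and architecture strings are illustrative assumptions, not the actual change in this PR.

```go
package main

import "fmt"

// flashAttentionDefault reports whether flash attention should default to on
// for a given model architecture when the user has not set it explicitly.
// Illustrative only; the real default lives in ollama's llm package.
func flashAttentionDefault(arch string) bool {
	switch arch {
	case "mistral3":
		return true
	default:
		return false
	}
}

func main() {
	for _, arch := range []string{"mistral3", "llama"} {
		fmt.Printf("%s: flash attention default on = %v\n", arch, flashAttentionDefault(arch))
	}
}
```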


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 07:07:28 -05:00

Reference: github-starred/ollama#19439