[GH-ISSUE #8335] Make flash attention configurable via UI or enable by default #5342

Closed
opened 2026-04-12 16:32:32 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @HDembinski on GitHub (Jan 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8335

Hi, I love Ollama, excellent work. It makes using LLMs really beginner friendly, but does not impose any limits on power usage.

I recently learned about flash attention and found out from the FAQ that Ollama supports it. Since flash attention is important for supporting large contexts and can speed up models considerably, it would be great if the option to enable it were more easily accessible.

I am on Windows, and the Ollama Server has a small icon in the notification area. It would be great if you could add a checkbox to enable flash attention and set the KV cache quantization there.
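
For context, here is a minimal sketch of what the FAQ route amounts to today, assuming the `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` environment variables described in the Ollama FAQ, and that `ollama` is on the `PATH`. The helper names `ollama_env` and `start_server` are hypothetical and only for illustration; a tray checkbox would effectively set the same two values.

```python
import os
import subprocess

# KV cache types listed in the Ollama FAQ at the time of writing (assumption:
# the list may change between releases).
KV_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def ollama_env(flash_attention=True, kv_cache_type="q8_0", base=None):
    """Return a copy of the environment with the flash attention settings applied.

    Hypothetical helper, not part of Ollama.
    """
    if kv_cache_type not in KV_CACHE_TYPES:
        raise ValueError(f"unknown KV cache type: {kv_cache_type!r}")
    env = dict(os.environ if base is None else base)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    # Per the FAQ, KV cache quantization is only used with flash attention on,
    # so the cache type is set only in that case.
    if flash_attention:
        env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


def start_server(**settings):
    """Launch `ollama serve` with the settings above (blocks until it exits)."""
    return subprocess.run(["ollama", "serve"], env=ollama_env(**settings), check=True)
```

Calling `start_server(kv_cache_type="q8_0")` would start the server with both settings; any Ollama instance that is already running has to be stopped first, since the variables are read at server startup.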

GiteaMirror added the feature request label 2026-04-12 16:32:32 -05:00

@rick-github commented on GitHub (Feb 9, 2026):

This is the [default](https://github.com/ollama/ollama/pulls?q=is%3Apr+flash+attention+default+is%3Aclosed) for most of the newer models.

@zoltanf commented on GitHub (Apr 11, 2026):

Not for gemma4, unfortunately, and when it's enabled through an environment variable it gives a 50% tokens/sec boost on my M1 Pro Mac.
Also, it's super hard to enable when Ollama is running through a brew service.


@rick-github commented on GitHub (Apr 11, 2026):

https://github.com/ollama/ollama/pull/15378


Reference: github-starred/ollama#5342