[PR #9425] [MERGED] cuda: enable flash attention #38828

Closed
opened 2026-04-22 23:29:21 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9425
Author: @mxyng
Created: 2/28/2025
Status: Merged
Merged: 2/28/2025
Merged by: @mxyng

Base: main ← Head: mxyng/flash-attention


📝 Commits (1)

  • e69faa1 cuda: enable flash attention

📊 Changes

1 file changed (+1 addition, -0 deletions)


📝 CMakeLists.txt (+1 -0)

📄 Description

ggml added an option to disable flash attention, so explicitly enable it. With the option left unset, the flash attention kernels were not built, which forced those operations onto the CPU and caused slowdowns.

resolves #9415
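
As illustration only (the one-line diff is not reproduced in this mirror): a minimal sketch of what explicitly enabling the kernels from CMakeLists.txt could look like, assuming the ggml option in question is `GGML_CUDA_FA`. The exact variable name, value, and placement used by commit e69faa1 are an assumption here.

```cmake
# Hedged sketch, not the verbatim change from commit e69faa1:
# force ggml's CUDA flash attention kernels to be compiled, overriding
# whatever default the vendored ggml build would otherwise pick.
set(GGML_CUDA_FA ON CACHE BOOL "compile CUDA flash attention kernels" FORCE)
```

The same option could instead be toggled at configure time, e.g. `cmake -B build -DGGML_CUDA_FA=ON`, whereas the PR's one-line change to CMakeLists.txt appears to bake the setting into the project build so users need not pass the flag themselves.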


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-22 23:29:21 -05:00
Reference: github-starred/ollama#38828