[PR #15378] [MERGED] gemma4: enable flash attention #61829

Closed
opened 2026-04-29 16:50:11 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15378
Author: @dhiltgen
Created: 4/7/2026
Status: Merged
Merged: 4/7/2026
Merged by: @dhiltgen

Base: main ← Head: gemma4-fa


📝 Commits (1)

  • 0ba0b14 gemma4: enable flash attention

📊 Changes

20 files changed (+559 additions, -36 deletions)

View changed files

📝 fs/ggml/ggml.go (+1 -0)
📝 llama/patches/0020-ggml-No-alloc-mode.patch (+23 -22)
📝 llama/patches/0022-ggml-Enable-resetting-backend-devices.patch (+2 -2)
📝 llama/patches/0024-GPU-discovery-enhancements.patch (+2 -2)
➕ llama/patches/0036-backport-kernels-for-gemma4.patch (+416 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-mma-f16.cuh (+25 -1)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cu (+4 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (+29 -8)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (+10 -1)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_1-ncols2_8.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_8.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_8.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_8.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-device.m (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal (+19 -0)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal.metal (+19 -0)

📄 Description

Backport GGML kernels so we can enable flash attention for the Gemma 4 model on Metal and CUDA.

No significant performance change, but this does reduce VRAM usage, which allows larger context sizes.
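For anyone trying this out: flash attention in Ollama is toggled server-side via the OLLAMA_FLASH_ATTENTION environment variable. A minimal usage sketch follows; the gemma4 model tag is a placeholder, since this PR doesn't name an exact tag.

# start the server with flash attention enabled
OLLAMA_FLASH_ATTENTION=1 ollama serve

# in a second shell, run a Gemma 4 model (tag is illustrative only)
ollama run gemma4 "Hello"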

Fixes #15368
Fixes #15350
Fixes #15237


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:50:11 -05:00

Reference: github-starred/ollama#61829