[PR #15893] cuda: prevent Gemma 4 Dense FA hang for head_dim=512 #77638

Open
opened 2026-05-05 10:18:45 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15893
Author: @VrtxOmega
Created: 4/30/2026
Status: 🔄 Open

Base: main ← Head: fix/gemma4-fa-hang-dkq512


📝 Commits (1)

  • 9777987 cuda: prevent Gemma 4 Dense FA hang for head_dim=512

📊 Changes

2 files changed (+32 additions, -2 deletions)


📝 fs/ggml/ggml.go (+21 -1)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (+11 -1)

📄 Description

Fixes #15350.

Root Cause

Template-instantiation gap in the FA MMA dispatcher: for DKQ = DV = 512, only ncols2 in {4, 8} are explicitly instantiated, but the dispatcher can select ncols2 in {1, 2} whenever gqa_ratio % 4 != 0, so the launch falls through to implicitly instantiated kernel variants that hang the GPU.
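Below is a minimal C++ sketch of the dispatch gap and of the tightened guard described under Changes; the function names, the fallback behavior, and the exact ncols2 selection order are illustrative assumptions, not verbatim fattn.cu code.

```cpp
#include <cstdio>

// Stand-in for the FA MMA kernel launcher; in fattn.cu only certain
// <DKQ, DV, ncols2> combinations are explicitly instantiated.
template <int DKQ, int DV, int ncols2>
static void launch_fattn_mma() {
    std::printf("MMA FA: DKQ=%d DV=%d ncols2=%d\n", DKQ, DV, ncols2);
}

// Before: the generic ncols2 selection can reach ncols2 = 2 or 1 for
// head_dim = 512 when gqa_ratio % 4 != 0, landing on variants that were
// never explicitly instantiated and hang the GPU.
static void dispatch_512_buggy(int gqa_ratio) {
    if      (gqa_ratio % 8 == 0) launch_fattn_mma<512, 512, 8>(); // instantiated
    else if (gqa_ratio % 4 == 0) launch_fattn_mma<512, 512, 4>(); // instantiated
    else if (gqa_ratio % 2 == 0) launch_fattn_mma<512, 512, 2>(); // gap -> hang
    else                         launch_fattn_mma<512, 512, 1>(); // gap -> hang
}

// After: the case 512 guard requires gqa_ratio % 4 == 0 before taking the
// MMA path (mirroring the case 576 precedent), so only the explicitly
// instantiated variants remain reachable.
static void dispatch_512_fixed(int gqa_ratio) {
    if (gqa_ratio % 4 != 0) {
        std::printf("fall back: no MMA FA for this shape\n");
        return;
    }
    if (gqa_ratio % 8 == 0) launch_fattn_mma<512, 512, 8>();
    else                    launch_fattn_mma<512, 512, 4>();
}

int main() {
    dispatch_512_buggy(2); // selects ncols2 = 2: the hang case
    dispatch_512_fixed(2); // now falls back instead
}
```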

PR #15296 / revert #15311 disabled FA entirely for Gemma 4. This PR fixes the underlying dispatcher gap.

Changes (+32 / -2)

  1. fattn.cu — tighten the case 512 guard to require gqa_ratio % 4 == 0, mirroring the existing case 576 precedent (illustrated in the sketch above)
  2. ggml.go — extend SupportsFlashAttention() to also inspect key_length_swa/value_length_swa for hybrid architectures (see the Go sketch after this list)
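
For change 2, a hedged Go sketch of the idea: the real SupportsFlashAttention() in fs/ggml/ggml.go reads these keys through its own KV accessors and performs additional checks; kvUint, the simplified head-dim predicate, and the gemma4 key prefix below are hypothetical stand-ins.

```go
package main

import "fmt"

// kvUint is a hypothetical stand-in for the GGUF metadata accessors used
// in fs/ggml/ggml.go; missing keys read as 0.
func kvUint(kv map[string]uint64, key string) uint64 {
	return kv[key]
}

// supportsFlashAttention sketches the extended check: hybrid architectures
// can carry separate sliding-window head dims, so a supported global
// key/value length must not mask an unsupported SWA one.
func supportsFlashAttention(kv map[string]uint64, arch string) bool {
	// Global head dims must match (simplified from the real checks).
	if kvUint(kv, arch+".attention.key_length") !=
		kvUint(kv, arch+".attention.value_length") {
		return false
	}
	// New in this PR: also inspect the SWA head dims when present.
	swaK := kvUint(kv, arch+".attention.key_length_swa")
	swaV := kvUint(kv, arch+".attention.value_length_swa")
	if (swaK != 0 || swaV != 0) && swaK != swaV {
		return false
	}
	return true
}

func main() {
	kv := map[string]uint64{
		"gemma4.attention.key_length":       512,
		"gemma4.attention.value_length":     512,
		"gemma4.attention.key_length_swa":   512,
		"gemma4.attention.value_length_swa": 512,
	}
	fmt.Println(supportsFlashAttention(kv, "gemma4")) // true
}
```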

Full root-cause analysis in the issue comment.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:18:45 -05:00

Reference: github-starred/ollama#77638