[PR #13872] [MERGED] llama: fix fattn-tile shared memory overflow on sm_50/52 #14429

Closed
opened 2026-04-13 00:53:48 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13872
Author: @jmorganca
Created: 1/24/2026
Status: Merged
Merged: 1/24/2026
Merged by: @jmorganca

Base: main ← Head: fix-cuda12-fattn-shmem


📝 Commits (1)

  • e3fdc39 ggml-cuda: fix fattn-tile shared memory overflow on sm_50/52

📊 Changes

2 files changed (+18 additions, -19 deletions)


📝 llama/patches/0032-ggml-enable-MLA-flash-attention-for-GLM-4.7-flash.patch (+11 -12)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (+7 -7)

📄 Description

Use nthreads=128 for ncols=4 configurations in the flash attention tile kernel to keep shared memory usage below the 48 KB limit on Maxwell architectures (sm_50/52).

With nthreads=256 and ncols=4, np=2, which pushed shared memory usage above 48 KB. With nthreads=128 and ncols=4, np=1, which keeps shared memory usage under the limit.
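The relationship between nthreads, ncols, and np can be sketched as compile-time arithmetic. This is an illustrative sketch only, not the code from fattn-tile.cuh: it assumes a warp size of 32 and assumes np equals the warp count divided by ncols when there are more warps than columns, and 1 otherwise. The helper names are hypothetical. Under those assumptions it reproduces the two np values quoted above. The kernel's actual shared memory footprint is not modeled here.

```cpp
// Illustrative sketch; NOT the code from fattn-tile.cuh.
// Assumptions (not stated in the PR): warp size is 32, and
// np = nwarps / ncols when nwarps > ncols, otherwise 1.

constexpr int kWarpSize = 32;                      // assumed warp size
constexpr int kSharedMemLimitBytes = 48 * 1024;    // 48 KB limit cited in the description

// Number of warps in a thread block of `nthreads` threads.
constexpr int warps_per_block(int nthreads) {
    return nthreads / kWarpSize;
}

// Hypothetical helper: the np value for a given configuration.
constexpr int np_for_config(int nthreads, int ncols) {
    return warps_per_block(nthreads) > ncols
        ? warps_per_block(nthreads) / ncols
        : 1;
}

// Hypothetical helper: does a shared memory request fit under the 48 KB limit?
constexpr bool fits_shared_mem_limit(int bytes) {
    return bytes <= kSharedMemLimitBytes;
}

// The two configurations discussed in the description.
static_assert(np_for_config(256, 4) == 2, "nthreads=256, ncols=4 gives np=2");
static_assert(np_for_config(128, 4) == 1, "nthreads=128, ncols=4 gives np=1");
```

Halving nthreads from 256 to 128 halves the warp count from 8 to 4, so with ncols=4 there is no longer more than one warp per column and np drops to 1.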

This should fix the release CI.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:53:48 -05:00
Reference: github-starred/ollama#14429