[PR #14783] llama: add D=88 flash attention support (for Llama 4 Vision) #46082

Open
opened 2026-04-25 01:37:32 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14783
Author: @caffeinatedbits
Created: 3/11/2026
Status: 🔄 Open

Base: main ← Head: fix-llama4-vision


📝 Commits (1)

  • f81694d llama: add D=88 flash attention support for llama-4 vision

📊 Changes

5 files changed (+209 additions, -3 deletions)

View changed files

➕ llama/patches/0035-ggml-cuda-add-flash-attention-support-for-head-size-.patch (+169 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cu (+4 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (+27 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (+4 -3)
➕ ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-tile-instance-dkq88-dv88.cu (+5 -0)

📄 Description

Llama 4 Vision models use a projector head dimension of D=88.

Currently, the vendored ggml-cuda backend does not include head size 88 in its flash attention dispatch tables.

This causes a failure during warmup (flash attention not supported by CUDA0), forcing the backend to fall back to unoptimized operations (SOFT_MAX, PERMUTE, CONCAT). This results in extreme VRAM bloat, severe inference slowdowns, and inevitable CPU fallback or OOM crashes depending on context size.
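
To illustrate the failure mode, here is a minimal sketch of the kind of head-size gate the description refers to. The function name and the exact set of supported sizes are illustrative assumptions, not the actual ggml-cuda code.

```cpp
// Minimal sketch, NOT the actual ggml-cuda code: the function name and the
// set of supported sizes below are illustrative assumptions.
#include <cstdint>

// Returns true when a fused flash-attention kernel exists for this head size.
// Without an entry for 88, Llama 4 Vision (D=88) fails this check at warmup
// and the graph falls back to unfused SOFT_MAX/PERMUTE/CONCAT operations.
static bool flash_attn_head_size_supported(int64_t d_head) {
    switch (d_head) {
        case  64:
        case  80:
        case  96:
        case 112:
        case 128:
        case 256:
            return true;
        case  88:  // the addition this PR makes; handled by the TILE kernel only
            return true;
        default:
            return false;
    }
}
```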

Upstream PR:

This has already been submitted and is actively being discussed upstream in llama.cpp here: https://github.com/ggml-org/llama.cpp/pull/20375

Changes:

  • fattn.cu: Added 88 to the allowed head-size switch. Explicitly excluded 88 from the Turing/Volta WMMA/MMA Tensor Core checks to prevent misaligned accesses and segfaults, routing it safely to the TILE kernel instead (see the sketch after this list).
  • fattn-tile.cuh & .cu: Defined specific memory alignment configurations for 88.
  • template-instances: Dynamically generated and included fattn-tile-instance-dkq88-dv88.cu.
  • Generated llama/patches/0035-... using make -f Makefile.sync.
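
A minimal sketch of the kernel-selection change described in the first bullet, using hypothetical stand-in names for the ggml-cuda internals; it is not the actual dispatch code in fattn.cu, and the macro shown in the trailing comment is an assumption about the shape of the generated instance file, not its verbatim contents.

```cpp
// Minimal sketch of the routing decision, with simplified stand-in names for
// the ggml-cuda internals (the real dispatch in fattn.cu is more involved).
#include <cstdint>

enum class fattn_kernel { TILE, WMMA_OR_MMA };

// One plausible reason D=88 is kept off the tensor-core paths: WMMA/MMA tile
// fragments assume head sizes that are multiples of 16, and 88 is not, so
// forcing it through those kernels risks misaligned accesses and crashes.
static fattn_kernel select_fattn_kernel(int64_t d_head, bool has_tensor_cores) {
    const bool tensor_core_friendly = (d_head % 16 == 0);
    if (has_tensor_cores && tensor_core_friendly) {
        return fattn_kernel::WMMA_OR_MMA;
    }
    return fattn_kernel::TILE; // D=88 always lands here
}

// The generated instance file fattn-tile-instance-dkq88-dv88.cu is likely a
// small macro expansion along these lines (macro name is a guess):
//
//   #include "../fattn-tile.cuh"
//   DECL_FATTN_TILE_CASE(88, 88);
```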

Results:

Tested on an NVIDIA DGX Spark (Grace-Blackwell GB10 APU, Compute 12.1, 128GB Unified Memory) running a 109B MoE Llama 4 Vision model (q8_0 KV cache, 256k context window).

  • Warmup completes successfully (warmup: flash attention is enabled).
  • Unoptimized graph fallback is completely eliminated.
  • VRAM usage normalizes, allowing massive models to be 100% offloaded to the GPU with a 256K context window.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:37:32 -05:00

Reference: github-starred/ollama#46082