[PR #14862] Vulkan flash-attention backport for mask-opt / split-k on WSL2 Intel Arc #14879

Open
opened 2026-04-13 01:04:40 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14862
Author: @oldeucryptoboi
Created: 3/15/2026
Status: 🔄 Open

Base: main ← Head: laurent/vulkan-flash-attn-mask-opt-wsl2


📝 Commits (1)

  • 1d381e8 vulkan: backport flash-attn mask-opt fixes

📊 Changes

9 files changed (+430 additions, -178 deletions)

Changed files:

📝 ml/backend/ggml/ggml/src/ggml-vulkan/CMakeLists.txt (+8 -1)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp (+177 -86)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn.comp (+28 -26)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl (+21 -6)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm1.comp (+10 -9)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp (+44 -42)
➕ ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_mask_opt.comp (+132 -0)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_split_k_reduce.comp (+9 -8)
📝 ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (+1 -0)

📄 Description

Summary

This backports the Vulkan flash-attention changes needed for the
mask-optimization path and related split-k indexing fixes.

On my WSL2 Ubuntu system with an Intel Arc 140T GPU, the latest local
ollama-main build was failing in the Vulkan flash-attention path after the
incremental llama.cpp backports. With this patch applied, the model loads on
Vulkan, flash attention stays enabled, and generation completes successfully.

The relevant code is in:

  • ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn.comp
  • ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl
  • ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm1.comp
  • ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp
  • ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_mask_opt.comp
  • ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_split_k_reduce.comp
  • ml/backend/ggml/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • ml/backend/ggml/ggml/src/ggml-vulkan/CMakeLists.txt

Repro Environment

  • Host: WSL2 Ubuntu
  • GPU: Intel Arc 140T
  • Vulkan device as reported by logs:
    Microsoft Direct3D12 (Intel(R) Arc(TM) 140T GPU (48GB))
  • Backend: Vulkan via Dozen (a quick device check is sketched after this list)
  • Model used for the main repro:
    qwen35-9b-q8-local:latest
  • Context size for the main comparison:
    ctx=4096
  • Flash attention: enabled
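
One quick way to confirm which Vulkan device WSL2 is exposing (not part of this
patch; it assumes the vulkan-tools package that ships vulkaninfo is installed):

vulkaninfo --summary | grep -i devicename

On this setup the reported device name should match the Microsoft Direct3D12 /
Dozen entry quoted above.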

Before

The Vulkan flash-attention path was broken after the incremental
llama.cpp backports: the failure surfaced whenever flash attention was active,
because the tree was missing the full set of Vulkan-side changes needed for the
mask optimization and the related indexing/layout updates.

After

With this patch:

  • the Vulkan backend loads successfully
  • flash attention remains enabled
  • the model is placed on GPU rather than falling back to CPU
  • generation completes successfully on the 9B model

In the successful 9B run on this machine:

  • 33/33 layers were offloaded to GPU (a quick way to double-check placement is shown after this list)
  • the request completed through ollama serve
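
That placement can be double-checked against the same running server with the
stock ollama ps command (not part of this patch, just a convenience):

OLLAMA_HOST=127.0.0.1:11441 ollama ps

While the model is loaded, the PROCESSOR column should report something close
to 100% GPU when all layers are offloaded; any CPU percentage there would
indicate fallback.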

Testing

Passed locally:

go test ./...
GOEXPERIMENT=synctest go test ./...
cmake --build /home/laurent/src/ollama-main/build --parallel 12
go build -o build/ollama .

Manual WSL2 Vulkan validation:

OLLAMA_HOST=127.0.0.1:11441 \
OLLAMA_LIBRARY_PATH=/home/laurent/src/ollama-main/build/lib/ollama \
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_DEBUG=1 \
./build/ollama serve

Then, in a second shell:

curl http://127.0.0.1:11441/api/generate -d '{
  "model": "qwen35-9b-q8-local:latest",
  "prompt": "hi",
  "stream": false,
  "options": { "num_predict": 4 }
}'
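
To derive a rough decode-speed figure from that same request (purely a
convenience, assuming jq is available; the non-streaming /api/generate response
carries eval_count and eval_duration, the latter in nanoseconds):

curl -s http://127.0.0.1:11441/api/generate -d '{
  "model": "qwen35-9b-q8-local:latest",
  "prompt": "hi",
  "stream": false,
  "options": { "num_predict": 4 }
}' | jq '{eval_count, eval_duration, tok_per_s: (.eval_count / .eval_duration * 1e9)}'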

Observed locally:

  • Vulkan backend loaded on WSL2 / Dozen
  • flash attention enabled
  • 33/33 layers offloaded to GPU
  • generation completed successfully on the 9B model

I also compared this against latest local upstream llama.cpp with matched
settings: same GGUF blob, ctx=4096, batch=512, ubatch=512, flash
attention enabled, full Vulkan offload.
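
The exact upstream command line is not part of this PR; a matched llama.cpp run
with those settings would look roughly like the following (flag spellings can
vary between llama.cpp versions, and the binary and model paths here are
placeholders):

./build/bin/llama-cli -m /path/to/same-gguf-blob.gguf \
  -c 4096 -b 512 -ub 512 -fa -ngl 99 \
  -p "hi" -n 4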

The main difference in that comparison was load/staging: latest llama.cpp
spent about 36s more before generation began. Once the model was loaded,
llama.cpp was faster on token work in that run. Prompt accounting was not
perfectly symmetric because llama.cpp reported 0.00 ms prompt-eval for the
1-token prompt.

Both runs used the GPU and completed successfully; neither run reproduced the
hang-before-first-token behavior.

Notes

  • I also tried a narrow opt-in integration-test subset with a substituted local
    model tag (the rough shape of that run is sketched after these notes). That
    subset is not part of the repo's documented default local suite and it was
    not green with that custom setup, so I am not counting it as a passing
    validation signal for this patch.
  • Local benchmark notes and logs were kept out of this code diff.
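
For completeness, the opt-in subset mentioned above was run in roughly this
shape (illustrative only: I believe the repo gates its integration tests behind
the integration build tag, and the model tag was substituted locally, so do not
treat this as a documented or passing workflow):

go test -tags=integration ./integration/... -v -timeout 30m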

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 01:04:40 -05:00

Reference: github-starred/ollama#14879