[GH-ISSUE #9332] Flash Attention Enabled Incorrectly Due to Fallback in Head Count Metadata #52604

opened 2026-04-28 23:50:06 -05:00 by GiteaMirror · 0 comments

Originally created by @ItzCrazyKns on GitHub (Feb 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9332

What is the issue?

When loading a model whose metadata lacks `attention.key_length` and `attention.value_length`, the fallback mechanism in the `kv.Uint` function returns the embedding head count for both fields. Because the two values are then identical, the check in `SupportsFlashAttention()`, which compares the key and value head counts, incorrectly passes even if the model does not actually support Flash Attention. As a result, Flash Attention is enabled, leading to crashes such as segmentation faults (e.g., SIGSEGV in `llama/ggml-cuda/fattn.cu:67`).
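A minimal Go sketch of the failure mode follows. The `KV` type, the fixed default, and the helper bodies are assumptions for illustration only; just the function names and metadata keys come from the issue.

```go
package main

import "fmt"

// KV is a hypothetical stand-in for the model's GGUF metadata; the real
// structure in ollama differs, but the fallback behaviour is the same idea.
type KV map[string]any

// Uint returns the metadata value for key, or defaultValue when it is absent.
func (kv KV) Uint(key string, defaultValue uint32) uint32 {
	if v, ok := kv[key].(uint32); ok {
		return v
	}
	return defaultValue
}

// EmbeddingHeadCount would normally be derived from other metadata; a fixed
// placeholder value is enough for this sketch.
func (kv KV) EmbeddingHeadCount() uint32 { return 128 }

// EmbeddingHeadCountK falls back to the embedding head count when
// attention.key_length is missing.
func (kv KV) EmbeddingHeadCountK() uint32 {
	return kv.Uint("attention.key_length", kv.EmbeddingHeadCount())
}

// EmbeddingHeadCountV falls back to the embedding head count when
// attention.value_length is missing.
func (kv KV) EmbeddingHeadCountV() uint32 {
	return kv.Uint("attention.value_length", kv.EmbeddingHeadCount())
}

// supportsFlashAttention sketches the equality check described above.
func supportsFlashAttention(kv KV) bool {
	headCountK := kv.EmbeddingHeadCountK()
	headCountV := kv.EmbeddingHeadCountV()
	return headCountK != 0 && headCountV != 0 && headCountK == headCountV
}

func main() {
	// Metadata without attention.key_length / attention.value_length:
	// both lookups fall back to the same value, so the check passes even
	// though the real head sizes are unknown.
	fmt.Println(supportsFlashAttention(KV{})) // prints: true
}
```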

Steps to Reproduce:

  1. Load a model that does not include `attention.key_length` and `attention.value_length` in its metadata.
  2. Observe that `EmbeddingHeadCountK()` and `EmbeddingHeadCountV()` fall back to the embedding head count.
  3. `SupportsFlashAttention()` then compares these equal values and incorrectly concludes that the model supports Flash Attention.
  4. Enabling Flash Attention under these conditions leads to crashes (segmentation faults).

Expected Behavior:
If the model metadata is missing `attention.key_length` and `attention.value_length`, the server should disable Flash Attention by default.
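One possible shape of the fix, reusing the hypothetical `KV` type from the sketch above (an assumption about how it could be addressed, not the actual patch): treat a missing metadata key as "unsupported" instead of silently reusing the embedding head count.

```go
// supportsFlashAttentionStrict disables Flash Attention when the head-size
// metadata is absent, rather than relying on the fallback values.
func supportsFlashAttentionStrict(kv KV) bool {
	if _, ok := kv["attention.key_length"]; !ok {
		return false // metadata missing: default to disabling Flash Attention
	}
	if _, ok := kv["attention.value_length"]; !ok {
		return false
	}
	headCountK := kv.EmbeddingHeadCountK()
	headCountV := kv.EmbeddingHeadCountV()
	return headCountK != 0 && headCountV != 0 && headCountK == headCountV
}
```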

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-28 23:50:06 -05:00

Reference: github-starred/ollama#52604