[GH-ISSUE #13337] [Docs/Code] Clarify supported architectures for Flash Attention and KV Cache Quantization #8809

Closed
opened 2026-04-12 21:35:38 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @chakka-guna-sekhar-venkata-chennaiah on GitHub (Dec 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13337

Hi Team,

I have been analyzing the source code to optimize memory usage for deployments on constrained GPUs (e.g., Nvidia L4), specifically looking into OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE.

I noticed that while the FAQ mentions how to enable these features via environment variables, it does not explicitly list which model architectures are actually supported.

Upon reviewing fs/ggml/ggml.go, I found that FlashAttention() relies on a specific allowlist:

```go
// fs/ggml/ggml.go
func (f GGML) FlashAttention() bool {
    return slices.Contains([]string{
        "gemma3",
        "gptoss", "gpt-oss",
        "mistral3",
        "qwen3", "qwen3moe",
        "qwen3vl", "qwen3vlmoe",
    }, f.KV().String("general.architecture"))
}
```

The Issue:
Developers might attempt to force OLLAMA_KV_CACHE_TYPE=q8_0 on architectures like command-r or llama3 (standard), expecting memory savings. However, because these architectures are not in the allowlist (or SupportsFlashAttention returns false), the server silently falls back to f16, leading to unexpected OOMs or higher VRAM usage than calculated.

It would be very helpful to add a small table or note in docs/faq.md or docs/gpu.md listing the architectures that currently support Flash Attention (and thus KV Quantization).

This would save developers significant time when debugging memory usage and planning capacity for specific models.

I hope this makes sense. By the way, I'm interested in adding and maintaining that table in the docs/faq.mdx file whenever a new model gets added to the FlashAttention function in fs/ggml/ggml.go.

GiteaMirror added the documentation label 2026-04-12 21:35:38 -05:00

@rick-github commented on GitHub (Dec 5, 2025):

FlashAttention is not an allowlist, it sets the default value of OLLAMA_FLASH_ATTENTION to true for the listed model architectures. That is, in the absence of the environment variable, flash attention is provisionally enabled for these architectures and disabled for the rest. Flash attention can be enabled for other architectures by setting the environment variable.
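
Roughly, the precedence works like this (a minimal sketch of the behaviour described above, not Ollama's actual envconfig code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// flashAttentionEnabled mirrors the described precedence: an explicit
// OLLAMA_FLASH_ATTENTION value wins; otherwise the per-architecture
// default from FlashAttention() applies.
func flashAttentionEnabled(archDefault bool) bool {
	raw, ok := os.LookupEnv("OLLAMA_FLASH_ATTENTION")
	if !ok || raw == "" {
		return archDefault // no env var set: use the architecture default
	}
	if v, err := strconv.ParseBool(raw); err == nil {
		return v // explicit setting overrides the default
	}
	return archDefault // unparseable value: fall back to the default
}

func main() {
	// command-r is not in the default-on list, so archDefault is false here,
	// but OLLAMA_FLASH_ATTENTION=1 in the server's environment would flip it.
	fmt.Println(flashAttentionEnabled(false))
}
```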

For example: command-r without FA:

```
command-r:latest    7d96360d357f    57 GB    100% GPU     131072     Forever
```

command-r with FA, CQ q4_0:

```
command-r:latest    7d96360d357f    41 GB    100% GPU     131072     Forever
```

However, enabling flash attention (either by default or by explicitly setting the variable) does not mean that flash attention will be used by the model. Not only does the model have to support FA, but so does the GPU the model is running on. Additionally, in a multi-GPU environment, all GPUs must support FA, otherwise no GPU will use it.
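
In other words, the scheduler-side condition is all-or-nothing, along these lines (the struct and field names are hypothetical, not Ollama's actual GPU discovery types):

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for per-GPU capability info, just to
// illustrate the all-or-nothing rule described above.
type gpu struct {
	name       string
	supportsFA bool
}

func allSupportFlashAttention(gpus []gpu) bool {
	for _, g := range gpus {
		if !g.supportsFA {
			return false // a single unsupported GPU disables FA for the whole model
		}
	}
	return len(gpus) > 0
}

func main() {
	gpus := []gpu{{"L4", true}, {"older-card", false}}
	fmt.Println(allSupportFlashAttention(gpus)) // false: FA stays off on every GPU
}
```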

So a table would only be able to say that a given model may support FA, and since that is already true by default for every architecture, it offers no extra guidance. The source of truth is the logs: if the developer enables FA and CQ, the logs will attest to FA being enabled (enabling flash attention) and to a valid CQ value being used (otherwise kv cache type not supported by model).

Feel free to raise a PR that adds some explanatory text to the FAQ entry.

Reference: github-starred/ollama#8809