[GH-ISSUE #15043] When flash attention is not supported, quantized KV cache should be disregarded instead of aborting the model run. #71718

Closed
opened 2026-05-05 02:24:07 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @gordan-bobic on GitHub (Mar 24, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15043

What is the issue?

With `OLLAMA_FLASH_ATTENTION=1`, flash_attn is automatically disabled when an incompatible model is loaded; this is reasonable behaviour. However, when `OLLAMA_KV_CACHE_TYPE=q8_0` is also set and flash_attn was auto-disabled due to incompatibility, ollama panics because V cache quantization requires flash_attn.
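
For reproduction, the failing combination boils down to the two environment variables below. How they are actually set on the reporting host (e.g. a systemd override vs. a shell export) is not shown in the issue, so this is only an assumed setup:

```shell
# Assumed reproduction setup; the exact mechanism used on the reporting host is a guess.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
# Loading a model that llama.cpp marks as flash_attn-incompatible (here Grok)
# then aborts instead of falling back to an unquantized (f16) KV cache.
```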

What should happen: when flash_attn isn't enabled for whatever reason, V cache quantization should be automatically disabled and ignored for that model, so that it doesn't have to be disabled globally.
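
For illustration, a minimal Go sketch of the fallback being requested. The function name, parameters, and log message are hypothetical and are not taken from the ollama codebase:

```go
package llamarunner

import "log/slog"

// resolveKVCacheType is a hypothetical helper sketching the requested
// behaviour: if flash attention did not survive the per-model compatibility
// check, a quantized KV cache type is dropped back to f16 rather than being
// passed to llama.cpp, where it would make context creation fail and the
// runner panic.
func resolveKVCacheType(flashAttnEnabled bool, kvCacheType string) string {
	if !flashAttnEnabled && kvCacheType != "" && kvCacheType != "f16" {
		slog.Warn("flash attention unavailable for this model; ignoring quantized KV cache type",
			"requested", kvCacheType)
		return "f16"
	}
	return kvCacheType
}
```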

Relevant log output

Mar 24 21:17:29 jarvis ollama[2742822]: llama_init_from_model: flash_attn is not compatible with Grok - forcing off
Mar 24 21:17:29 jarvis ollama[2742822]: llama_init_from_model: V cache quantization requires flash_attn
Mar 24 21:17:29 jarvis ollama[2742822]: panic: unable to create llama context
Mar 24 21:17:29 jarvis ollama[2742822]: goroutine 221 [running]:
Mar 24 21:17:29 jarvis ollama[2742822]: github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc0005943c0, {{0xc000596540, 0x2, 0x2}, 0xe, 0x0, 0x1, {0xc000596528, 0x2, 0x2}, ...}, ...)
Mar 24 21:17:29 jarvis ollama[2742822]: #011/builddir/build/BUILD/ollama-0.17.7/runner/llamarunner/runner.go:849 +0x333
Mar 24 21:17:29 jarvis ollama[2742822]: created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 227

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.7

GiteaMirror added the bug label 2026-05-05 02:24:07 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 24, 2026):

What model are you loading?

Author
Owner

@gordan-bobic commented on GitHub (Mar 24, 2026):

Grok 2.5 (Unsloth's edition).
For what it's worth, this bug seems to be inherited from llama.cpp upstream: it is smart enough to ignore flash_attn when it isn't supported, but not smart enough to ignore KV cache quantization when flash_attn isn't available.

Author
Owner

@rick-github commented on GitHub (Mar 24, 2026):

> this bug seems to be inherited from llama.cpp upstream

Yes, but ollama should be disabling FA and KVCQ for this model like it does for other models that don't support FA. If you can provide a link to the model and the quant, it can be examined for appropriate parameters.

Author
Owner

@gordan-bobic commented on GitHub (Mar 24, 2026):

I am using the Q3-K-XL variant mentioned here, but I tried other quantisations as well. The only change I made to the model is that I merged the shards into a single gguf for import into ollama.
https://unsloth.ai/docs/models/tutorials/grok-2#run-in-llama.cpp
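
For reference, merging the shards was presumably done with llama.cpp's gguf-split tool along these lines (file names are placeholders, not the exact files from the report):

```shell
# Merge a sharded GGUF into a single file with llama.cpp's gguf-split tool
# (binary name varies by build: llama-gguf-split or gguf-split).
llama-gguf-split --merge grok-2-Q3_K_XL-00001-of-0000N.gguf grok-2-Q3_K_XL.gguf
```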

Reference: github-starred/ollama#71718