[PR #11685] ggml: Prevent kv cache quantization on gpt-oss #13604

Closed
opened 2026-04-13 00:31:06 -05:00 by GiteaMirror · 0 comments
Owner

Original Pull Request: https://github.com/ollama/ollama/pull/11685

State: closed
Merged: Yes


KV cache quantization depends on the flash attention kernel. We currently cannot use flash attention with gpt-oss because it requires additional operations.

The model definition never calls flash attention, so inference works regardless of the setting, but the cache still picks up the quantization type. This change resolves the flash attention setting earlier in the loading flow so that all downstream settings derived from it are also set correctly.
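A minimal sketch of the ordering the fix establishes. All names here (`ModelCaps`, `LoadOptions`, `resolveOptions`) are hypothetical and do not match the actual ollama code; the point is only that the flash attention flag is corrected before the KV cache type is derived from it:

```go
// Hypothetical sketch of the loading-flow ordering; names are
// illustrative, not the real ollama API.
package main

import "fmt"

type ModelCaps struct {
	SupportsFlashAttention bool // gpt-oss: false, it needs extra ops
}

type LoadOptions struct {
	FlashAttention bool
	KVCacheType    string // e.g. "f16", "q8_0", "q4_0"
}

// resolveOptions runs early in the loading flow, so every setting
// derived from FlashAttention sees the corrected value.
func resolveOptions(caps ModelCaps, opts LoadOptions) LoadOptions {
	if !caps.SupportsFlashAttention {
		opts.FlashAttention = false
	}
	// KV cache quantization depends on the flash attention kernel;
	// without it, fall back to the unquantized f16 cache.
	if !opts.FlashAttention {
		opts.KVCacheType = "f16"
	}
	return opts
}

func main() {
	gptOSS := ModelCaps{SupportsFlashAttention: false}
	opts := resolveOptions(gptOSS,
		LoadOptions{FlashAttention: true, KVCacheType: "q8_0"})
	fmt.Printf("flash attention: %v, kv cache: %s\n",
		opts.FlashAttention, opts.KVCacheType)
	// Output: flash attention: false, kv cache: f16
}
```

Before the fix, the equivalent of `resolveOptions` ran too late: the cache had already been created with the quantized type even though flash attention had been disabled.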

Fixes: #11671

GiteaMirror added the pull-request label 2026-04-13 00:31:06 -05:00

Reference: github-starred/ollama#13604