[PR #11929] gpt-oss: disable quantized kv cache #13662

Closed
opened 2026-04-13 00:32:13 -05:00 by GiteaMirror · 0 comments
Owner

Original Pull Request: https://github.com/ollama/ollama/pull/11929

State: closed
Merged: Yes


Quantized KV cache for gpt-oss is much slower than the regular f16 cache type because the model uses attention with sinks. This isn't supported on backends such as CUDA, which forces it onto the CPU, dramatically reducing performance.
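
As a rough illustration of the behavior described above, the sketch below falls back to `f16` whenever a quantized cache type is requested for an architecture that uses attention sinks. It is not the code from this PR: `effectiveKVCacheType`, `usesAttentionSinks`, `isQuantized`, and the architecture strings are hypothetical names chosen for the example, and treating any type starting with `q` (such as `q8_0` or `q4_0`) as quantized is a simplifying assumption.

```go
package main

import "strings"

// effectiveKVCacheType returns the KV cache type that should actually be used.
// Hypothetical helper: the name, signature, and architecture strings are
// assumptions made for this sketch, not Ollama's real API.
func effectiveKVCacheType(arch, requested string) string {
	// Treat an unset cache type as the regular f16 cache.
	if requested == "" {
		return "f16"
	}
	// Quantized cache plus attention sinks may be unsupported on the GPU
	// backend, so fall back to f16 rather than letting the work land on the CPU.
	if usesAttentionSinks(arch) && isQuantized(requested) {
		return "f16"
	}
	return requested
}

// isQuantized reports whether the cache type looks like a quantized format.
// Assumption: quantized types are the ones whose name starts with "q".
func isQuantized(cacheType string) bool {
	return strings.HasPrefix(strings.ToLower(cacheType), "q")
}

// usesAttentionSinks reports whether the architecture uses attention with sinks.
// Assumption: only gpt-oss is listed here, under two illustrative spellings.
func usesAttentionSinks(arch string) bool {
	switch arch {
	case "gptoss", "gpt-oss":
		return true
	}
	return false
}
```

With this sketch, `effectiveKVCacheType("gptoss", "q8_0")` yields `"f16"`, while the same request for an architecture without sinks is passed through unchanged.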

GiteaMirror added the pull-request label 2026-04-13 00:32:13 -05:00
Reference: github-starred/ollama#13662