[GH-ISSUE #10794] Support differentiated quantization of the KV cache #53599

Open
opened 2026-04-29 04:06:14 -05:00 by GiteaMirror · 4 comments

Originally created by @stasm on GitHub (May 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10794

According to https://arxiv.org/abs/2502.15075, keys can be more sensitive to aggressive quantization than values. However, Ollama currently only offers setting the same quantization type for both keys and values, via OLLAMA_KV_CACHE_TYPE. llama.cpp, by contrast, supports setting separate quantization types for the keys and values of the KV cache via its --cache-type-k and --cache-type-v flags.

Rather than adding new environment variables to control keys and values separately, I suggest extending the existing OLLAMA_KV_CACHE_TYPE variable like so:

  • Canonically, the value will be of the form type_k:type_v, e.g. f16:f16 or q8_0:q4_0, allowing each type to be set separately.
  • The current single-value format will be kept as a shorthand for setting both to the same type, for backwards compatibility with the current behavior; i.e. q8_0 will be normalized to q8_0:q8_0 (a parsing sketch follows this list).
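
To make the proposal concrete, here is a minimal parsing sketch in Go (Ollama's implementation language). The package, function name, and type whitelist are illustrative, not existing Ollama code:

```go
package kvcache

import (
	"fmt"
	"strings"
)

// validTypes is an illustrative whitelist; the real set would follow
// whatever cache types the server already accepts.
var validTypes = map[string]bool{
	"f32":  true,
	"f16":  true,
	"q8_0": true,
	"q4_0": true,
}

// parseKVCacheType splits an OLLAMA_KV_CACHE_TYPE value into separate key
// and value cache types. A single type such as "q8_0" is shorthand for
// "q8_0:q8_0", preserving the current behavior.
func parseKVCacheType(s string) (typeK, typeV string, err error) {
	parts := strings.SplitN(s, ":", 2)
	typeK, typeV = parts[0], parts[0]
	if len(parts) == 2 {
		typeV = parts[1]
	}
	for _, t := range []string{typeK, typeV} {
		if !validTypes[t] {
			return "", "", fmt.Errorf("unsupported KV cache type %q", t)
		}
	}
	return typeK, typeV, nil
}
```

With this, OLLAMA_KV_CACHE_TYPE=q8_0:q4_0 would quantize keys at 8 bits and values at 4 bits, while a plain OLLAMA_KV_CACHE_TYPE=q8_0 keeps today's behavior.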
GiteaMirror added the feature request label 2026-04-29 04:06:14 -05:00

@stasm commented on GitHub (May 21, 2025):

For my local setup, I put together a WIP at https://github.com/ollama/ollama/compare/main...stasm:ollama:kv-cache-differentiated-quantization. If the feature request is accepted, I'll be happy to polish it and submit a PR.


@ccebelenski commented on GitHub (May 22, 2025):

This seems like a sensible approach, especially since it's already supported by the underlying llama.cpp.
Further questions, but probably not for this project:
  • Should it support 2-bit quants? (arguably more useful for split quants)
  • Should there be support for HQQ vs. Quanto?
  • What about q_group_size and residual_length tuning?

What would be very useful is per-model tuning: being able to select the best options as part of the Modelfile, rather than globally on the server. For example, I might need more accuracy for agent calling than for general chat.
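
Purely as a hypothetical sketch of that idea (FROM and PARAMETER are real Modelfile directives, but the kv_cache_type parameter below is invented for illustration and does not exist in Ollama today):

```
# Hypothetical Modelfile; kv_cache_type is an invented parameter name.
FROM llama3.1:8b
# More accurate keys, smaller values, for this model only.
PARAMETER kv_cache_type q8_0:q4_0
```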


@rainlycoris commented on GitHub (Aug 28, 2025):

May I ask if it would be possible to get support for q5_1 quantization? Since the gap between q8_0 and q4_0 is quite significant, q5_1 seems to be an ideal compromise—especially as llama.cpp also supports it. Combining q8_0 for the K and q5_1 for the V sounds like a very promising option.
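
For comparison, llama.cpp already accepts that combination through its separate flags, and the proposal above could express it in one value. The Ollama form is not implemented yet, and llama.cpp may additionally require flash attention for a quantized V cache:

```sh
# llama.cpp: separate flags for key and value cache types
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q5_1

# Ollama, under the proposed extended syntax (hypothetical)
OLLAMA_KV_CACHE_TYPE=q8_0:q5_1 ollama serve
```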


@rainlycoris commented on GitHub (Sep 4, 2025):

In llama.cpp, serious performance issues occur on the M1 Pro whenever the K and V cache types differ (roughly half the token generation speed), so we might need to wait a bit longer?

Reference: github-starred/ollama#53599