[PR #6276] [CLOSED] feat: K/V cache quantisation (massive vRAM improvement!) #43315

Closed
opened 2026-04-24 22:57:39 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/6276
Author: @sammcj
Created: 8/9/2024
Status: Closed

Base: main ← Head: feature/kv-quant


📝 Commits (2)

  • acd571b fix: improved cache type estimations
  • 9b31d8f Merge branch 'main' into feature/kv-quant

📊 Changes

23 files changed (+161 additions, -40 deletions)

View changed files

📝 api/types.go (+13 -12)
📝 app/assets/app.ico (+0 -0)
📝 cmd/cmd.go (+2 -0)
📝 cmd/interactive.go (+2 -0)
📝 docs/api.md (+3 -2)
📝 docs/faq.md (+24 -1)
📝 envconfig/config.go (+6 -0)
📝 examples/modelfile-mario/logo.png (+0 -0)
📝 llm/memory.go (+30 -4)
📝 llm/memory_test.go (+2 -2)
📝 llm/server.go (+47 -6)
📝 macapp/assets/icon.icns (+0 -0)
📝 macapp/assets/iconDarkTemplate.png (+0 -0)
📝 macapp/assets/iconDarkTemplate@2x.png (+0 -0)
📝 macapp/assets/iconDarkUpdateTemplate.png (+0 -0)
📝 macapp/assets/iconDarkUpdateTemplate@2x.png (+0 -0)
📝 macapp/assets/iconTemplate.png (+0 -0)
📝 macapp/assets/iconTemplate@2x.png (+0 -0)
📝 macapp/assets/iconUpdateTemplate.png (+0 -0)
📝 macapp/assets/iconUpdateTemplate@2x.png (+0 -0)

...and 3 more files

📄 Description

This PR introduces optional K/V (context) cache quantisation.

(PR recreated after GitHub broke https://github.com/ollama/ollama/pull/5894 🤦)

In addition, the deprecated F16KV parameter has been removed; if a user wishes for some reason to run the K/V cache at f32, they can provide that as an option.
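Since the diff touches envconfig/config.go, the cache type appears to be configurable at the server level. A minimal sketch of how such a setting might be read, assuming (hypothetically) an environment variable named OLLAMA_KV_CACHE_TYPE with an f16 default; the actual name and default in this PR's diff may differ:

```go
// Hypothetical sketch only: the variable name OLLAMA_KV_CACHE_TYPE and the
// f16 fallback are assumptions, not copied from this PR's envconfig/config.go.
package envconfig

import (
	"os"
	"strings"
)

// KvCacheType returns the requested K/V cache quantisation type,
// defaulting to f16 so existing behaviour is unchanged when unset.
func KvCacheType() string {
	if v := strings.ToLower(strings.TrimSpace(os.Getenv("OLLAMA_KV_CACHE_TYPE"))); v != "" {
		return v
	}
	return "f16"
}
```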

Impact

  • With the default (f16): no impact; behaviour is the same as the current defaults.
  • With q8_0
    • The K/V context cache will consume 1/2 the vRAM (!)
    • A very small loss in quality within the cache
  • With q4_0
    • the K/V context cache will consume 1/4 the vRAM (!!)
    • A small/medium loss in quality within the cache
    • For example, loading llama3.1 8b with a 32K context drops the cache's vRAM usage from 4GB to 1.1GB (a back-of-the-envelope calculation is sketched after this list)
  • The other supported types (q4_1 through q5_1) fall in between.
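To put numbers on the llama3.1 8b example, here is a rough sizing sketch. The architecture figures (32 layers, 8 KV heads, head dimension 128) come from the model's published config, not from this PR; the q8_0/q4_0 sizes include their per-block scales.

```go
// Back-of-the-envelope K/V cache sizing for llama3.1 8b at 32K context.
package main

import "fmt"

// kvCacheBytes returns the total cache size: K and V each hold
// ctx * kvHeads * headDim elements per layer.
func kvCacheBytes(layers, ctx, kvHeads, headDim int, bytesPerElem float64) float64 {
	return 2 * float64(layers*ctx*kvHeads*headDim) * bytesPerElem
}

func main() {
	const gib = 1 << 30
	layers, ctx, kvHeads, headDim := 32, 32768, 8, 128

	types := []struct {
		name string
		bpe  float64 // effective bytes per element
	}{
		{"f16", 2.0},        // 16 bits per element
		{"q8_0", 34.0 / 32}, // 32 one-byte values + 2-byte scale per 32-element block
		{"q4_0", 18.0 / 32}, // 16 bytes of packed 4-bit values + 2-byte scale per block
	}
	for _, t := range types {
		fmt.Printf("%-4s ≈ %.1f GiB\n", t.name, kvCacheBytes(layers, ctx, kvHeads, headDim, t.bpe)/gib)
	}
}
```

This prints roughly 4.0 GiB for f16, 2.1 GiB for q8_0 and 1.1 GiB for q4_0, which lines up with the 4GB → 1.1GB example above.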

Additional cache quantisations supported by llama.cpp and this PR (which one is suitable may depend on the quantisation of the model you're running):

q5_1, q5_0, q4_1, iq4_nl

  • Fixes https://github.com/ollama/ollama/issues/5091
  • Related discussion in llama.cpp: https://github.com/ggerganov/llama.cpp/discussions/5932
  • (Note that ExllamaV2 has a similar feature: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md)
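For context, llama.cpp exposes the cache type via its --cache-type-k/--cache-type-v arguments; below is a rough sketch of how a requested type could be validated and passed to the runner. The actual wiring in this PR's llm/server.go may be structured differently.

```go
// Rough illustration only; the validation set mirrors the types listed above,
// and --cache-type-k/--cache-type-v are llama.cpp's argument names, but the
// real llm/server.go changes in this PR may differ.
package main

import "fmt"

func kvCacheFlags(cacheType string) ([]string, error) {
	supported := map[string]bool{
		"f32": true, "f16": true, "q8_0": true,
		"q5_1": true, "q5_0": true, "q4_1": true, "q4_0": true, "iq4_nl": true,
	}
	if !supported[cacheType] {
		return nil, fmt.Errorf("unsupported K/V cache type: %q", cacheType)
	}
	return []string{"--cache-type-k", cacheType, "--cache-type-v", cacheType}, nil
}

func main() {
	args, err := kvCacheFlags("q8_0")
	if err != nil {
		panic(err)
	}
	fmt.Println(args) // [--cache-type-k q8_0 --cache-type-v q8_0]
}
```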

Screenshots

Example of estimated (v)RAM savings - f16 (q8_0,q4_0)

[image: estimated memory comparison for f16, q8_0 and q4_0]

f16

[image: kv_cache_f16]

q4_0

[image: kv_cache_q4_0]

q8_0

[image: kv_cache_q8_0]

Performance

The llama.cpp project published some perplexity measurements (though, judging by the commits, things have likely improved further since May when those were taken, and CUDA graphs were fixed afterwards): https://github.com/ggerganov/llama.cpp/pull/7412#issuecomment-2120427347

As far as I can tell (at least with q6_k quants) there isn't much of a noticeable hit to performance.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-24 22:57:39 -05:00