[PR #5894] [CLOSED] feat: K/V cache quantisation (massive vRAM improvement!) #43205

Closed
opened 2026-04-24 22:52:53 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/5894
Author: @sammcj
Created: 7/23/2024
Status: Closed

Base: main ← Head: feature/kv-quant


📝 Commits (10+)

  • a872c90 feat: allow setting KV cache type, remove deprecated F16KV
  • 41f49f6 Merge branch 'ollama:main' into feature/kv-quant
  • c75d88a Merge branch 'ollama:main' into feature/kv-quant
  • 33f94ca Merge branch 'ollama:main' into feature/kv-quant
  • 141a7c1 align with new linting settings upstream
  • 66c894f Merge branch 'ollama:main' into feature/kv-quant
  • 67ef34a Merge branch 'main' into feature/kv-quant
  • f2dcf42 resolve conflicts from upstream
  • 3b32812 resolve conflicts from upstream
  • 4257809 Merge branch 'ollama:main' into feature/kv-quant

📊 Changes

21 files changed (+212 additions, -86 deletions)


📝 api/types.go (+13 -12)
📝 cmd/cmd.go (+2 -0)
📝 cmd/interactive.go (+2 -0)
📝 docs/api.md (+7 -3)
📝 docs/faq.md (+24 -1)
📝 envconfig/config.go (+6 -0)
📝 llm/memory.go (+30 -4)
📝 llm/memory_test.go (+2 -2)
📝 llm/server.go (+47 -6)
📝 llm/status.go (+1 -0)
📝 parser/parser_test.go (+4 -3)
📝 scripts/install.sh (+6 -6)
📝 server/images.go (+10 -10)
📝 server/layer.go (+14 -14)
📝 server/manifest.go (+8 -8)
📝 server/model.go (+1 -1)
📝 server/routes.go (+1 -1)
📝 server/routes_delete_test.go (+1 -1)
📝 server/sched.go (+23 -8)
📝 server/sched_test.go (+8 -4)

...and 1 more file

📄 Description

THIS PR HAS MOVED TO https://github.com/ollama/ollama/pull/6279


This PR introduces optional K/V (context) cache quantisation.

In addition, the deprecated F16KV parameter has been removed. A user who wishes for some reason to run the K/V cache at f32 can provide that as an option.

Impact

  • With defaults (f16) - none, behaviour is the same as the current defaults.
  • With q8_0
    • The K/V context cache will consume 1/2 the vRAM (!)
    • A very small loss in quality within the cache
  • With q4_0
    • The K/V context cache will consume 1/4 the vRAM (!!)
    • A small/medium loss in quality within the cache
    • For example, loading llama3.1 8b with a 32K context reduces the cache's vRAM usage from 4GB to 1.1GB
  • The other types (q4_1 through q5_1) fall in between.
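The figures above can be checked with some rough arithmetic. The sketch below assumes the published Llama 3.1 8B shape (32 layers, 8 K/V heads, head dimension 128) and the ggml block layouts (32 values per block, e.g. 34 bytes for q8_0 and 18 bytes for q4_0). The function name and lookup table are illustrative and are not code from this PR; only the K and V tensors are counted, with no other buffers.

```python
# Rough K/V cache size estimate; a sketch, not code from this PR.
# Assumption: ggml block layouts, where each block holds 32 values.
BYTES_PER_BLOCK = {
    "f32": 128,    # 32 values * 4 bytes
    "f16": 64,     # 32 values * 2 bytes
    "q8_0": 34,    # 2-byte scale + 32 one-byte values
    "q5_1": 24,
    "q5_0": 22,
    "q4_1": 20,
    "q4_0": 18,    # 2-byte scale + 32 four-bit values
    "iq4_nl": 18,
}
BLOCK_SIZE = 32


def kv_cache_bytes(cache_type, n_ctx, n_layers, n_kv_heads, head_dim):
    """Bytes needed to hold both K and V for n_ctx tokens."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # K and V
    return elements // BLOCK_SIZE * BYTES_PER_BLOCK[cache_type]


# Assumed shape for Llama 3.1 8B: 32 layers, 8 K/V heads, head dim 128.
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(cache_type, n_ctx=32 * 1024, **LLAMA31_8B)
        print(f"{cache_type}: {size / 2**30:.3f} GiB")
```

Under these assumptions the estimate is 4.000 GiB for f16, 2.125 GiB for q8_0 and 1.125 GiB for q4_0, which is consistent with the 4GB to 1.1GB example. The ratios sit slightly above 1/2 and 1/4 because every block also stores a scale.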

Additional quantisation types supported by llama.cpp and this PR (whether they work may depend on the quantisation of the model you're running):

q5_1, q5_0, q4_1, iq4_nl
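To make the set of accepted values concrete, here is a minimal sketch of how a cache type string could be validated, falling back to f16 when nothing is set. The function name, the fallback on empty input, and the error text are assumptions made for illustration; the PR's actual handling lives in its Go sources and is not reproduced here.

```python
# Hypothetical validation of a K/V cache type string (not the PR's code).
# The type names are the ones listed in this description.
VALID_KV_CACHE_TYPES = (
    "f32", "f16", "q8_0", "q5_1", "q5_0", "q4_1", "q4_0", "iq4_nl",
)
DEFAULT_KV_CACHE_TYPE = "f16"  # the description states f16 is the default


def normalise_kv_cache_type(value):
    """Return a valid cache type name, or raise ValueError if unknown."""
    if value is None or value.strip() == "":
        return DEFAULT_KV_CACHE_TYPE
    name = value.strip().lower()
    if name not in VALID_KV_CACHE_TYPES:
        raise ValueError(f"unsupported K/V cache type: {value!r}")
    return name
```

Rejecting unknown names up front, rather than silently falling back, is one possible design choice; it makes a typo such as `q8` visible instead of quietly running at f16.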

Screenshots

Example of estimated (v)RAM savings - f16 (q8_0,q4_0)

image

f16

kv_cache_f16

q4_0

kv_cache_q4_0

q8_0

kv_cache_q8_0

Performance

llama.cpp published some perplexity measurements (they date from May; judging by the commits since then, things have likely improved further, and CUDA graphs were later fixed): https://github.com/ggerganov/llama.cpp/pull/7412#issuecomment-2120427347

As far as I can tell (at least with q6_k quants) there isn't much of a noticeable hit to performance.
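For readers unfamiliar with the metric used in the measurements linked above: perplexity is the exponential of the mean negative log-likelihood per token, so lower is better, and a cache type that barely moves it is barely changing the model's predictions. A minimal sketch follows; the log-probabilities are made up for illustration and are not measurements.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative values only: every token predicted with probability 0.25,
# which gives a perplexity of approximately 4.
print(perplexity([math.log(0.25)] * 4))
```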


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Related links:

  • Fixes https://github.com/ollama/ollama/issues/5091
  • Related discussion in llama.cpp - https://github.com/ggerganov/llama.cpp/discussions/5932
  • (Note that ExllamaV2 has a similar feature - https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md)

Screenshot images:

  • Estimated (v)RAM savings: https://github.com/user-attachments/assets/a3520770-7b31-40c7-b45b-4aad6db9b117
  • kv_cache_f16: https://github.com/user-attachments/assets/af0a3b40-70e2-47f1-90b0-6ecd09dc59df
  • kv_cache_q4_0: https://github.com/user-attachments/assets/47ba6578-1f5b-4091-8594-f63ecfada49e
  • kv_cache_q8_0: https://github.com/user-attachments/assets/c7c09e62-4b54-4536-9617-6b00b1af6f94
GiteaMirror added the pull-request label 2026-04-24 22:52:53 -05:00
Reference: github-starred/ollama#43205