[PR #6279] [MERGED] feat: Introduce K/V Context Quantisation (vRAM improvements) #10807

Closed
opened 2025-11-12 15:37:06 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/6279
Author: @sammcj
Created: 8/9/2024
Status: Merged
Merged: 12/3/2024
Merged by: @jmorganca

Base: main ← Head: feature/kv-quant


📝 Commits (7)

  • 7eeeb5a feat: enable k/v cache quantisation
  • a5e6e64 Update llama/llama.go
  • 4cf0ab7 feat: enable k/v cache quantisation
  • d5d6b83 feat: enable k/v cache quantisation
  • e3a9e64 feat: enable k/v cache quantisation
  • 1554db4 Merge branch 'ollama:main' into feature/kv-quant
  • 9c618f7 feat: enable k/v cache quantisation

📊 Changes

10 files changed (+147 additions, -21 deletions)

View changed files

📝 cmd/cmd.go (+1 -0)
📝 discover/types.go (+14 -0)
📝 docs/faq.md (+26 -2)
📝 envconfig/config.go (+3 -0)
📝 llama/llama.go (+20 -1)
📝 llama/runner/runner.go (+4 -2)
📝 llm/ggml.go (+34 -2)
📝 llm/memory.go (+17 -4)
📝 llm/memory_test.go (+1 -0)
📝 llm/server.go (+27 -10)

📄 Description

This PR introduces optional K/V (context) cache quantisation.

TL;DR: Set your k/v cache to Q8_0 and use 50% less vRAM with no noticeable quality impact.

Ollama is arguably the only remaining popular model server that does not support this.

This PR brings Ollama's K/V memory usage in line with the likes of ExLlamav2, MistralRS, MLX, vLLM, and those using llama.cpp directly, all of which have supported this for some time.

  • The scheduler has been updated to take quantised K/V estimates into account (a sketch of how such an estimate can be derived follows this list).
  • Documentation added in the FAQ.
  • Refactored (many times) over the months since July to fix various merge conflicts and adapt to the new runners/cgo implementation.
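
Not code from this PR's diff, but a minimal sketch, in Go to match the repo, of how a per-element cost for each K/V cache type can be derived from llama.cpp/GGML's storage layouts; the function name and structure here are illustrative assumptions, not the names used in llm/memory.go.

```go
package main

import "fmt"

// Approximate storage cost of one K/V cache element per cache type,
// based on GGML's block layouts: f16 stores 2 bytes per value, q8_0
// packs 32 values into 34 bytes (32x int8 plus an fp16 scale), and
// q4_0 packs 32 values into 18 bytes (16 bytes of nibbles plus an
// fp16 scale).
func kvBytesPerElement(cacheType string) float64 {
	switch cacheType {
	case "q8_0":
		return 34.0 / 32.0 // ~1.06 bytes
	case "q4_0":
		return 18.0 / 32.0 // ~0.56 bytes
	default: // "f16"
		return 2.0
	}
}

func main() {
	for _, t := range []string{"f16", "q8_0", "q4_0"} {
		b := kvBytesPerElement(t)
		fmt.Printf("%-5s %.2f bytes/element (%.0f%% of f16)\n", t, b, 100*b/2.0)
	}
	// Prints roughly 100%, 53% and 28% of the f16 cost, which is where
	// the "roughly half" and "roughly a quarter" vRAM figures come from.
}
```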

I've been running from this branch with q8_0 since I raised the original PR on 2024-07-24. It's been stable and unlocks a lot of models I wouldn't otherwise be able to run with a decent context size.

For future reference: llama.cpp's perplexity benchmarks are scattered all over the place, and things have improved since these were run, but to give you an idea: https://github.com/ggerganov/llama.cpp/pull/7412


Context:

Without K/V context cache quantisation every user is likely wasting vast amounts of (v)RAM - or simply thinking they're not able to run larger models or context sizes due to their available (v)RAM.

LLM Servers that support K/V context quantisation:

  • ✅ llama.cpp
  • ✅ exllamav2 (along with TabbyAPI)
  • ✅ MLX
  • ✅ Mistral.RS
  • ✅ vLLM
  • ✅ Transformers
  • ❌ Ollama

As things currently stand with Ollama, your only options are to:

  • Use another LLM server.
  • Not run the models/context sizes you want.
  • Build and run Ollama from the branch in this PR.

None of those are ideal, and over the second half of this year I've spoken with a lot of folks who are building and running Ollama from this feature branch, which has pushed me to keep it updated with frequent rebasing and refactoring while awaiting review.

This is not sustainable, and I would like to close this PR off so that people no longer have to rely on my fork.


PR recreated after Github broke https://github.com/ollama/ollama/pull/5894

Impact

  • With defaults (f16): none; behaviour is the same as the current defaults.
  • With q8_0: the K/V context cache will consume 1/2 the vRAM (!), with a very small loss in quality within the cache.
  • With q4_0: the K/V context cache will consume 1/4 the vRAM (!!), with a small/medium loss in quality within the cache. For example, loading llama3.1 8b with a 32K context drops the cache's vRAM usage from 4GB to 1.1GB (worked through in the sketch below).
  • q4_1 through q5_1 sit in between, but Ollama does not currently support the other llama.cpp quantisation types (q5_1, q5_0, q4_1, iq4_nl).
  • Fixes https://github.com/ollama/ollama/issues/5091
  • Related discussion in llama.cpp: https://github.com/ggerganov/llama.cpp/discussions/5932
  • Note that ExllamaV2 has a similar feature: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md
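
The 4GB to 1.1GB figure above can be reproduced with back-of-the-envelope arithmetic. The sketch below is not code from this PR; it assumes llama3.1 8b's attention shape (32 layers, 8 K/V heads, head dimension 128, i.e. GQA) and the GGML per-element costs sketched earlier.

```go
package main

import "fmt"

// Assumed attention shape for llama3.1 8b (GQA): 32 layers,
// 8 K/V heads, head dimension 128.
const (
	layers  = 32
	kvHeads = 8
	headDim = 128
	ctxLen  = 32 * 1024 // 32K context
)

// Approximate bytes per cached element for each K/V cache type,
// from GGML's block layouts.
func bytesPerElement(cacheType string) float64 {
	switch cacheType {
	case "q8_0":
		return 34.0 / 32.0
	case "q4_0":
		return 18.0 / 32.0
	default: // "f16"
		return 2.0
	}
}

func main() {
	// One K vector and one V vector of size kvHeads*headDim per layer, per token.
	elementsPerToken := 2 * layers * kvHeads * headDim
	for _, t := range []string{"f16", "q8_0", "q4_0"} {
		bytes := float64(elementsPerToken) * float64(ctxLen) * bytesPerElement(t)
		fmt.Printf("%-5s K/V cache at 32K context ≈ %.1f GiB\n", t, bytes/(1<<30))
	}
	// f16 lands at ~4.0 GiB and q4_0 at ~1.1 GiB, matching the drop
	// described in the q4_0 bullet above.
}
```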

Screenshots

Example of estimated (v)RAM savings

The numbers within each column (F16 (Q8_0,Q4_0)) are how much (v)RAM is required to run the model at the given K/V cache quant type.

For example: 30.8(22.8,18.8) would mean:

  • 30.8GB for F16 K/V
  • 22.8GB for Q8_0 K/V
  • 18.8GB for Q4_0 K/V (a quick consistency check on these numbers follows the screenshot note below)
(screenshot: SCR-20241116-haow; generated via ingest or gollama)
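
As a quick consistency check (not part of the PR): if the tool behind the screenshot treats the Q8_0 and Q4_0 caches as exactly 1/2 and 1/4 of the F16 cache, each column should decompose into a fixed weights size plus a scaled cache size. The Go snippet below back-solves that split for the 30.8(22.8,18.8) example; the resulting 14.8GB/16GB breakdown is derived here for illustration and is not stated in the PR.

```go
package main

import "fmt"

func main() {
	// Example column from above: F16 (Q8_0, Q4_0) totals in GB.
	f16, q8, q4 := 30.8, 22.8, 18.8

	// Assume total = weights + cache*scale with scale = 1, 1/2, 1/4.
	// The F16 and Q8_0 totals differ by half the F16-sized cache, so:
	cacheF16 := 2 * (f16 - q8) // 16.0 GB of K/V cache at F16
	weights := f16 - cacheF16  // 14.8 GB of weights

	predictedQ4 := weights + cacheF16/4
	fmt.Printf("weights ≈ %.1f GB, F16 K/V cache ≈ %.1f GB\n", weights, cacheF16)
	fmt.Printf("predicted Q4_0 total ≈ %.1f GB (column lists %.1f GB)\n", predictedQ4, q4)
	// The prediction matches the listed 18.8 GB, so the column is
	// internally consistent with the half/quarter rule of thumb.
}
```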

f16

(screenshot: kv_cache_f16)

q4_0

(screenshot: kv_cache_q4_0)

q8_0

(screenshot: kv_cache_q8_0)

Performance

llama.cpp perplexity measurements: https://github.com/ggerganov/llama.cpp/pull/7412#issuecomment-2120427347 - note that things have improved even further since these measurements, which are now quite dated.


I broke down why this is important in a conversation with someone recently:

Let's say you're running a model; its (v)RAM usage is determined by two things:

  • The size of the model (params, quant type)
  • The size of your context

Let's assume:

  • Your model is 7b at q4_k_m or something and takes up 7GB of memory.
  • You're working with a small code repo, or a few Obsidian documents that are around 30-40K tokens in total.

Your memory usage might look like this:

  • 7GB for the model
  • 5GB for the context
  • = 12GB of memory required

With q8 quantised k/v, that becomes (reproduced in the sketch below):

  • 7GB for the model
  • 2.5GB for the context
  • = 9.5GB of memory required
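
A short sketch of that arithmetic, with the assumptions spelled out: the 7B model is given a Mistral-7B-like attention shape (32 layers, 8 K/V heads, head dimension 128), the context is taken as ~40K tokens, and q8_0 is costed at 34/64 of f16 per GGML's block layout. None of these numbers come from this PR; they simply reproduce the quoted totals.

```go
package main

import "fmt"

func main() {
	// Assumed 7B model shape (Mistral-7B-like): 32 layers, 8 K/V heads,
	// head dimension 128 -> 2*32*8*128 = 65536 cached elements per token,
	// or 128 KiB per token at f16 (2 bytes/element).
	const bytesPerTokenF16 = 2 * 32 * 8 * 128 * 2
	const contextTokens = 40 * 1024 // ~40K tokens of code/documents
	const modelGiB = 7.0            // the quoted q4_k_m weight size

	f16Cache := float64(bytesPerTokenF16) * contextTokens / (1 << 30)
	q8Cache := f16Cache * 34.0 / 64.0 // q8_0 stores 34 bytes where f16 stores 64 (per 32-element block)

	fmt.Printf("f16 cache:  %.1f GiB -> %.1f GiB total\n", f16Cache, modelGiB+f16Cache)
	fmt.Printf("q8_0 cache: %.1f GiB -> %.1f GiB total\n", q8Cache, modelGiB+q8Cache)
	// Roughly 7 + 5 = 12 GiB at f16 versus 7 + ~2.7 ≈ 9.7 GiB at q8_0,
	// in line with the ~9.5GB quoted above.
}
```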

If you go here: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator

  • Select exl2 (exllamav2 models)
  • Enter a model like Qwen/Qwen2.5-Coder-32B-Instruct
  • Enter a context size that would commonly be used for coding, e.g. 32k, or maybe 64k

Note the calculated memory requirements at full f16, then try q8 or even q4 (exllama is very good at this, and q4 has essentially no loss).


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
