[PR #15090] kvcache: add TurboQuant rotation-enhanced KV cache compression #46263

Open
opened 2026-04-25 01:45:10 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15090
Author: @YKesX
Created: 3/27/2026
Status: 🔄 Open

Base: main ← Head: kvcache/turboquant-rotation-compression


📝 Commits (7)

  • 9d19dc3 kvcache: add TurboQuant rotation-enhanced KV cache compression
  • 64f5037 Merge branch 'ollama:main' into kvcache/turboquant-rotation-compression
  • 8195786 Merge branch 'ollama:main' into kvcache/turboquant-rotation-compression
  • 60c2bef kvcache: wire turboquant cache into new engine path
  • a1905cb Merge branch 'ollama:main' into kvcache/turboquant-rotation-compression
  • 0a767fb kvcache: add TurboQuant KV cache compression (FWHT + Lloyd-Max)
  • f89b9f4 Merge branch 'ollama:main' into kvcache/turboquant-rotation-compression

📊 Changes

34 files changed (+2806 additions, -5 deletions)

View changed files

📝 .gitignore (+3 -0)
📝 fs/ggml/ggml.go (+10 -1)
📝 kvcache/cache.go (+8 -0)
📝 kvcache/recurrent.go (+8 -1)
➕ kvcache/turboquant.go (+207 -0)
➕ kvcache/turboquant_test.go (+50 -0)
📝 llama/llama.go (+3 -0)
📝 llm/server.go (+7 -0)
📝 ml/backend.go (+27 -0)
📝 ml/backend/ggml/ggml.go (+52 -0)
📝 ml/backend/ggml/ggml/include/ggml.h (+56 -0)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.c (+20 -0)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ops.cpp (+520 -0)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ops.h (+4 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/fwht.cu (+152 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/fwht.cuh (+5 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu (+23 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/lloyd-max.cu (+200 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/lloyd-max.cuh (+6 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-decompress.cu (+147 -0)

...and 14 more files

📄 Description

Add support for OLLAMA_KV_CACHE_TYPE=tq3 and tq4, which apply a randomized Hadamard rotation to key vectors before storing them in the quantized KV cache. This is the core technique from Google's TurboQuant paper (arXiv:2504.19874, ICLR 2026): the rotation distributes information uniformly across coordinates, eliminating the outlier channels that cause standard block quantization to fail.
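
For intuition, here is a minimal Go sketch of a randomized Hadamard rotation (a random ±1 diagonal followed by the fast Walsh-Hadamard transform). It is illustrative only, not the PR's actual kernels (those live in ops.cpp and fwht.cu as ggml ops), and it assumes the vector length is a power of two, which typical attention head dimensions such as 64 or 128 satisfy:

```go
package turboquant

import (
	"math"
	"math/rand"
)

// randomizedHadamard applies (1/sqrt(n))*H*D to v in place: flip each
// coordinate's sign with a fixed random diagonal D, then run the fast
// Walsh-Hadamard transform H. len(v) must be a power of two.
func randomizedHadamard(v, signs []float32) {
	n := len(v)
	for i := range v {
		v[i] *= signs[i] // random diagonal D: breaks up outlier channels
	}
	// In-place FWHT: log2(n) butterfly passes over strides 1, 2, 4, ...
	for h := 1; h < n; h *= 2 {
		for i := 0; i < n; i += 2 * h {
			for j := i; j < i+h; j++ {
				x, y := v[j], v[j+h]
				v[j], v[j+h] = x+y, x-y
			}
		}
	}
	// The 1/sqrt(n) scaling makes the rotation orthonormal; the inverse
	// runs the scaled FWHT first and then reapplies the same signs.
	s := float32(1 / math.Sqrt(float64(n)))
	for i := range v {
		v[i] *= s
	}
}

// newSigns draws a fixed ±1 diagonal; writer and reader must share the
// seed so the rotation is reproducible across store and load.
func newSigns(n int, seed int64) []float32 {
	r := rand.New(rand.NewSource(seed))
	signs := make([]float32, n)
	for i := range signs {
		signs[i] = 1
		if r.Intn(2) == 1 {
			signs[i] = -1
		}
	}
	return signs
}
```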

The rotation wraps the existing cache at the kvcache.Cache interface level, making it transparent to all model architectures. The inner cache uses Q4_0 for storage, providing ~4x memory compression over F16. For the legacy llamarunner engine, tq3/tq4 fall back to standard Q4_0 with a log message.
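A rough sketch of that decorator pattern is below. The stand-in interface is deliberately minimal (the real kvcache.Cache has more methods), and rotate/unrotate are hypothetical helpers applying the transform from the previous sketch along the head dimension:

```go
package turboquant

// Tensor is a placeholder for the engine's tensor type (ml.Tensor in
// ollama); rotate and unrotate are stubs standing in for the forward
// and inverse randomized Hadamard transforms.
type Tensor interface{}

func rotate(k Tensor) Tensor   { return k } // stub: forward rotation
func unrotate(k Tensor) Tensor { return k } // stub: inverse rotation

// innerCache is a minimal stand-in for the subset of kvcache.Cache
// used here; the real interface is larger.
type innerCache interface {
	Put(key, value Tensor)
	Get() (key, value Tensor)
}

// rotatingCache decorates a quantized cache so keys are rotated on
// write and unrotated on read, keeping the wrapper invisible to every
// model architecture.
type rotatingCache struct {
	inner innerCache // e.g. a Q4_0-backed causal cache
}

func (c *rotatingCache) Put(key, value Tensor) {
	c.inner.Put(rotate(key), value) // outliers spread before quantization
}

func (c *rotatingCache) Get() (Tensor, Tensor) {
	key, value := c.inner.Get()
	return unrotate(key), value // attention sees the original basis
}
```

Since the rotation is orthonormal it preserves dot products, so in principle one could rotate queries instead and skip the unrotation; unrotating on read, as sketched here, is what keeps the wrapper fully transparent to existing attention code.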

The turboquant package provides the mathematical primitives from the paper (Lloyd-Max codebooks, QJL projection matrices) for future per-coordinate quantization and residual correction stages.
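As a rough illustration of one of those primitives, here is a textbook Lloyd-Max iteration (equivalent to 1-D k-means) for fitting a scalar codebook; the function name and shape are illustrative, not the PR's actual code, and it assumes a non-empty sample slice:

```go
package turboquant

import "math"

// lloydMax fits a 1-D codebook of size k to samples by alternating two
// steps until the levels settle: assign each sample to its nearest
// level, then move each level to the mean of its assigned samples.
func lloydMax(samples []float64, k, iters int) []float64 {
	// Initialize levels evenly across the sample range.
	lo, hi := samples[0], samples[0]
	for _, s := range samples {
		lo, hi = math.Min(lo, s), math.Max(hi, s)
	}
	levels := make([]float64, k)
	for i := range levels {
		levels[i] = lo + (hi-lo)*(float64(i)+0.5)/float64(k)
	}
	for it := 0; it < iters; it++ {
		sum := make([]float64, k)
		cnt := make([]float64, k)
		for _, s := range samples {
			best, bestD := 0, math.Inf(1)
			for j, l := range levels {
				if d := math.Abs(s - l); d < bestD {
					best, bestD = j, d
				}
			}
			sum[best] += s
			cnt[best]++
		}
		for j := range levels {
			if cnt[j] > 0 {
				levels[j] = sum[j] / cnt[j] // centroid update
			}
		}
	}
	return levels
}
```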

Tests show that memory allocation at the same context length with the same control model (qwen3.5:9b) did not change until 256k context, since Ollama allocates memory up front. This is my first attempt at contributing to a big project, so there may be errors on my part. If so, please point them out so that I can become a better developer.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:45:10 -05:00

Reference: github-starred/ollama#46263