[PR #15505] kvcache: add TurboQuant compressed KV cache #77473

Open
opened 2026-05-05 10:08:18 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15505
Author: @mverrilli
Created: 4/11/2026
Status: 🔄 Open

Base: main ← Head: turboquant


📝 Commits (4)

  • c30ed52 turboquant: KV cache compression primitives
  • 43ffa30 ggml: add CUDA kernels and ops for TurboQuant KV compression
  • 65bda0f ml/backend/ggml: port TurboQuant kernels to Metal
  • 4834c76 kvcache: TurboQuant compressed KV cache

📊 Changes

64 files changed (+25395 additions, -81 deletions)

View changed files

📝 fs/ggml/ggml.go (+36 -5)
📝 kvcache/cache.go (+9 -0)
📝 kvcache/causal.go (+144 -53)
📝 kvcache/recurrent.go (+14 -1)
➕ kvcache/turboquant.go (+805 -0)
➕ kvcache/turboquant_test.go (+144 -0)
➕ llama/patches/0037-ggml-add-CUDA-kernels-and-ops-for-TurboQuant-KV-comp.patch (+4026 -0)
➕ llama/patches/0038-ml-backend-ggml-port-TurboQuant-kernels-to-Metal.patch (+3631 -0)
📝 ml/backend.go (+76 -1)
📝 ml/backend/ggml/ggml.go (+550 -12)
📝 ml/backend/ggml/ggml/include/ggml.h (+211 -1)
📝 ml/backend/ggml/ggml/src/ggml-backend.cpp (+10 -4)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.c (+23 -0)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.cpp (+9 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu (+28 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-dequant.cu (+758 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-dequant.cuh (+6 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-encode-v.cu (+198 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-encode-v.cuh (+5 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-encode.cu (+900 -0)

...and 44 more files

📄 Description

Summary

Adds a GPU-resident compressed KV cache based on TurboQuant
(arXiv 2504.19874) with extensions for
real-model robustness: per-sub-block asymmetric primary quantization and a
32-channel outlier split. Compressed K (and optionally V) live in GPU memory
with a per-layer Householder QR rotation, a Lloyd-Max codebook, per-sub-block
mean/scale, and an outlier sub-stream stored at a higher bit width.

Non-TQ users see no behaviour or performance change — the entire path is
gated on OLLAMA_KV_CACHE_TYPE=tq*.
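The core of the primary quantization step described above can be sketched in Go. This is an illustrative toy, not the PR's actual code: it shows per-sub-block asymmetric quantization (mean centring plus scaling, then nearest-codebook lookup) with a stand-in codebook; the real path first applies the per-layer Householder QR rotation and uses a trained Lloyd-Max codebook.

```go
package main

import "fmt"

// quantizeSubBlock sketches per-sub-block asymmetric primary
// quantization: centre the sub-block on its mean, normalize by the
// max absolute deviation, then map each value to the nearest
// codebook entry. Names and the codebook are illustrative only.
func quantizeSubBlock(vals []float32, codebook []float32) (idx []uint8, mean, scale float32) {
	var sum float32
	for _, v := range vals {
		sum += v
	}
	mean = sum / float32(len(vals))

	var maxAbs float32
	for _, v := range vals {
		if a := v - mean; a > maxAbs {
			maxAbs = a
		} else if -a > maxAbs {
			maxAbs = -a
		}
	}
	scale = maxAbs
	if scale == 0 {
		scale = 1 // constant sub-block: any index decodes to the mean
	}

	idx = make([]uint8, len(vals))
	for i, v := range vals {
		norm := (v - mean) / scale // in [-1, 1]
		best, bestDist := 0, float32(4)
		for j, c := range codebook {
			if d := (norm - c) * (norm - c); d < bestDist {
				best, bestDist = j, d
			}
		}
		idx[i] = uint8(best)
	}
	return idx, mean, scale
}

func main() {
	// 2-bit example: four codebook levels (placeholder, not Lloyd-Max).
	cb := []float32{-1, -0.33, 0.33, 1}
	idx, mean, scale := quantizeSubBlock([]float32{0.1, 0.9, -0.5, 0.3}, cb)
	fmt.Println(idx, mean, scale)
}
```

Decoding reverses the mapping: `codebook[idx[i]]*scale + mean`, which is why each sub-block stores its mean/scale pair alongside the packed indices.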

Presets

Set via OLLAMA_KV_CACHE_TYPE. All six presets use asymmetric per-sub-block
primary quantization with a 32-channel outlier split.

| Preset | Primary bits | K bits | V bits | Effective bits/elem | Use when |
|--------|--------------|--------|--------|---------------------|----------|
| tq2k   | 2            | 2      | f16    | ~2.25 (K)           | tightest K, V quality matters |
| tq2    | 2            | 2      | 2      | ~2.25               | smallest cache |
| tq3k   | 3            | 3      | f16    | ~3.25 (K)           | balanced K-only (default-recommended for K-only) |
| tq3    | 3            | 3      | 3      | ~3.25               | balanced (default-recommended for K+V) |
| tq4k   | 4            | 4      | f16    | ~4.25 (K)           | highest K fidelity, V at f16 |
| tq4    | 4            | 4      | 4      | ~4.25               | highest fidelity, both K and V |

The *k variants halve the savings versus their K+V siblings but keep V at
full precision — useful when V quality matters more than total cache size.
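The preset naming is regular enough to sketch as a parser. This is a hypothetical decoding of the `OLLAMA_KV_CACHE_TYPE` values in the table above; the type and function names are illustrative, not the PR's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// tqPreset is a hypothetical decoded form of a tq* preset string:
// the primary bit width plus whether V stays at full f16 precision.
type tqPreset struct {
	PrimaryBits int
	VFullPrec   bool // *k variants keep V at f16
}

func parseTQPreset(s string) (tqPreset, error) {
	var p tqPreset
	rest, ok := strings.CutPrefix(s, "tq")
	if !ok || rest == "" {
		return p, fmt.Errorf("not a tq preset: %q", s)
	}
	p.VFullPrec = strings.HasSuffix(rest, "k")
	rest = strings.TrimSuffix(rest, "k")
	switch rest {
	case "2":
		p.PrimaryBits = 2
	case "3":
		p.PrimaryBits = 3
	case "4":
		p.PrimaryBits = 4
	default:
		return p, fmt.Errorf("unknown tq preset: %q", s)
	}
	return p, nil
}

func main() {
	for _, s := range []string{"tq3", "tq3k", "tq4k"} {
		p, _ := parseTQPreset(s)
		fmt.Printf("%s -> %d-bit primary, V at f16: %v\n", s, p.PrimaryBits, p.VFullPrec)
	}
}
```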

Platform support

NVIDIA (CUDA). Pascal (cc 6.1+) and newer. Pascal is special-cased:
smaller block size and a pre-loaded shared-memory Q tile to avoid the
divergent-shuffle pattern that triggers on the older warp-shuffle path.
Validated on Tesla P40 (Pascal cc 6.1) and RTX 5060 (Blackwell).

AMD (ROCm). RDNA wave32 (gfx1030+). Wave64 AMD (Vega / GCN / CDNA) is
explicitly blocked — the __shfl_sync(width=32) codebook lookup produces
garbage on 64-wide wavefronts. RDNA1 (gfx1010–gfx1012) shares ccMajor=16
with RDNA2 and is admitted by the gate but untested. Validated on RX 7600
(RDNA3, gfx1102).

Apple Silicon (Metal). All M-series GPUs. SIMD groups are natively
32-wide so __shfl_sync(width=32) maps 1-to-1 to simd_shuffle /
simd_shuffle_xor. The Metal port covers all six ship presets (asymmetric
primary + outlier split) at head dims 64, 128, and 256. Metal and CUDA
share the on-disk block format byte-for-byte.

Head dim coverage. Fused inline-decode flash attention is instantiated
for D=64 (llama3.2:3b), D=128 (llama 3.x / qwen 2.5 / qwen 3), and D=256
(gemma 3/4). Other head dims fall back to the separate dequant + stock-FA
path.

Usage

# K+V (smallest cache) — flash attention required
OLLAMA_NEW_ENGINE=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=tq3 ollama serve

# K-only (V at f16) — works with or without flash attention
OLLAMA_NEW_ENGINE=1 OLLAMA_KV_CACHE_TYPE=tq3k ollama serve

K+V presets (tq2, tq3, tq4) require flash attention. K-only presets
(tq2k, tq3k, tq4k) work with or without it. The setting is global
and applies only to models running through Ollama's native engine.
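The constraint above (K+V presets need flash attention, K-only presets do not) can be expressed as a small validation check. This is a hedged sketch; the function name and error text are hypothetical, not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// checkTQConfig illustrates the rule that tq2/tq3/tq4 compress both
// K and V and therefore require flash attention, while the K-only
// *k presets run with or without it. Illustrative only.
func checkTQConfig(preset string, flashAttention bool) error {
	if !strings.HasPrefix(preset, "tq") {
		return nil // non-TQ cache types (f16, q8_0, ...) are unaffected
	}
	kOnly := strings.HasSuffix(preset, "k")
	if !kOnly && !flashAttention {
		return fmt.Errorf("%s compresses both K and V and requires OLLAMA_FLASH_ATTENTION=1", preset)
	}
	return nil
}

func main() {
	fmt.Println(checkTQConfig("tq3", false))  // error: needs flash attention
	fmt.Println(checkTQConfig("tq3k", false)) // <nil>
}
```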

Ablations

Two env vars allow disabling the extensions to recover plain TurboQuant
(rotation + Lloyd-Max codebook only):

OLLAMA_TQ_DISABLE_ASYMMETRIC=1   # remove per-sub-block mean centring
OLLAMA_TQ_DISABLE_OUTLIERS=1     # remove the 32-channel outlier split

Setting both reproduces the symmetric, single-stream TurboQuant baseline.
Useful for measuring the contribution of each extension on a given model.

Considerations

  • Head dims outside {64, 128, 256} route through dequant +
    stock flash attention. This path is correct on every backend but uses
    more compute-graph scratch memory than the fused path.
  • Older wave64 AMD GPUs (Vega / GCN / CDNA) fall back to f16. The gate
    is in tq_device_scan.go.
  • Pascal throughput. The fused kernel runs at lower utilisation on
    Pascal than on Volta+ because of the warp-shuffle implementation
    difference. On Ampere and newer the per-token overhead drops sharply.

Benchmarks

(( To be added ))

Prefill is routed through DequantKV + stock flash attention; decode
stays on the fused inline-decode kernel.

Implementation

Six new GGML ops (GGML_OP_TQ_ENCODE, GGML_OP_TQ_ENCODE_V,
GGML_OP_TQ_ENCODE_KV, GGML_OP_TQ_DEQUANT, GGML_OP_TQ_DEQUANT_KV,
GGML_OP_TQ_FLASH_ATTN_EXT). The CPU backend rejects all six via
supports_op so the scheduler keeps them on GPU.

The Go turboquant/ package is the algorithm reference and CPU/test path;
CUDA and Metal kernels operate on the same block layout (BlockVersion 7).
A TurboQuantCache wrapper around CausalCache intercepts K/V Put and
supplies decoded K/V to the attention kernel. Routing inside
TurboQuantCache picks between three attention paths based on preset,
head dim, and backend:

  1. Combined DequantKV + stock flash attention (preferred when the head
    dim is supported by stock FA; smallest compute-graph footprint).
  2. K+V fused inline-decode flash attention.
  3. K-only fused inline-decode for the *k presets.
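The three-way routing above can be sketched as a single selection function. This is one plausible reading of the description, folding in the head-dim coverage and the prefill-vs-decode split from the Benchmarks section; the real selection also weighs backend capabilities, and all names here are illustrative.

```go
package main

import "fmt"

type attnPath string

// chooseAttnPath sketches TurboQuantCache's routing: prefill and any
// head dim without a fused kernel go through DequantKV + stock flash
// attention; decode at a supported head dim uses the fused
// inline-decode kernel, K-only for the *k presets. Illustrative only.
func chooseAttnPath(kOnly bool, headDim int, prefill bool) attnPath {
	fusedOK := headDim == 64 || headDim == 128 || headDim == 256
	if prefill || !fusedOK {
		return "dequant+stockFA" // smallest compute-graph footprint
	}
	if kOnly {
		return "fused-K-only" // *k presets: V stays at f16
	}
	return "fused-K+V"
}

func main() {
	fmt.Println(chooseAttnPath(false, 128, false)) // fused-K+V
	fmt.Println(chooseAttnPath(true, 128, false))  // fused-K-only
	fmt.Println(chooseAttnPath(false, 96, false))  // dequant+stockFA
	fmt.Println(chooseAttnPath(false, 128, true))  // dequant+stockFA
}
```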

Notable non-TQ change

ml/backend/ggml/ggml/src/ggml-backend.cpp bumps
GGML_SCHED_MAX_SPLIT_INPUTS from 30 to 128. Large MoE graphs combined
with TQ K+V encode push split input counts above the old 30-input ceiling,
causing GGML_ASSERT at graph build time. Cost is CPU heap only (OS
lazy-faults on write); typical non-MoE graphs never populate more than
~30 inputs per split. Behaviour for non-TQ users is unchanged.

Test plan

  • go test ./turboquant/... ./kvcache/... ./ml/backend/ggml/... ./runner/ollamarunner/...
  • CUDA validated on Tesla P40 (Pascal) and RTX 5060 (Blackwell)
  • ROCm validated on RX 7600 (RDNA3, gfx1102)
  • Metal validated on Apple Silicon (llama 3.2 / gemma 3)
  • Coherence + perplexity sweep across llama 3.x, qwen 2.5, qwen 3,
    gemma 3/4 under all six presets and the two ablation env vars
  • Patches gate (make -f Makefile.sync clean checkout apply-patches sync
    followed by git diff --exit-code) clean
  • Non-TQ cache paths (f16, q8_0, q4_0) unchanged

References

  • Resolves #15051

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
