[PR #15505] kvcache: add TurboQuant compressed KV cache (tq2/tq3/tq2k/tq3k) #25717

Open
opened 2026-04-19 18:22:26 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15505
Author: @mverrilli
Created: 4/11/2026
Status: 🔄 Open

Base: main ← Head: turboquant


📝 Commits (5)

  • 1f33d3e turboquant: add TurboQuant primitives and block encoder
  • 8430f13 ml/backend/ggml: add CUDA kernels and GGML ops for TurboQuant KV compression
  • b308f09 ml: add Go bindings for the TurboQuant compressed KV manager
  • aab6fcb kvcache: add TurboQuant compressed KV cache with tq2/tq3/tq2k/tq3k presets
  • 35f6b4e Revert GGML_CUDA_FA_ALL_QUANTS to OFF

📊 Changes

51 files changed (+8336 additions, -69 deletions)

View changed files

📝 CMakeLists.txt (+1 -0)
📝 docs/faq.mdx (+10 -0)
📝 fs/ggml/ggml.go (+27 -1)
📝 kvcache/cache.go (+9 -0)
📝 kvcache/causal.go (+104 -52)
➕ kvcache/turboquant.go (+485 -0)
➕ kvcache/turboquant_test.go (+101 -0)
📝 llm/server.go (+11 -7)
📝 ml/backend.go (+73 -0)
📝 ml/backend/ggml/ggml.go (+368 -1)
📝 ml/backend/ggml/ggml/include/ggml.h (+155 -0)
📝 ml/backend/ggml/ggml/src/ggml-backend.cpp (+6 -1)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.c (+23 -0)
📝 ml/backend/ggml/ggml/src/ggml-cpu/ggml-cpu.cpp (+9 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu (+28 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-dequant.cu (+392 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-dequant.cuh (+6 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-encode-v.cu (+155 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-encode-v.cuh (+5 -0)
➕ ml/backend/ggml/ggml/src/ggml-cuda/tq-encode.cu (+532 -0)

...and 31 more files

📄 Description

Summary

Adds a GPU-resident compressed KV cache based on the TurboQuant paper
(arXiv 2504.19874), implementing
Algorithm 1 (Householder QR rotation + Lloyd-Max scalar quantization)
as four new OLLAMA_KV_CACHE_TYPE options.

| Name | K bits | V bits | Approx. VRAM vs f16 | Notes |
|------|--------|--------|---------------------|-------|
| tq3k | 3 | f16 | ~60% | quality-preserving K-only |
| tq3  | 3 | 3   | ~20% | balanced tier |
| tq2k | 2 | f16 | ~57% | smallest K footprint with f16 V |
| tq2  | 2 | 2   | ~14% | most aggressive |
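
As with the existing f16 / q8_0 / q4_0 options, a preset is selected through the environment variable, e.g. `OLLAMA_KV_CACHE_TYPE=tq3k ollama serve`.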

The compressed K (and optionally V) tensors live in GPU memory with
their own Lloyd-Max codebook, rotation matrix, and per-cell RMS scales.
Encode, dequant, and an optional inline-decode fused flash attention
path are implemented as new GGML ops (GGML_OP_TQ_ENCODE,
GGML_OP_TQ_DEQUANT, GGML_OP_TQ_FLASH_ATTN_EXT, plus variants for V
and combined K+V). Kernels require compute capability 6.0+ (Pascal or
newer) for warp shuffle; the CPU backend rejects the new ops so the
scheduler never routes them off-GPU.
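
For intuition, the following is a minimal Go sketch of the per-vector encode step that Algorithm 1 describes: rotate, normalize by RMS, snap each coordinate to a Lloyd-Max code. It is illustrative only, not the PR's kernel code; the 2-bit levels are the classic Lloyd-Max reproducer values for a unit-variance Gaussian, and the identity rotation in main stands in for the real Householder QR rotation.

```go
// Illustrative sketch only (not the PR's kernels): the shape of TurboQuant's
// Algorithm 1 per-vector encode. A head vector is rotated by a fixed
// orthonormal matrix, normalized by its per-cell RMS, and each coordinate is
// snapped to the nearest Lloyd-Max code: x̂[i] ≈ scale * codebook[codes[i]].
package main

import (
	"fmt"
	"math"
)

// Classic 2-bit Lloyd-Max reproducer levels for a unit-variance Gaussian.
var codebook2bit = []float64{-1.510, -0.4528, 0.4528, 1.510}

func nearestCode(x float64) uint8 {
	best, bestDist := 0, math.Inf(1)
	for i, c := range codebook2bit {
		if d := math.Abs(x - c); d < bestDist {
			best, bestDist = i, d
		}
	}
	return uint8(best)
}

func encode(vec []float64, rot [][]float64) (codes []uint8, scale float64) {
	d := len(vec)
	rotated := make([]float64, d)
	for i := range rotated {
		for j, v := range vec {
			rotated[i] += rot[i][j] * v
		}
	}
	var sumsq float64
	for _, v := range rotated {
		sumsq += v * v
	}
	scale = math.Sqrt(sumsq / float64(d)) // per-cell RMS scale
	codes = make([]uint8, d)
	for i, v := range rotated {
		codes[i] = nearestCode(v / scale)
	}
	return codes, scale
}

func main() {
	// Identity rotation keeps the example tiny; the real scheme uses a
	// Householder QR rotation to spread energy across coordinates.
	rot := [][]float64{{1, 0}, {0, 1}}
	codes, scale := encode([]float64{0.9, -2.1}, rot)
	fmt.Println(codes, scale) // [2 0] 1.615...
}
```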

How it's organised

Four commits (plus a small build-flag revert), each a self-contained layer:

  1. turboquant: add TurboQuant primitives and block encoder
    A pure Go package with the Lloyd-Max codebook, Householder QR
    rotation, Lloyd-Max boundaries, and the block / per-head encoders.
    Algorithmic property tests (round-trip quality, paper distortion
    bounds, unbiasedness, heavy-tailed input resilience) are all
    covered here with plain go test ./turboquant/.... Also includes
    an optional paper §4.3 outlier-split reference encoder that is
    bit-exact with the GPU kernel, used for CPU↔GPU equivalence tests.

  2. ml/backend/ggml: add CUDA kernels and GGML ops for TurboQuant KV compression
    Six new GGML ops and the CUDA kernels that back them. Includes the
    uniform and optional outlier-split variants, a fused inline-decode
    flash attention fallback, CPU backend rejection for all TQ ops, and
    dispatch wiring. Also bumps GGML_SCHED_MAX_SPLIT_INPUTS from 30
    to 128 (see Notable non-TQ change below).

  3. ml: add Go bindings for the TurboQuant compressed KV manager
    The Go side of the GGML ops: a per-cache TQCompressedKManager
    interface, the concrete ggml-backed implementation, per-call
    rotation flags consumed by SDPA, and GPU capability detection with
    graceful fallback on pre-Pascal hardware (a possible interface
    shape is sketched after this list).

  4. kvcache: add TurboQuant compressed KV cache with tq2/tq3/tq2k/tq3k presets
    The user-facing wrapper. Extends kvcache.Causal with SkipK /
    SkipV / DTypeK / DTypeV so the TurboQuant wrapper can
    suppress the inner cache's f16 allocations. Adds a
    CausalConfigurable interface so gemma3 / gemma4 multimodal
    masking keeps working on TQ-wrapped global sub-caches. Wires the
    dispatch in runner/ollamarunner/cache.go, updates
    llm/server.go to accept the new cache types with and without
    flash attention, teaches fs/ggml to report the right graph size,
    and documents the new cache types in docs/faq.mdx.
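
To make the split between commits 3 and 4 concrete, here is a minimal sketch of what the compressed-K manager could look like on the Go side. Only the interface name TQCompressedKManager comes from the PR; the method set and signatures below are hypothetical, chosen to illustrate the responsibilities commit 3 describes.

```go
// Hypothetical sketch: only the name TQCompressedKManager appears in the
// PR; the method set and signatures below are invented for illustration.
package kvcache

import "github.com/ollama/ollama/ml"

type TQCompressedKManager interface {
	// Supported reports whether the backend has the TQ kernels
	// (compute capability 6.0+); callers fall back to f16 otherwise.
	Supported() bool

	// EncodeK quantizes freshly written K rows into the compressed cache.
	EncodeK(ctx ml.Context, k ml.Tensor, layer int) error

	// DequantK materializes f16 K rows for the stock attention path.
	DequantK(ctx ml.Context, layer int) (ml.Tensor, error)

	// RotationRequired tells SDPA whether queries must be rotated to
	// match the rotated, compressed keys (the per-call rotation flags).
	RotationRequired() bool
}
```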

Benchmark results

Measured on GTX 950 + Tesla P40 (Pascal, 23GB). Non-TQ users see no
behavioural or performance changes — all TQ code is gated on
OLLAMA_KV_CACHE_TYPE=tq*.

Note on hardware scaling: these numbers are from a Pascal GPU.
On Ampere and newer — tensor cores for the rotation mul_mat, higher
HBM bandwidth for the dequant pass, lower kernel launch latency —
the decode regression vs f16 should shrink meaningfully, with a
rough projection putting tq3k at roughly half the measured Pascal
delta. Further kernel work (fused inline dequant on tensor cores,
tighter graph fusion for the encode/dequant ops) could close more
of the gap. Benchmarks on newer hardware and those optimisations
are deliberately out of scope for this PR.

llama3.2:3b @ ctx=2048

| Mode | Prefill tok/s | Decode tok/s | PPL | KV MB |
|------|---------------|--------------|-----|-------|
| f16  | 1790 | 74.6 | 1.0157 | 224 |
| tq3k | 1415 (-21%) | 65.1 (-13%) | 1.0094 (-0.62%) | 135 (-40%) |
| tq3  | 1132 (-37%) | 57.9 (-22%) | 1.0130 (-0.27%) | 46 (-79%) |
| tq2k | 1417 (-21%) | 65.6 (-12%) | 1.0189 (+0.31%) | 128 (-43%) |
| tq2  | 1140 (-36%) | 58.5 (-22%) | 1.0355 (+1.95%) | 32 (-86%) |
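
As a sanity check on the KV MB column, a back-of-envelope sketch (the llama3.2:3b dimensions used here, 28 layers, 8 KV heads, head_dim 128, are assumed from the published model config, not taken from the PR):

```go
// Back-of-envelope check of the f16 KV MB figure above. Model dimensions
// for llama3.2:3b (28 layers, 8 KV heads, head_dim 128) are assumed.
package main

import "fmt"

// f16 KV bytes = 2 (K and V) * layers * ctx * kvHeads * headDim * 2 bytes/elt.
func kvBytesF16(layers, ctx, kvHeads, headDim int) int {
	return 2 * layers * ctx * kvHeads * headDim * 2
}

func main() {
	fmt.Println(kvBytesF16(28, 2048, 8, 128) >> 20) // 224 (MiB), matching the f16 row
}
```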

qwen3-coder:30b @ ctx=32768

| Mode | Decode tok/s | PPL | KV MB |
|------|--------------|-----|-------|
| f16  | 40.5 | 1.0000 | 3072 |
| tq3k | 25.3 (-37%) | 1.0010 (within noise) | 2156 (-30%) |
| tq3  | 17.6 (-57%) | 1.0015 (within noise) | 728 (-76%) |

tq3k gives near-f16 PPL on qwen3-coder:30b with 30% VRAM savings
and fits a 32k context where f16 would need more headroom.

gemma3:1b (WrapperCache global + SWA)

All 9 cells tested (2048/8192/32768 × f16/tq3/tq3k). PPL within +0.5%
of f16 across the board; decode within 5-10% of f16. The WrapperCache
recursion correctly wraps only the global sub-cache and leaves the
sliding-window sub-cache alone.

Known limitations

  1. Qwen 2 family (Qwen2.5, Qwen2-VL, etc.) — learned K bias
    produces a bias-dominated asymmetric K distribution that none of
    the symmetric quantizers in this PR handle well. PPL degrades
    sharply (6-13 vs the ~1.01 f16 baseline). Documented in docs/faq.mdx
    as a known weak spot. A follow-up PR adding per-vector asymmetric
    quantization (based on the NVIDIA TensorRT-LLM hint in
    NVIDIA/TensorRT-LLM#4218) is planned to address this; it can plug
    into the same encode/dequant infrastructure this PR lands.

  2. Paper §4.3 outlier split and Algorithm 2 QJL residual are
    implemented but disabled in the shipped presets (OutlierCount=0)
    because on the validated model set they regress decode throughput
    and PPL without improving the heavy-tailed quality the paper
    The infrastructure is kept in the tree — future dynamic dispatch
    can enable either conditionally per model.

  3. Fused inline-decode flash attention kernel is only instantiated
    at headDim=128 with k_bits ∈ {2, 3}. Other head dims or bit
    widths fall through to the separate dequant + stock FA path,
    which is faster in practice anyway (a minimal gate sketch
    follows this list).
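
A minimal sketch of the dispatch gate implied by limitation 3 (the function name is hypothetical):

```go
// Hypothetical gate (the function name is invented): shapes outside the
// instantiated fused-kernel set fall back to separate dequant + stock FA.
func useFusedTQFlashAttn(headDim, kBits int) bool {
	return headDim == 128 && (kBits == 2 || kBits == 3)
}
```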

Notable non-TQ change

ml/backend/ggml/ggml/src/ggml-backend.cpp bumps
GGML_SCHED_MAX_SPLIT_INPUTS from 30 to 128. This is the only
global change in the PR that is not gated on a TQ cache type, so it
affects all backends and all users. Rationale:

  • On large MoE models like qwen3-coder:30b (48 layers with expert
    routing) the combined K+V encode op plus the multi-expert gather
    paths produce split input counts above the old 30-input ceiling,
    causing the scheduler to hit GGML_ASSERT during graph build.
  • The old limit was artificially conservative. Upstream GGML has a
    FIXME in the same area noting that the check only fires when the
    split is exactly full, so multi-input ops can already overshoot it.
  • Cost: each ggml_backend_sched_split struct grows by ~784 bytes
    (two inputs[128] arrays replacing inputs[30]). For a typical
    graph this is a few KB total, negligible.
  • Behavioural impact on non-TQ users: none. A graph that fit under 30
    inputs before still fits. The change only unblocks previously-failing
    configurations.

Flagged explicitly so reviewers can see the blast radius isn't
limited to the TQ code paths.

Test plan

  • go test ./turboquant/... ./kvcache/... ./ml/backend/ggml/... ./runner/ollamarunner/... passes
  • Smoke test generation on llama3.2:3b with f16, tq3k, tq3, tq2k, tq2 — all produce coherent output
  • Smoke test generation on gemma3:1b with tq3k — WrapperCache path, only global sub-cache wrapped
  • Full benchmark matrix (llama3.2:3b / gemma3:1b / qwen2.5:7b / qwen3-coder:30b × ctx 2048 / 8192 / 32768 × f16 / tq3 / tq3k) in results_option_a/ and results_outlier_final/
  • Verified non-TQ cache paths (f16, q8_0, q4_0) take the same code paths as before (see commit-level review in the PR description above: SkipK/SkipV gates, CausalConfigurable type widening, PresetFromDType returning false for non-tq types)
  • Verified the fused FA kernel dispatch paths compile and are gated off for head dims ≠ 128

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/ollama#25717