[GH-ISSUE #15051] native ollama-go-engine: TurboQuant+RotorQuant implementation #71720

Open
opened 2026-05-05 02:24:17 -05:00 by GiteaMirror · 57 comments

Originally created by @grepin on GitHub (Mar 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15051

@rick-github @jessegross jfyi https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/ + https://arxiv.org/pdf/2504.19874

GiteaMirror added the feature request label 2026-05-05 02:24:17 -05:00

@grepin commented on GitHub (Mar 25, 2026):

A meta-algorithm description is provided in the paper; it seems not too hard to implement. An implementation could significantly reduce KV-cache size while keeping quality, and introduce a compute speedup in most cases.

@goedzo commented on GitHub (Mar 25, 2026):

Upvote👍

@OrBeProgrammed commented on GitHub (Mar 25, 2026):

Critical need.

@postEntropy commented on GitHub (Mar 25, 2026):

Need it ASAP

@grepin commented on GitHub (Mar 26, 2026):

@OrBeProgrammed @postEntropy @goedzo guys, don't push the devs; let them decide when and how to implement it (or not). Yes, TQ is a cool thing (and from my point of view it will first of all help their business with ollama:cloud), but as in any project there are limited resources and plans. The feature request has been made, so let things take their course.

@OrBeProgrammed commented on GitHub (Mar 26, 2026):

I'm not pushing anyone I'm working on a PR myself :) We're all in this together!

@grepin commented on GitHub (Mar 26, 2026):

btw, @OrBeProgrammed: https://github.com/ggml-org/llama.cpp/issues/20977; enthusiasts are already trying to implement it: https://github.com/ggml-org/llama.cpp/compare/master...mudler:llama.cpp:feat/turbo-quant
https://github.com/TheTom/turboquant_plus (the code seems useful for understanding the technical implementation and porting it to Ollama's Go-based codebase)

@grepin commented on GitHub (Mar 26, 2026):

+ vLLM (feature request only, for now): https://github.com/vllm-project/vllm/issues/38171
Most likely all the frontier inference engines will implement TQ within a couple of months.

@grepin commented on GitHub (Mar 26, 2026):

yep, this looks good as a reference Python implementation with "post-generation kv-cache analysis & attention quality tests": https://github.com/TheTom/turboquant_plus

@123Haynes commented on GitHub (Mar 26, 2026):

the discussion here also documents some pitfalls during the implementation: https://github.com/ggml-org/llama.cpp/discussions/20969

@grepin commented on GitHub (Mar 26, 2026):

+1 to implementations (PoC with measurement results): https://github.com/vllm-project/vllm/issues/38171#issuecomment-4134002937

@kblood commented on GitHub (Mar 26, 2026):

Oh yes, hoped to see Ollama support for this the first time I read about it :)

@codyseally commented on GitHub (Mar 26, 2026):

Absolute upvote !

@XZzYassin commented on GitHub (Mar 26, 2026):

Oleh! 🌹

@grepin commented on GitHub (Mar 27, 2026):

+ https://github.com/scrya-com/rotorquant looks very interesting and more effective. "I love this world full of genius engineers."

@mobilexmt commented on GitHub (Mar 28, 2026):

please also consider DGX Spark, thanks!

@richardokonicha commented on GitHub (Mar 28, 2026):

Hurray

@christopheduc-me commented on GitHub (Mar 29, 2026):

100% upvoted, of course! We need it to unlock new possibilities for our local usage.

@dorinsimionescu commented on GitHub (Mar 30, 2026):

upvote too

@QAM commented on GitHub (Mar 30, 2026):

+1 plz

@grepin commented on GitHub (Mar 30, 2026):

in fact, work is in progress: https://github.com/ollama/ollama/pull/15125
If you have enough understanding of the engine plus the paper/algorithm (or are ready to dive into it yourself, with or without an AI agent, and without blindly accepting everything the AI produces during implementation), you can help. As always in open source, "you are on your own" and "code & tests are the only source of truth", but I hope @Dankguy17 and @YKesX can guide your efforts (for example, as usual, many more test runs on different models are needed to find problems and improve the implementation). Many thanks to @Dankguy17 and @YKesX for their contribution.

@Sreekmans commented on GitHub (Mar 30, 2026):

+1

@Reikagilu commented on GitHub (Mar 30, 2026):

+1

@medenijazbec commented on GitHub (Mar 31, 2026):

I took a pass at implementing this on top of Ollama v0.18.3 and published the work here:

https://github.com/medenijazbec/ollama-turboquant/tree/turboquant-0.18.3

Current status

  • added TurboQuant productization surface for Ollama
  • request/API override path for kv_cache_type
  • CLI support for ollama run --turboquant
  • bench support for ollama-bench -turboquant
  • model default support via PARAMETER kv_cache_type ...
  • added custom Dockerfiles because I’m compiling against a CUDA 11.8-compatible setup for older GPU compatibility reasons
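A rough usage sketch of those surfaces, for anyone wanting to try the fork (the exact `kv_cache_type` value names are whatever the fork accepts; the value and model below are only placeholders):

```
# Modelfile: model-level default (value name is illustrative)
FROM llama3.2:3b
PARAMETER kv_cache_type turboquant

# one-off run via the fork's CLI flag
ollama run --turboquant llama3.2:3b

# benchmark pass via the fork's bench tool
ollama-bench -turboquant
```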

Benchmarking

I’ve also been running baseline vs TurboQuant experiments focused on KV pressure and long-context behavior.

I included multiple benchmark variants as well, including lighter / extra-light passes, because the available hardware is limited and some of the longer sweeps are expensive to run reliably.

Documentation

I also wrote up the implementation and benchmark notes in the repo here:

  • docs/turboquant_paper_design.md
  • docs/turboquant_audit.md
  • docs/kvstress-test-commands.txt

Important caveats

  • this is a fork / WIP branch, not something I’d call upstream-ready yet
  • I’m still working through benchmarking, validation, and upstream-merge hygiene
  • I also want to tighten the implementation against the paper details to make sure the algorithm path is correct
  • VRAM numbers are missing from the currently published benchmark results because I realized too late that the benchmark container did not have proper CUDA/NVML visibility
  • rerunning the full matrix is a bit painful on Tesla M40s, especially since they are older, passively cooled cards

So I would treat the current branch as a reference implementation / experimentation branch rather than a finished upstream proposal.

Sharing it in case it helps others compare approaches, reuse some of the API / CLI / benchmark surface work, or validate against their own hardware. If others run similar tests and publish results, that would be very useful too.

Happy to compare notes with anyone working on the engine-side implementation or benchmark methodology.

@jclab-joseph commented on GitHub (Mar 31, 2026):

https://github.com/ggml-org/llama.cpp/discussions/20969 There seems to be lively discussion taking place here! I think it would be good to refer to.

@msk-one commented on GitHub (Apr 1, 2026):

+1

@Blue-Crescent commented on GitHub (Apr 3, 2026):

vote

@lxdlam commented on GitHub (Apr 4, 2026):

vote for this

@Readon commented on GitHub (Apr 7, 2026):

I think ggml-org/llama.cpp#21038 has already implemented this.

@ZevAlain commented on GitHub (Apr 11, 2026):

vote

@hajhouj commented on GitHub (Apr 11, 2026):

If Ollama actually pulls this off, it’s going to hit hard. It’s wild to think that a setup with just 8GB of VRAM could handle a model originally designed for something like a 48GB GPU. That kind of leap really pushes us closer to a world where fully autonomous, self-hosted AI isn’t just hype… it’s right around the corner.

@mverrilli commented on GitHub (Apr 11, 2026):

I put together this PR if anyone wants to review and build. #15505

● Adds tq2/tq3/tq2k/tq3k KV cache types implementing TurboQuant (arXiv 2504.19874) — a GPU-resident compressed K/V path built from Householder QR rotation plus Lloyd-Max scalar quantization, with new CUDA kernels for encode, dequant, and fused flash attention.

Roughly doubles usable context per VRAM dollar on Pascal+ GPUs at near-f16 quality: tq3k matches f16 PPL on llama3.2/gemma3/qwen3-coder with ~40% KV savings, tq3 gives ~80% KV savings for a ~0.5% PPL cost, and the K-only variants (tq3k/tq2k) work even with flash attention disabled.
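For intuition, a minimal sketch of the rotate-then-scalar-quantize idea behind these cache types (this is not the PR's kernel code; the reflection vector and 2-bit codebook are illustrative stand-ins, and per-group scaling is omitted):

```go
package main

import (
	"fmt"
	"math"
)

// householder applies H = I - 2*v*v^T (v unit-norm) to x. Rotations like
// this spread the K/V coordinates so a small scalar codebook fits them better.
func householder(x, v []float64) []float64 {
	dot := 0.0
	for i := range x {
		dot += v[i] * x[i]
	}
	out := make([]float64, len(x))
	for i := range x {
		out[i] = x[i] - 2*dot*v[i]
	}
	return out
}

// quantize maps each coordinate to its nearest codebook entry, returning
// the code indices and the dequantized reconstruction.
func quantize(x, codebook []float64) ([]int, []float64) {
	codes := make([]int, len(x))
	recon := make([]float64, len(x))
	for i, xi := range x {
		best, bestDist := 0, math.Inf(1)
		for c, q := range codebook {
			if d := math.Abs(xi - q); d < bestDist {
				best, bestDist = c, d
			}
		}
		codes[i], recon[i] = best, codebook[best]
	}
	return codes, recon
}

func main() {
	x := []float64{1.2, -0.3, 0.8, -2.1}          // one K/V vector (toy size)
	v := []float64{0.5, 0.5, 0.5, 0.5}            // unit-norm reflection vector (illustrative)
	codebook := []float64{-1.5, -0.45, 0.45, 1.5} // 2-bit levels; Lloyd-Max would derive optimal ones
	rot := householder(x, v)
	codes, recon := quantize(rot, codebook)
	fmt.Println("codes:", codes)
	// A Householder reflection is its own inverse, so decoding is:
	// dequantize, then apply the same reflection again.
	fmt.Println("approx x:", householder(recon, v))
}
```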

@mverrilli commented on GitHub (Apr 11, 2026):

@hajhouj On any GPU, TurboQuant tq3k gets you ~40% KV cache savings with PPL essentially unchanged from f16 and tq3 gets you ~80% KV savings at a small cost. That doesn’t change what models fit on your card, but it roughly doubles the usable context length for whatever model you’re already running.
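For a feel for where those percentages come from, KV cache size is just elements times bits per element: 2 tensors (K and V) times layers, KV heads, head dim, and context length. A back-of-the-envelope sketch (the model shape and the ~3.5 effective bits, which allows for per-group metadata, are illustrative assumptions):

```go
package main

import "fmt"

// kvBytes estimates KV cache size: 2 tensors (K and V) times
// layers*kvHeads*headDim*ctx elements times bits per element.
func kvBytes(layers, kvHeads, headDim, ctx int, bitsPerElem float64) float64 {
	return 2 * float64(layers*kvHeads*headDim*ctx) * bitsPerElem / 8
}

func main() {
	// Illustrative shape, roughly in the class of a small 3B model.
	layers, kvHeads, headDim, ctx := 28, 8, 128, 32768
	f16 := kvBytes(layers, kvHeads, headDim, ctx, 16)
	tq := kvBytes(layers, kvHeads, headDim, ctx, 3.5) // ~3-bit codes plus scale/zero-point overhead
	fmt.Printf("f16: %.0f MiB, ~3-bit: %.0f MiB, savings: %.0f%%\n",
		f16/(1<<20), tq/(1<<20), 100*(1-tq/f16))
}
```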

@hajhouj commented on GitHub (Apr 11, 2026):

> @hajhouj On any GPU, TurboQuant tq3k gets you ~40% KV cache savings with PPL essentially unchanged from f16 and tq3 gets you ~80% KV savings at a small cost. That doesn’t change what models fit on your card, but it roughly doubles the usable context length for whatever model you’re already running.

Thanks for the info. I came across a news article about this new algorithm claiming it reduces memory usage by a factor of six, which might be a bit exaggerated. Still, the ability to increase context length without sacrificing speed is genuinely promising. It feels like another big step toward making low-cost, self-hosted AI more practical, potentially opening the door for wider desktop-level use.

@mverrilli commented on GitHub (Apr 12, 2026):

> Thanks for the info. I came across a news article about this new algorithm claiming it reduces memory usage by a factor of six, which might be a bit exaggerated. Still, the ability to increase context length without sacrificing speed is genuinely promising. It feels like another big step toward making low-cost, self-hosted AI more practical, potentially opening the door for wider desktop-level use.

All good. It does do that, but for the KV cache, not total VRAM. I was able to get a factor of 5 reduction (80%) with tq3. It's a little bit hyped, but I think still very good.

@achraf99999 commented on GitHub (Apr 13, 2026):

> + vLLM (feature request only, for now): https://github.com/vllm-project/vllm/issues/38171
> most likely all the frontier inference engines will implement TQ within a couple of months.

Did you manage to run TurboQuant from this PR? If yes, can you share your configuration and hardware setup, please?

@OrBeProgrammed commented on GitHub (Apr 13, 2026):

I got this working. I am not sure it is working with all models. It loads the model insanely faster than before. I just told Claude to make the PR happen, so I'm not sure exactly what all it did. Is there a particular piece of info I can share that would help?

@medenijazbec commented on GitHub (Apr 13, 2026):

> I got this working. I am not sure it is working with all models. It loads the model insanely faster than before. I just told Claude to make the PR happen, so I'm not sure exactly what all it did. Is there a particular piece of info I can share that would help?

Please provide some tests; I've got mine ready but can't find the time to run them all. There also already seems to be a full implementation in llama.cpp. I haven't really been following that thread for about two weeks, so there might be a lot of new findings. What I've done is take inspiration from their ideas and credit them in my code for TurboQuant, since the folks over there are way smarter than I am lmao. Anyway, I think a good test for this would be to connect it to something like Claude Code and make it run 3 runs of https://gist.github.com/ivanfioravanti/98ba7e5d3f7a88c1756b045d3e565630 using native Ollama, then compare the average to the average results of your TurboQuant implementation.

@mverrilli commented on GitHub (Apr 21, 2026):

> I put together this PR if anyone wants to review and build. #15505

FYI: added Metal and ROCm support, if anyone wants to test it out and report back.

@Dankguy17 commented on GitHub (Apr 21, 2026):

Yeah, I can try right now, although there is definitely a lower chance that the PR gets accepted because it adds nearly 60k lines of code lol. Did you forget to gitignore something??

@mverrilli commented on GitHub (Apr 21, 2026):

@Dankguy17 That's a vendor patch issue. I thought I had fixed it, but I think that was in another branch. I'll push the branch after my build test.

@Dankguy17 commented on GitHub (Apr 21, 2026):

cool! will discuss more in your pr

@DATEx2 commented on GitHub (Apr 25, 2026):

So when will it be released? We kind of all need TurboQuant

@mverrilli commented on GitHub (Apr 26, 2026):

TQ is not a small PR. To be useful, it has to compress the KV cache, avoid slowing down prefill or decode too much, stay off the paths that use the scratch buffer (which would offset the VRAM savings), and keep output coherent.

The branch I have right now does TQ, but has two issues:

  1. qwen2 family is incoherent. I think because qwen2 has a learnable bias on the K projection. I've tried a few things here, but still working through it.
  2. scratch buffer usage due to slow path. This should be solvable and I have a branch to wire it up in progress.

@mverrilli commented on GitHub (Apr 27, 2026):

After really digging in on this, I am starting to think TQ isn't really the best solution despite the claims in the paper. Certainly I was able to get a coherent, highly compressed KV cache. Performance issues aside (some of which can be improved, and some that are much improved on newer hardware), the perplexity scores drift from f16 quite a bit. This does not appear to be as lossless as expected.

It's possible this is due to something in my implementation; however, I went back to the paper and noticed some things. First, the abstract reads as if this were a general solution, but the paper itself is quite specific about the models it selected and the method by which the loss was measured.

In addition, I read several papers that cite or critique TurboQuant. One key finding: the QJL residual component (part of what makes TQ's compression work) has a known accuracy degradation that compounds per layer; the paper only tested on 32-layer models, and the math suggests it would break down badly on larger ones (arXiv:2604.19528). Another paper points out that minimizing reconstruction error (what TQ optimizes for) isn't the same as minimizing perplexity loss, and the two can diverge significantly (arXiv:2602.05367). I also recommend reading arXiv:2604.18555 (I did correct the flaw it describes and benchmarked it; the effect was minimal, though).

I'm doing some more benchmarks and will post them when they finish. I'll update a branch tonight in case anyone wants to also take a look. I do have two other approaches in progress and they are simpler so I may be able to get those benchmarked as well.

@mverrilli commented on GitHub (Apr 27, 2026):

Also, I put together a better perplexity measurement tool than the one I used previously. The previous one measured self-PPL; now I'm using reference PPL (forward passes on WikiText-2, etc.).
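For anyone unfamiliar with the distinction: reference PPL runs forward passes over a fixed held-out corpus and exponentiates the negative mean log-probability the model assigned to the true next tokens, rather than scoring the model's own generations. A minimal sketch of that final step (the log-probs here are made up):

```go
package main

import (
	"fmt"
	"math"
)

// perplexity computes exp(-mean(log p)) over the log-probabilities the
// model assigned to the true next tokens of a reference corpus
// (e.g. WikiText-2). Lower is better; the f16 KV cache is the baseline.
func perplexity(logProbs []float64) float64 {
	sum := 0.0
	for _, lp := range logProbs {
		sum += lp
	}
	return math.Exp(-sum / float64(len(logProbs)))
}

func main() {
	// Illustrative per-token log-probs from a forward pass.
	logProbs := []float64{-2.1, -0.4, -1.7, -3.0, -0.9}
	fmt.Printf("PPL = %.2f\n", perplexity(logProbs))
}
```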

@mverrilli commented on GitHub (Apr 29, 2026):

Here's some output from some experimental compressions. https://gist.github.com/mverrilli/dbd9935bdec44495e635a3c5cdf611d0

f16 - baseline, no compression
q4_0 / q8_0 - block quantization borrowed from weight quant.
tq (TurboQuant) - rotation + Lloyd-Max codebook, *qa were some tests where I added some extra features (qjl, outlier split, asymmetric).
q8k / q4k - per-group asymmetric int8/int4 (something clean and simple, a modification of an idea common in a few papers)
saw - same as q8k/q4k but with Hadamard rot first (arXiv:2604.19157)

This run was really about perplexity, not compression. Larger ctx would have better kv cache compression rates due to overhead.

But you can see TQ really isn't that great PPL-wise. saw8kv and saw4kv are the winners here. Need more tests though.
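For readers comparing presets, here is a minimal sketch of the per-group asymmetric idea behind q8k/q4k (saw is the same thing with a Hadamard-style rotation applied first); the group size, 8-bit width and storage layout below are illustrative, not the branch's exact format:

```go
package main

import "fmt"

// quantizeGroup does per-group asymmetric int8 quantization: each group
// stores a scale and a zero-point (the group min), and every value is
// mapped to an unsigned 8-bit code within that group's range.
func quantizeGroup(vals []float32) (codes []uint8, scale, zero float32) {
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	scale = (hi - lo) / 255
	if scale == 0 {
		scale = 1 // degenerate all-equal group
	}
	zero = lo
	codes = make([]uint8, len(vals))
	for i, v := range vals {
		codes[i] = uint8((v-zero)/scale + 0.5) // round to nearest code
	}
	return codes, scale, zero
}

func dequantize(codes []uint8, scale, zero float32) []float32 {
	out := make([]float32, len(codes))
	for i, c := range codes {
		out[i] = float32(c)*scale + zero
	}
	return out
}

func main() {
	// One toy group; the branch uses group=32.
	group := []float32{0.12, -0.80, 0.55, 0.03, -0.20, 0.91, -0.44, 0.37}
	codes, scale, zero := quantizeGroup(group)
	fmt.Println(codes)
	fmt.Println(dequantize(codes, scale, zero))
}
```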

@EddyChen commented on GitHub (May 1, 2026):

Critical need.

@johny-mnemonic commented on GitHub (May 1, 2026):

> But you can see TQ really isn't that great PPL-wise. saw8kv and saw4kv are the winners here. Need more tests though.

I have seen tests on bigger models and longer contexts, and when you use tq4 it is quite good, especially q8k+tq4v, as most models seem to be much more sensitive to K compression than to V compression.

Why don't you test tq4 at all?

@YKesX commented on GitHub (May 1, 2026):

On my PR with @Dankguy17, I have already started using tq2, tq3 and tq4 internally. The thing is, the maintainers will use the MLX implementation when it is out, so I saw no reason to push the latest local updates (I was also quite busy with other projects). If there are people who just want to test it, I can update the repo to the latest Ollama base and push my changes. In my tests, all the TurboQuant quantizations made sense only beyond a 32k context window (with qwen3.5 9b; I will test gemma4 26b a4b if I update the repo) compared to normal q4. I still need to test quality, but my home agents are currently working great (I know that is not a good metric; qwen3.5 9b, 128k). So if anyone else is also willing to test, I can push and let you all test until MLX support comes. I should also look at what @mverrilli has done in his PR to see if I am missing something; I tried to stay as close to Google's paper as possible. Thanks for your efforts @Dankguy17 and @mverrilli!

@mverrilli commented on GitHub (May 1, 2026):

@johny-mnemonic Sure, I'm running a 32k benchmark for tq3/tq4.

@YKesX I do have a new branch to merge in, I haven't tested it on rocm or metal, yet, which is why I held it back. I'll try and run those tests today after the 32k benchmarks run.

@mverrilli commented on GitHub (May 2, 2026):

@johny-mnemonic

I had to fix a few things and reran llama3.2:3b ctx=512:

  ┌────────┬───────┬─────────┬────────┬────────┬───────────┐
  │ preset │  PPL  │ prefill │ decode │ KV MiB │ total MiB │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ f16    │ 14.58 │ 391     │ 75.8   │ 84     │ 2469      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq2    │ 24.83 │ 282     │ 46.5   │ 12     │ 2377      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq3    │ 15.87 │ 278     │ 45.8   │ 17     │ 2381      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq4    │ 14.77 │ 281     │ 46.9   │ 22     │ 2387      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq2k   │ 23.09 │ 303     │ 53.5   │ 48     │ 2413      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq3k   │ 15.83 │ 301     │ 53.3   │ 51     │ 2413      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq4k   │ 14.82 │ 300     │ 53.4   │ 53     │ 2415      │
  └────────┴───────┴─────────┴────────┴────────┴───────────┘

I ran ctx=32k on AMD RX 7600 (gfx1102), llama3.2:3b. A bit of a perf issue I need to look into.

  ┌────────┬───────┬─────────┬────────┬────────┬───────────┐
  │ preset │  PPL  │ prefill │ decode │ KV MiB │ total MiB │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ f16    │  3.65 │ 53      │ 21.3   │ 3612   │ 0         │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq3    │  4.03 │ 60      │  7.1   │  734   │ 0         │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq4    │  3.71 │ 66      │  8.3   │  959   │ 0         │
  └────────┴───────┴─────────┴────────┴────────┴───────────┘

@mverrilli commented on GitHub (May 2, 2026):

@johny-mnemonic @YKesX Here is my branch: https://github.com/mverrilli/ollama/tree/turboquant-qjl

It works on CUDA, Metal and ROCm. I have a lot of presets in this branch since I'm doing side-by-side testing of different quants. On ROCm I am still digging into performance.

The closest implementation to the paper is tq*qa. However, PPL and performance (even at ctx=512) are bad. This is where I am right now: looking into tq*qa to try to isolate what I assume is a defect (I think I've narrowed it down to the asymmetric mean centering; still working through it).

  ┌────────┬───────┬─────────┬────────┬────────┬───────────┐
  │ preset │  PPL  │ prefill │ decode │ KV MiB │ total MiB │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ f16    │ 14.58 │ 391     │ 75.8   │ 84     │ 2469      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq2qa  │ 36.78 │ 163     │ 17.4   │ 23     │ 2381      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq3qa  │ 39.33 │ 163     │ 17.3   │ 28     │ 2387      │
  ├────────┼───────┼─────────┼────────┼────────┼───────────┤
  │ tq4qa  │ 24.63 │ 163     │ 17.4   │ 33     │ 2391      │
  └────────┴───────┴─────────┴────────┴────────┴───────────┘

TurboQuant

  • tq[2|3|4](k) -> symmetric, no outliers, no QJL.
  • tq[2|3|4](k)a -> adds asymmetric mean-centering.
  • tq[2|3]q -> adds QJL (no tq4q).
  • tq[2|3|4](k)qa -> full stack: asymmetric + outliers + QJL.

Notes:

  • k suffix means K-only; V stays at f16.
  • QJL and outlier split are not in the primary tq set because they hurt PPL and performance.
  • I have these presets split out so I can contrast and isolate issues. Intention is to clean up for the PR.

Others

  • q8k/q8kv -> per-group int8, group=32.
  • q4k/q4kv -> per-group int4, group=32.

@mverrilli commented on GitHub (May 2, 2026):

Just a quick update: I did solve the correctness issue with tq*qa (the full-paper implementation) and solved some of the performance issues. The branch is updated, but I'm going to continue on with some more performance investigations. Metal actually performs quite well at low ctx (f16 speeds) even though I do see some areas for improvement there.

Also on my list is to check the qwen2 family now that I fixed the QJL correctness issue.

I'll post an update later.

@mverrilli commented on GitHub (May 3, 2026):

QJL is a net negative on PPL on every model I tested, not a small-margin tradeoff. Removing it also gave 12–27% better prefill and 9–27% better decode throughput (tok/s).

At this point, I don't see any reason to keep it as part of the implementation. Happy to leave it in if anyone wants to test themselves. This was also critiqued in the other papers I mentioned previously.

I am going to test outliers and asym next just to be sure but I think they help with qwen2 style models (biased K).

@mverrilli commented on GitHub (May 3, 2026):

Confirmed.

QJL is universally net negative.

Asymmetric alone is slightly better for qwen2 but hurts other models.

Outliers help universally.

Outliers also fixed asymmetric on all other models.

So I recommend keeping asymmetric and outliers, and dropping QJL.
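For context on what "outliers" means here, a minimal sketch of the outlier-split idea (the threshold and storage layout are illustrative): a few large-magnitude entries are pulled out and kept at higher precision so they don't blow up the quantization scales for the rest of the group.

```go
package main

import "fmt"

// splitOutliers separates values whose magnitude exceeds threshold.
// The dense remainder goes through the low-bit quantizer with gentler
// scales; the few outliers are stored sparsely at higher precision
// (index + value) and added back on dequantization.
func splitOutliers(vals []float32, threshold float32) (dense []float32, idx []int, outliers []float32) {
	dense = make([]float32, len(vals))
	for i, v := range vals {
		if v > threshold || v < -threshold {
			idx = append(idx, i)
			outliers = append(outliers, v)
			dense[i] = 0 // placeholder in the quantized path
		} else {
			dense[i] = v
		}
	}
	return dense, idx, outliers
}

func main() {
	k := []float32{0.2, -0.1, 7.5, 0.4, -6.9, 0.05}
	dense, idx, outliers := splitOutliers(k, 1.0)
	fmt.Println(dense)         // quantize this with the normal low-bit path
	fmt.Println(idx, outliers) // keep these few entries at higher precision
}
```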

@mverrilli commented on GitHub (May 4, 2026):

I've updated PR #15505: tq2 / tq3 / tq4 for CUDA, ROCm and Metal. I'll add some benchmarks in the next couple of days.

Reference: github-starred/ollama#71720