Originally created by @grepin on GitHub (Mar 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15051
@rick-github @jessegross jfyi https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/ + https://arxiv.org/pdf/2504.19874
@grepin commented on GitHub (Mar 25, 2026):
The meta-algorithm is described in the paper and doesn't seem too hard to implement. An implementation could significantly reduce KV-cache size while keeping quality, and introduce a compute speedup in most cases.
@goedzo commented on GitHub (Mar 25, 2026):
Upvote👍
@OrBeProgrammed commented on GitHub (Mar 25, 2026):
Critical need.
@postEntropy commented on GitHub (Mar 25, 2026):
Need it ASAP
@grepin commented on GitHub (Mar 26, 2026):
@OrBeProgrammed @postEntropy @goedzo guys, don't push the devs; let them decide when and how to implement it (or not). Yes, TQ is a cool thing (and from my POV it will first of all help their business with ollama:cloud), but as in any project there are limited resources and plans. The feature request has been made, so let things take their course.
@OrBeProgrammed commented on GitHub (Mar 26, 2026):
I'm not pushing anyone I'm working on a PR myself :) We're all in this together!
@grepin commented on GitHub (Mar 26, 2026):
btw, @OrBeProgrammed: https://github.com/ggml-org/llama.cpp/issues/20977; enthusiasts are already trying to implement it: https://github.com/ggml-org/llama.cpp/compare/master...mudler:llama.cpp:feat/turbo-quant
https://github.com/TheTom/turboquant_plus (the code seems useful for understanding the technical implementation and for transferring/porting it to Ollama's Go-based codebase)
@grepin commented on GitHub (Mar 26, 2026):
Most likely all the frontier inference engines will implement TQ within a couple of months.
@grepin commented on GitHub (Mar 26, 2026):
Yep, this looks good as a reference Python implementation with post-generation KV-cache analysis & attention quality tests: https://github.com/TheTom/turboquant_plus
@123Haynes commented on GitHub (Mar 26, 2026):
the discussion here also documents some pitfalls during the implementation: https://github.com/ggml-org/llama.cpp/discussions/20969
@grepin commented on GitHub (Mar 26, 2026):
+1 to implementations (PoC with measurement results): https://github.com/vllm-project/vllm/issues/38171#issuecomment-4134002937
@kblood commented on GitHub (Mar 26, 2026):
Oh yes, hoped to see Ollama support for this the first time I read about it :)
@codyseally commented on GitHub (Mar 26, 2026):
Absolute upvote !
@XZzYassin commented on GitHub (Mar 26, 2026):
Oleh! 🌹
@grepin commented on GitHub (Mar 27, 2026):
@mobilexmt commented on GitHub (Mar 28, 2026):
please also consider DGX Spark, thanks!
@richardokonicha commented on GitHub (Mar 28, 2026):
Hurray
@christopheduc-me commented on GitHub (Mar 29, 2026):
100% Upvoted of course! We need it to unlock new possibilities for our local usage.
@dorinsimionescu commented on GitHub (Mar 30, 2026):
upvote too
@QAM commented on GitHub (Mar 30, 2026):
+1 plz
@grepin commented on GitHub (Mar 30, 2026):
in fact, wip: https://github.com/ollama/ollama/pull/15125
If you have enough understanding of the engine plus the paper/algorithm (or are ready to dive into it yourself, or with any AI agent and a non-blind review of everything the AI produces during implementation), you can help. As always in open source, you are on your own and code & tests are the only source of truth, but I hope that Dankguy17 and YKesX could guide your efforts (for example, as usual, many more test runs on different models are needed to find problems and improve the implementation). Many thanks to @Dankguy17 and @YKesX anyway for their contribution.
@Sreekmans commented on GitHub (Mar 30, 2026):
+1
@Reikagilu commented on GitHub (Mar 30, 2026):
+1
@medenijazbec commented on GitHub (Mar 31, 2026):
I took a pass at implementing this on top of Ollama v0.18.3 and published the work here: https://github.com/medenijazbec/ollama-turboquant/tree/turboquant-0.18.3
Current status:
- kv_cache_type
- ollama run --turboquant
- ollama-bench -turboquant
- PARAMETER kv_cache_type ...
Benchmarking
I’ve also been running baseline vs TurboQuant experiments focused on KV pressure and long-context behavior.
I included multiple benchmark variants as well, including lighter / extra-light passes, because the available hardware is limited and some of the longer sweeps are expensive to run reliably.
Documentation
I also wrote up the implementation and benchmark notes in the repo here:
- docs/turboquant_paper_design.md
- docs/turboquant_audit.md
- docs/kvstress-test-commands.txt
Important caveats
So I would treat the current branch as a reference implementation / experimentation branch rather than a finished upstream proposal.
Sharing it in case it helps others compare approaches, reuse some of the API / CLI / benchmark surface work, or validate against their own hardware. If others run similar tests and publish results, that would be very useful too.
Happy to compare notes with anyone working on the engine-side implementation or benchmark methodology.
@jclab-joseph commented on GitHub (Mar 31, 2026):
There seems to be a lively discussion taking place at https://github.com/ggml-org/llama.cpp/discussions/20969! I think it would be good to refer to it.
@msk-one commented on GitHub (Apr 1, 2026):
+1
@Blue-Crescent commented on GitHub (Apr 3, 2026):
vote
@lxdlam commented on GitHub (Apr 4, 2026):
vote for this
@Readon commented on GitHub (Apr 7, 2026):
I think ggml-org/llama.cpp#21038 has already implemented this.
@ZevAlain commented on GitHub (Apr 11, 2026):
vote
@hajhouj commented on GitHub (Apr 11, 2026):
If Ollama actually pulls this off, it’s going to hit hard. It’s wild to think that a setup with just 8GB of VRAM could handle a model originally designed for something like a 48GB GPU. That kind of leap really pushes us closer to a world where fully autonomous, self-hosted AI isn’t just hype… it’s right around the corner.
@mverrilli commented on GitHub (Apr 11, 2026):
I put together this PR if anyone wants to review and build. #15505
Adds tq2/tq3/tq2k/tq3k KV cache types implementing TurboQuant (arXiv 2504.19874) — a GPU-resident compressed K/V path built from Householder QR rotation plus Lloyd-Max scalar quantization, with new CUDA kernels for encode, dequant, and fused flash attention.
Roughly doubles usable context per VRAM dollar on Pascal+ GPUs at near-f16 quality: tq3k matches f16 PPL on llama3.2/gemma3/qwen3-coder with ~40% KV savings, tq3 gives ~80% KV savings for a ~0.5% PPL cost, and the K-only variants (tq3k/tq2k) work even with flash attention disabled.
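For intuition, here is a minimal NumPy sketch of the rotate-then-scalar-quantize idea described above: a random orthogonal rotation obtained via QR decomposition, followed by a Lloyd-Max codebook fitted to the rotated values. The shapes, codebook size, and fitting data are illustrative assumptions, not the PR's actual CUDA path.

```python
# Rotate-then-quantize sketch in the spirit of TurboQuant (arXiv:2504.19874):
# rotate each KV row with a random orthogonal matrix, then map coordinates to
# the nearest level of a Lloyd-Max scalar codebook. Purely illustrative.
import numpy as np

def lloyd_max_codebook(x, levels=8, iters=25):
    """Fit a 1-D Lloyd-Max quantizer: alternate nearest-level assignment and
    recentering each level on the mean of its assigned samples."""
    codebook = np.quantile(x, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(levels):
            members = x[idx == j]
            if members.size:
                codebook[j] = members.mean()
    return np.sort(codebook)

def encode(kv, rot, codebook):
    """Rotate rows (tokens x head_dim) and store nearest-level indices."""
    rotated = kv @ rot
    return np.abs(rotated[..., None] - codebook).argmin(axis=-1).astype(np.uint8)

def decode(indices, rot, codebook):
    """Look up codebook levels and undo the rotation (rot is orthogonal)."""
    return codebook[indices] @ rot.T

rng = np.random.default_rng(0)
tokens, head_dim = 512, 64                      # illustrative KV block shape
kv = rng.standard_normal((tokens, head_dim)).astype(np.float32)

rot, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))  # orthogonal
codebook = lloyd_max_codebook((kv @ rot).ravel(), levels=8)       # ~3-bit levels

recon = decode(encode(kv, rot, codebook), rot, codebook)
print("relative reconstruction error:",
      np.linalg.norm(kv - recon) / np.linalg.norm(kv))
```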
@mverrilli commented on GitHub (Apr 11, 2026):
@hajhouj On any GPU, TurboQuant tq3k gets you ~40% KV cache savings with PPL essentially unchanged from f16 and tq3 gets you ~80% KV savings at a small cost. That doesn’t change what models fit on your card, but it roughly doubles the usable context length for whatever model you’re already running.
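For a sense of scale, here is back-of-envelope arithmetic on what those savings mean at a fixed KV-cache budget. The model shape and the 4 GiB budget below are assumptions chosen for illustration, not measurements from the PR.

```python
# At a fixed VRAM budget, a 40% smaller KV entry fits ~1.7x the tokens and an
# 80% smaller one fits ~5x. Hypothetical model: 32 layers, 8 KV heads,
# head_dim 128, f16 cache; 4 GiB reserved for the KV cache.
layers, kv_heads, head_dim, f16_bytes = 32, 8, 128, 2
per_token_f16 = 2 * layers * kv_heads * head_dim * f16_bytes   # K + V, bytes per token

budget = 4 * 1024**3
for name, savings in (("f16 ", 0.00), ("tq3k", 0.40), ("tq3 ", 0.80)):
    per_token = per_token_f16 * (1 - savings)
    print(f"{name}: {per_token / 1024:6.1f} KiB/token -> "
          f"{budget / per_token:10,.0f} tokens in budget")
```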
@hajhouj commented on GitHub (Apr 11, 2026):
Thanks for the info. I came across a news article about this new algorithm claiming it reduces memory usage by a factor of six, which might be a bit exaggerated. Still, the ability to increase context length without sacrificing speed is genuinely promising. It feels like another big step toward making low-cost, self-hosted AI more practical, potentially opening the door for wider desktop-level use.
@mverrilli commented on GitHub (Apr 12, 2026):
All good. This does that, but it's just not total VRAM. It's KV cache. I was able to get a factor of 5 reduction (80%) with tq3. It's a little bit hyped but I think still very good.
@achraf99999 commented on GitHub (Apr 13, 2026):
did you manage to run turbo quant from this PR ? if yes , can you share your configuration and hardware setup please ?
@OrBeProgrammed commented on GitHub (Apr 13, 2026):
I got this working. I am not sure it is working with all models. It loads the model insanely faster than before. I just told Claude to make the PR happen, so I'm not sure exactly what all it did. Is there a particular piece of info I can share that would help?
@medenijazbec commented on GitHub (Apr 13, 2026):
Please provide some tests. I've got mine ready but can't find the time to run them all. There also already seems to be a full implementation in llama.cpp; I haven't really been following that thread for about two weeks, so there might be a lot of new findings. What I've done is take inspiration from their ideas and credit them in my code for TurboQuant, since the folks over there are way smarter than I am, lmao. Anyway, I think a good test for this would be to connect it to something like Claude Code and make it run 3 runs of https://gist.github.com/ivanfioravanti/98ba7e5d3f7a88c1756b045d3e565630 using native Ollama, then compare the average to the average results of your TurboQuant implementation.
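A minimal sketch of the comparison step suggested above (several runs per build, compare the averages). The scores are placeholders to be filled in from whatever harness actually drives the runs.

```python
# Compare mean benchmark scores of a baseline build vs. a TurboQuant build.
# Placeholder data only; real scores would come from repeated harness runs.
from statistics import mean, stdev

baseline_runs   = [0.0, 0.0, 0.0]   # fill in measured scores, one per run
turboquant_runs = [0.0, 0.0, 0.0]

def summarize(name, runs):
    print(f"{name}: mean={mean(runs):.3f} stdev={stdev(runs):.3f} n={len(runs)}")

summarize("baseline  ", baseline_runs)
summarize("turboquant", turboquant_runs)
print(f"delta (turboquant - baseline): {mean(turboquant_runs) - mean(baseline_runs):+.3f}")
```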
@mverrilli commented on GitHub (Apr 21, 2026):
FYI: Added Metal, and ROCM if anyone wants to test it out and report back.
@Dankguy17 commented on GitHub (Apr 21, 2026):
Yeah, I can try right now - although there is definitely a lower chance that the PR gets accepted because it adds nearly 60k lines of code lol. Did you forget to gitignore something??
@mverrilli commented on GitHub (Apr 21, 2026):
@Dankguy17 vendor patch issue. I thought I fixed it but I think it was in another branch. Pushing the branch after my build test.
@Dankguy17 commented on GitHub (Apr 21, 2026):
cool! will discuss more in your pr
@DATEx2 commented on GitHub (Apr 25, 2026):
So when will it be released? We kind of all need TurboQuant.
@mverrilli commented on GitHub (Apr 26, 2026):
TQ is not a small PR. To be useful, it has to compress the KV cache, avoid slowing down prefill or decode too much, stay off the paths that use the scratch buffer (which would offset the VRAM savings), and keep the output coherent.
The branch I have right now does TQ, but has two issues:
@mverrilli commented on GitHub (Apr 27, 2026):
After really digging in on this, I am starting to think TQ isn't really the best solution despite the claims in the paper. Certainly I was able to get a coherent, highly compressed KV cache. Performance issues aside (some of which can be improved, and some that are much improved on newer hardware), the perplexity scores drift from f16 quite a bit. This does not appear to be as lossless as expected.
It's possible it is due to something in my implementation, however I went back to the paper and noticed some things. First, the paper abstract sounds as if this is a general solution, however the paper itself is pretty specific about the models it selected and the method in which the loss was measured.
In addition, I read several papers that cite or critique TurboQuant. One key finding: the QJL residual component (part of what makes TQ's compression work) has a known accuracy degradation that compounds per layer; the paper only tested on 32-layer models, and the math suggests it would break down badly on larger ones (arXiv:2604.19528). Another paper points out that minimizing reconstruction error (what TQ optimizes for) isn't the same as minimizing perplexity loss, and the two can diverge significantly (arXiv:2602.05367). I also recommend reading arXiv:2604.18555 (I did correct the flaw it describes and benchmarked it; the effect was minimal, though).
I'm doing some more benchmarks and will post them when they finish. I'll update a branch tonight in case anyone wants to also take a look. I do have two other approaches in progress and they are simpler so I may be able to get those benchmarked as well.
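As a toy illustration of the per-layer compounding concern, here is a quick calculation assuming each layer contributes a fixed relative error that propagates multiplicatively. The error value, the layer counts, and the propagation model are assumptions for illustration, not results from TurboQuant or the cited papers.

```python
# If each layer's quantized KV read adds a relative error eps that compounds
# multiplicatively, the accumulated error after L layers is (1 + eps)**L - 1.
# eps = 1% is a made-up figure chosen only to show the growth shape.
eps = 0.01
for layers in (32, 48, 80, 126):
    accumulated = (1 + eps) ** layers - 1
    print(f"{layers:3d} layers -> ~{accumulated:.0%} accumulated error")
```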
@mverrilli commented on GitHub (Apr 27, 2026):
Also, I put together a better perplexity measurement tool than the one I used previously. The previous one measured self-PPL; now I'm using reference PPL (forward passes on WikiText-2, etc.).
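For reference, here is a generic sketch of this kind of reference-PPL measurement: fixed-window forward passes over WikiText-2 with a Hugging Face causal LM, exponentiating the mean token NLL. The model name, window size, and non-overlapping-window simplification are placeholder assumptions; this is the textbook recipe, not the exact tool used for the numbers below.

```python
# Reference perplexity over WikiText-2 via full forward passes with labels.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"            # placeholder; any causal LM works
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(device)

window, total_nll, counted = 2048, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1) - window, window):   # non-overlapping windows
        chunk = ids[:, start:start + window]
        loss = model(chunk, labels=chunk).loss              # mean NLL per predicted token
        total_nll += loss.item() * (window - 1)
        counted += window - 1

print("reference perplexity:", math.exp(total_nll / counted))
```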
@mverrilli commented on GitHub (Apr 29, 2026):
Here's some output from some experimental compressions. https://gist.github.com/mverrilli/dbd9935bdec44495e635a3c5cdf611d0
f16 - baseline, no compression
q4_0 / q8_0 - block quantization borrowed from weight quant.
tq (TurboQuant) - rotation + Lloyd-Max codebook; the *qa variants were tests where I added some extra features (QJL, outlier split, asymmetric).
q8k / q4k - per-group asymmetric int8/int4 (something clean and simple, a modification of an idea common in a few papers)
saw - same as q8k/q4k but with a Hadamard rotation first (arXiv:2604.19157)
This run was really about perplexity, not compression. Larger ctx would have better kv cache compression rates due to overhead.
But you can see TQ really isn't that great PPL-wise. saw8kv and saw4kv are the winners here. Need more tests though.
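As a rough illustration of the per-group asymmetric quantization described above, with and without a Hadamard rotation applied first, here is a NumPy sketch. The group size, bit width, and outlier pattern are assumptions for demonstration, not the benchmarked kernels.

```python
# Per-group asymmetric integer quantization (q8k/q4k-style), plus a variant
# that rotates each group with a normalized Hadamard matrix first ("saw"-style)
# to spread outliers before quantizing. Illustrative parameters only.
import numpy as np
from scipy.linalg import hadamard

def quant_dequant(x, bits=8, group=64):
    """Each group of `group` values gets its own scale and zero point from the
    group's min/max; values are rounded to the integer grid and mapped back."""
    qmax = (1 << bits) - 1
    g = x.reshape(-1, group)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    return (q * scale + lo).reshape(x.shape)

def saw_quant_dequant(x, bits=8, group=64):
    """Same, but rotate each group with an orthogonal Hadamard matrix first and
    rotate back after dequantization (h.T inverts h since it is orthogonal)."""
    h = hadamard(group) / np.sqrt(group)
    rotated = x.reshape(-1, group) @ h
    return (quant_dequant(rotated, bits=bits, group=group) @ h.T).reshape(x.shape)

rng = np.random.default_rng(1)
kv = rng.standard_normal((512, 128)).astype(np.float32)
kv[:, ::16] *= 6                                 # a few outlier channels
for name, fn in (("plain   ", quant_dequant), ("hadamard", saw_quant_dequant)):
    err = np.linalg.norm(kv - fn(kv, bits=4)) / np.linalg.norm(kv)
    print(f"{name} int4 relative error: {err:.4f}")
```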