[GH-ISSUE #11949] Bad performance on Gemma3 with OLLAMA_KV_CACHE_TYPE less than fp16 #33694

Closed
opened 2026-04-22 16:36:41 -05:00 by GiteaMirror · 2 comments

Originally created by @McBane87 on GitHub (Aug 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11949

What is the issue?

When using gemma3:12b-it-qat with OLLAMA_KV_CACHE_TYPE=q8_0 set, the tokens-per-second performance is much worse than with fp16.

Results

| Quant | Tokens per second |
| ----- | ----------------- |
| FP16  | 29.9              |
| Q8_0  | 5.5               |
| Q4_0  | 5.5               |
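
The issue does not say how these figures were measured. One way to reproduce a comparable number, assuming the final JSON response of Ollama's `/api/generate` endpoint is available, is to divide its `eval_count` field (generated tokens) by its `eval_duration` field (nanoseconds). The helper name `tokens_per_second` and the sample values below are hypothetical and chosen only for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Return generation throughput from a final /api/generate response.

    Assumes the response carries `eval_count` (tokens generated) and
    `eval_duration` (generation time in nanoseconds).
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # Convert nanoseconds to seconds while dividing.
    return eval_count * 1e9 / eval_duration_ns


# Hypothetical sample: 299 tokens generated in 10 seconds.
sample = {"eval_count": 299, "eval_duration": 10_000_000_000}
rate = tokens_per_second(sample)
```

Running the same prompt once per cache type and comparing the resulting rates should show whether the slowdown reported here reproduces on other hardware.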

As far as I know, the KV cache type cannot be set per model, so it is an all-or-nothing option. That is why I decided to report this as a bug. Maybe you could disable KV cache quantization for this model by hardcoding it in the source code, as you already do for other models?
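
Because the setting applies to the whole server, comparing cache types means restarting the server once per value. A minimal sketch of building the environment for each run follows. It assumes the values documented in the Ollama FAQ (`f16`, `q8_0`, `q4_0`; the issue writes the first as fp16) and that flash attention is enabled through `OLLAMA_FLASH_ATTENTION=1`, which the FAQ lists as a requirement for a quantized cache. The helper name `server_env` is hypothetical.

```python
import os

# Cache types as documented in the Ollama FAQ (assumption: unchanged in 0.11.5-rc2).
VALID_KV_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def server_env(kv_cache_type: str, base_env=None) -> dict:
    """Return a copy of the environment with the KV cache type set server-wide."""
    if kv_cache_type not in VALID_KV_CACHE_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_cache_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    # Enabled for every run so that only the cache type differs between runs.
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.Popen(["ollama", "serve"], ...)`. In a Docker setup like the one reported here, the same two variables would instead be passed to the container, for example with `docker run -e`.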

Relevant log output


OS

Linux Ubuntu 24.04 + Docker Container

GPU

2 x Nvidia RTX 3060 12GB

CPU

AMD Ryzen 7900

Ollama version

0.11.5-rc2

GiteaMirror added the bug label 2026-04-22 16:36:41 -05:00

@rick-github commented on GitHub (Aug 18, 2025):

#9683


@pdevine commented on GitHub (Aug 19, 2025):

Going to close as a dupe.


Reference: github-starred/ollama#33694