[GH-ISSUE #10945] KV Cache Quantization breaks Gemma3 #69266

Closed
opened 2026-05-04 17:36:33 -05:00 by GiteaMirror · 5 comments

Originally created by @mlaihk on GitHub (Jun 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10945

Originally assigned to: @jessegross on GitHub.

What is the issue?

With Ollama 0.8.0 and up to at least 0.9.0, if I set OLLAMA_KV_CACHE_TYPE to anything but the default (I have it at q8_0), Gemma3 models perform extremely poorly in terms of tokens/s and response accuracy.
Symptoms include long model load times, difficulties with tool calling (in OpenWebUI), RAG difficulties, etc.

Deleting the OLLAMA_KV_CACHE_TYPE environment variable and restarting Ollama restores Gemma3's performance.
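The workaround above can be sketched for a shell-managed server (the reporter is on Windows, where the variable would instead be removed via the system environment settings; falling back to the f16 default cache type is Ollama's documented behavior when the variable is unset):

```shell
# Clear the KV cache quantization setting so Ollama falls back to the
# default (f16) cache type, then restart the server in the same session.
unset OLLAMA_KV_CACHE_TYPE

# Confirm the variable is gone before restarting:
[ -z "${OLLAMA_KV_CACHE_TYPE:-}" ] && echo "OLLAMA_KV_CACHE_TYPE cleared"

# ollama serve    # restart the server (uncomment where ollama is installed)
```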

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.9.0

GiteaMirror added the bug label 2026-05-04 17:36:33 -05:00

@MarshallBelles commented on GitHub (Jul 2, 2025):

@mlaihk Hey, just curious, did you try any other models? Also it might help to know your exact CPU and GPU, and Windows version. Were you running ollama in WSL?


@warfair1337 commented on GitHub (Jul 6, 2025):

I'm having the same issue on Ollama 0.9.5 with gemma3:12b (pulled from ollama).
Whenever I use OLLAMA_KV_CACHE_TYPE with q4_0 or q8_0, gemma3:12b is extremely slow at loading and inference; I observe high CPU utilization and only 25-35% GPU utilization.

Not having any issues with models like mistral-nemo:12b and phi4:14b using the same configuration.

Running in Docker on Arch Linux.
RTX 4090 24 GB, Ryzen 9 9900X, 64 GB RAM


@mohammedgomaa commented on GitHub (Jul 19, 2025):

same here


@wagneramichael commented on GitHub (Oct 10, 2025):

Same issue here: enabling KV cache quantization causes Gemma3 to use a lot of CPU and suffer a massive performance drop.

Qwen models are not affected.

KV cache quantization off: PetrosStav/gemma3-tools:4b 50aaf11740b7 5.3 GB 100% GPU 4096 24 hours from now

OLLAMA_KV_CACHE_TYPE=q8_0: PetrosStav/gemma3-tools:4b 50aaf11740b7 5.2 GB 100% GPU 4096 24 hours from now

Running Fedora Linux 42.20251006.0 (IoT Edition)
Dell Inc. OptiPlex 5060
6x Intel(R) Core(TM) i5-8600 CPU @ 3.10GHz
NVIDIA GeForce RTX 2060 Super
Podman with CUDA
Rootful container for now; I'm not dealing with user namespacing until I get the performance where it should be, to keep things simple.

(Two screenshots attached.)

Variables for Ollama:
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_KEEP_ALIVE=24h
OLLAMA_ORIGINS=http://ollama.pod:11434,http://127.0.0.1:11434
OLLAMA_HOST=0.0.0.0:11434
OLLAMA_LOAD_TIMEOUT=2m

Podman quadlet config

```
[Unit]
Description=Ollama container
Requires=nvidia-cdi-generator
After=nvidia-cdi-generator

[Container]
ContainerName=ollama
HostName=ollama
Image=docker.io/ollama/ollama:latest
Pod=ollama.pod

Timezone=local

AddDevice=nvidia.com/gpu=all

EnvironmentFile=./%n.d/env

AutoUpdate=registry

Volume=ollama.volume:/root/.ollama

HealthCmd="pgrep -x ollama || exit 1"
HealthStartPeriod=5s
HealthInterval=30s
HealthTimeout=20s
HealthRetries=5

LogDriver=k8s-file
LogOpt=path=/tmp/ollama.log

Label=role=models
Label=app=ai

PodmanArgs=--cpus=4
PodmanArgs=--memory=9g

[Service]
Restart=always
TimeoutStartSec=60

[Install]
WantedBy=multi-user.target default.target
```

nvidia-cdi-generator.service

```
[Unit]
Description=Generate NVIDIA CDI configuration

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```
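As a quick check for the CPU fallback described in this thread, the PROCESSOR column of `ollama ps` can be scanned for anything other than a full GPU load. A minimal sketch, assuming the tabular `ollama ps` output format with a header row:

```shell
# Flag any loaded model that is not reported as fully on the GPU.
# NR > 1 skips the header row; exit status 1 signals partial CPU offload.
ollama ps | awk 'NR > 1 && NF && $0 !~ /100% GPU/ {print "partial CPU offload:", $1; bad=1} END {exit bad}'
```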

@jessegross commented on GitHub (Oct 10, 2025):

This is caused by the combination of features being unsupported by the existing CUDA kernels, which forces a fallback to the CPU. It is fixed in 0.12.5.
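Since the fix landed in 0.12.5, checking the installed version is a reasonable first step before re-enabling KV cache quantization for Gemma3. A sketch using `sort -V` for the comparison (the `grep` pattern for extracting the version from `ollama -v` output is an assumption):

```shell
required=0.12.5
current=$(ollama -v | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)

# sort -V orders versions numerically; if the required version sorts first
# (or equal), the installed version is new enough.
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -n1)" = "$required" ]; then
  echo "ok: $current includes the Gemma3 KV cache quantization fix"
else
  echo "upgrade needed: $current is older than $required"
fi
```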


Reference: github-starred/ollama#69266