[GH-ISSUE #8158] IBM Granite MoE & Dense-2b is very slow when KV Cache quantization is enabled #67265

Open
opened 2026-05-04 09:45:03 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @vYLQs6 on GitHub (Dec 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8158

What is the issue?

I found that all Granite MoE models, plus dense:2b, run extremely slow when KV cache quantization is enabled. There doesn't seem to be any hit on response quality, just speed, which is kind of strange.

I'm using Windows 11 + RTX 4090

Here is an example using model: granite3.1-moe:3b-instruct-q8_0

set OLLAMA_FLASH_ATTENTION=1 && set OLLAMA_KV_CACHE_TYPE=q8_0 && ollama serve
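For anyone reproducing this on Linux/macOS (e.g. the Ubuntu results later in this thread), the equivalent setup is a sketch like the following; the variable names are Ollama's documented ones, only the shell syntax differs:

```shell
# Linux/macOS equivalent of the cmd.exe one-liner above
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```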

>>> how far is the moon
The distance from Earth to the Moon can vary due to the elliptical shape of its orbit around our planet. On
average, it's about 238,855 miles (384,400 kilometers) away from Earth. However, this is approximately 238,855
miles (384,400 kilometers) at its closest approach and can range up to 252,088 miles (405,696 kilometers) during
its farthest point in its elliptical path.

total duration:       8.3218603s
load duration:        15.7633ms
prompt eval count:    49 token(s)
prompt eval duration: 242ms
prompt eval rate:     202.48 tokens/s
eval count:           130 token(s)
eval duration:        8.005s
eval rate:            16.24 tokens/s

ollama serve

>>> how far is the moon
The average distance from Earth to the Moon is approximately 238,855 miles (384,400 kilometers). However, it's
important to note that this can fluctuate slightly due to the elliptical nature of its orbit around our planet. At
its closest point, known as perigee, it's about 225,623 miles (363,104 kilometers) away from Earth, while at its
farthest point, called apogee, it can reach up to 252,088 miles (405,696 kilometers).

total duration:       4.2702016s
load duration:        805.8374ms
prompt eval count:    193 token(s)
prompt eval duration: 287ms
prompt eval rate:     672.47 tokens/s
eval count:           142 token(s)
eval duration:        3.115s
eval rate:            45.59 tokens/s

granite3.1-dense:2b also has the same issue.

ollama run granite3.1-dense:2b-instruct-q8_0 --verbose

set OLLAMA_FLASH_ATTENTION=1 && set OLLAMA_KV_CACHE_TYPE=q8_0 && ollama serve

>>> how far is the moon
As previously mentioned, the average distance from the Earth to the Moon is approximately 238,855 miles (384,400
kilometers). This value remains constant throughout their orbital motion around each other.

total duration:       3.7165709s
load duration:        847.2622ms
prompt eval count:    124 token(s)
prompt eval duration: 93ms
prompt eval rate:     1333.33 tokens/s
eval count:           52 token(s)
eval duration:        2.717s
eval rate:            19.14 tokens/s

ollama serve

>>> how far is the moon
The average distance from the Earth to the Moon is about 238,855 miles (384,400 kilometers). This distance is
often referred to as the semi-major axis of the Moon's elliptical orbit around the Earth.

total duration:       815.8894ms
load duration:        16.3291ms
prompt eval count:    49 token(s)
prompt eval duration: 286ms
prompt eval rate:     171.33 tokens/s
eval count:           61 token(s)
eval duration:        458ms
eval rate:            133.19 tokens/s

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.5.4

GiteaMirror added the bug label 2026-05-04 09:45:03 -05:00
Author
Owner

@coder543 commented on GitHub (Dec 18, 2024):

Strangely, with all Ollama settings at default, I have noticed that granite3.1-moe (3B) is never really any faster than llama3.2 (3B), and sometimes dramatically slower, even after trying several different quantizations for both. I would expect the MoE to be much faster given that most of the parameters are inactive.

Maybe this is related to the KV Cache issue that you're describing here.

All tests conducted on a Ubuntu server with a single RTX 3090:

// llama3.2:3b-instruct-q8_0
total duration:       2.87966925s
load duration:        11.646699ms
prompt eval count:    31 token(s)
prompt eval duration: 87ms
prompt eval rate:     356.32 tokens/s
eval count:           449 token(s)
eval duration:        2.78s
eval rate:            161.51 tokens/s
// llama3.2:3b-instruct-fp16
total duration:       3.585121225s
load duration:        11.333181ms
prompt eval count:    31 token(s)
prompt eval duration: 88ms
prompt eval rate:     352.27 tokens/s
eval count:           373 token(s)
eval duration:        3.484s
eval rate:            107.06 tokens/s
// granite3.1-moe:3b-instruct-q8_0
total duration:       2.946489419s
load duration:        4.619928ms
prompt eval count:    50 token(s)
prompt eval duration: 176ms
prompt eval rate:     284.09 tokens/s
eval count:           309 token(s)
eval duration:        2.692s
eval rate:            114.78 tokens/s
// granite3.1-moe:3b-instruct-fp16
total duration:       4.864525024s
load duration:        4.680112ms
prompt eval count:    50 token(s)
prompt eval duration: 11ms
prompt eval rate:     4545.45 tokens/s
eval count:           532 token(s)
eval duration:        4.847s
eval rate:            109.76 tokens/s
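A rough sanity check of the "MoE should be much faster" expectation above, under my own assumptions (not from the thread): decode is memory-bound, so tokens/s scales roughly inversely with active parameters per token; Granite 3.1 MoE 3B activates ~0.8B parameters per token (per IBM's "3b-a800m" naming), while llama3.2 3B is dense (~3.2B active):

```shell
# Naive memory-bound model: expected speedup = dense active params / MoE active params
echo "expected MoE speedup: $(awk 'BEGIN{printf "%.1f", 3.2/0.8}')x"
# Observed q8_0 eval rates from the runs above: 114.78 (MoE) vs 161.51 (dense)
echo "observed speedup:     $(awk 'BEGIN{printf "%.2f", 114.78/161.51}')x"
```

Even this crude model predicts roughly a 4x speedup, while the measured MoE is actually slower than llama3.2, which supports the suspicion that something beyond the KV cache setting is off.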
Author
Owner

@Rundao commented on GitHub (Dec 20, 2024):

It is also significantly slower than qwen2.5:3b, especially with longer inputs.

Author
Owner

@Justus-Jonas commented on GitHub (Jan 3, 2025):

same issue, any updates?

Author
Owner

@vYLQs6 commented on GitHub (Mar 6, 2025):

The issue still exists in Granite 3.2 2B models, vision included

Author
Owner

@jessegross commented on GitHub (Mar 25, 2025):

The 2B versions of the Granite models have a head dimension of 64 and we don't currently have a kernel for this when using a quantized KV cache. As a result, these operations get executed on the CPU and are slow.

You may have better luck with the 8b version; its head dimension of 128 is supported on the GPU.
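Given this explanation, a possible workaround for the 2B models (my sketch, using Ollama's documented cache types f16/q8_0/q4_0) is to keep flash attention enabled but leave the KV cache at its default f16 type, matching the cmd.exe style used earlier in this issue:

```shell
set OLLAMA_FLASH_ATTENTION=1 && set OLLAMA_KV_CACHE_TYPE=f16 && ollama serve
```

This trades the memory savings of a quantized cache for keeping attention on the GPU.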

Author
Owner

@jeffsweep commented on GitHub (Mar 26, 2025):

There's not an 8b version of the vision model is there?

Author
Owner

@jessegross commented on GitHub (Mar 26, 2025):

There's not an 8b version of the vision model is there?

Not that I'm aware of.

Author
Owner

@dotmobo commented on GitHub (Jun 25, 2025):

Same problem with granite3.3:2b and OLLAMA_KV_CACHE_TYPE: "q8_0"; it's very, very slow...


Reference: github-starred/ollama#67265