[GH-ISSUE #11365] KV cache not being used for Gemma 3 models #7499

Closed
opened 2026-04-12 19:34:55 -05:00 by GiteaMirror · 4 comments

Originally created by @lgruen-vcgs on GitHub (Jul 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11365

What is the issue?

With Gemma 3 models, the KV cache doesn't seem to get used, even when the exact same request is sent multiple times in a row (with nothing else in between).

The "loading cache slot" log messages always show used=0, and the request timings don't show the expected KV cache hit speed-up.

time=2025-07-11T10:18:03.500+10:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=40557 format=""
time=2025-07-11T10:18:03.567+10:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[2]
time=2025-07-11T10:18:03.568+10:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=10429 used=0 remaining=10429
[GIN] 2025/07/11 - 10:18:08 | 200 | 10.801837987s |       127.0.0.1 | POST     "/api/generate"
time=2025-07-11T10:18:08.128+10:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-07-11T10:18:08.128+10:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gemma3:27b-it-qat runner.inference=cuda runner.devices=1 runner.size="24.7 GiB" runner.vram="24.7 GiB" runner.parallel=1 runner.pid=1233500 runner.model=/misc/data/ollama-models/blobs/sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 runner.num_ctx=40000 duration=2562047h47m16.854775807s
time=2025-07-11T10:18:08.128+10:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gemma3:27b-it-qat runner.inference=cuda runner.devices=1 runner.size="24.7 GiB" runner.vram="24.7 GiB" runner.parallel=1 runner.pid=1233500 runner.model=/misc/data/ollama-models/blobs/sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 runner.num_ctx=40000 refCount=0
time=2025-07-11T10:18:09.798+10:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/misc/data/ollama-models/blobs/sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87
time=2025-07-11T10:18:09.799+10:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=40557 format=""
time=2025-07-11T10:18:09.839+10:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[2]
time=2025-07-11T10:18:09.840+10:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=10432 prompt=10429 used=0 remaining=10429
time=2025-07-11T10:18:14.347+10:00 level=DEBUG source=runner.go:548 msg="hit stop token" pending="[< end _ of _ turn >]" stop=<end_of_turn>
[GIN] 2025/07/11 - 10:18:14 | 200 |  4.636694058s |       127.0.0.1 | POST     "/api/generate"

The exact same setup works perfectly fine for Qwen 3, with proper KV cache reuse, so I suspect this might be specific to Gemma 3. The log snippet below is for Qwen 3, showing used=10725.

time=2025-07-11T09:58:42.754+10:00 level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/misc/data/ollama-models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-07-11T09:58:42.755+10:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=40551 format=""
time=2025-07-11T09:58:42.782+10:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=10775 prompt=10726 used=10725 remaining=1
[GIN] 2025/07/11 - 09:58:43 | 200 |  1.273330533s |       127.0.0.1 | POST     "/api/generate"

The only obvious difference is this line, but would that break the KV cache?

time=2025-07-11T10:18:09.839+10:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[2]

I've tried the gemma3:4b, gemma3:27b, gemma3:4b-it-qat, and gemma3:27b-it-qat variants, all behaving the same.

I've attached the full logs where the same request was sent three times in a row:
ollama-serve-gemma-kv-cache-miss.log (https://github.com/user-attachments/files/21173166/ollama-serve-gemma-kv-cache-miss.log)

I've also attached the request payload, which was simply submitted using cURL:
kv_cache_hit_test_gemma.json (https://github.com/user-attachments/files/21173244/kv_cache_hit_test_gemma.json)

curl -X POST http://localhost:11434/api/generate -d @kv_cache_hit_test_gemma.json

I've also tried adding cache_prompt: true, but that didn't make a difference.

The server was started with OLLAMA_NUM_PARALLEL=1, OLLAMA_KEEP_ALIVE=-1, and OLLAMA_DEBUG=1 for the debug logs.

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.9.6

GiteaMirror added the bug label 2026-04-12 19:34:55 -05:00

@lgruen-vcgs commented on GitHub (Jul 11, 2025):

When using the llama.cpp server directly (e.g. with gemma-3-27b-it-q4_0.gguf), the cache works as expected -- note the "prompt_n": 1 below for a subsequent request, meaning only a single new token had to be evaluated while the rest of the prompt was served from the cache:

{
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 18.155,
    "prompt_per_token_ms": 18.155,
    "prompt_per_second": 55.081244836133294,
    "predicted_n": 9,
    "predicted_ms": 132.178,
    "predicted_per_token_ms": 14.686444444444444,
    "predicted_per_second": 68.08999984868889
  }
}

@lgruen-vcgs commented on GitHub (Jul 11, 2025):

I had to enable --swa-full to avoid the following, though:

slot launch_slot_: id  0 | task 2014 | processing task
slot update_slots: id  0 | task 2014 | new prompt, n_ctx_slot = 40192, n_keep = 0, n_prompt_tokens = 10423
slot update_slots: id  0 | task 2014 | n_past = 10423, cache_tokens.size() = 10922, seq_id = 0, pos_min = 9716, n_swa = 1024
slot update_slots: id  0 | task 2014 | forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 2014 | kv cache rm [0, end)
slot update_slots: id  0 | task 2014 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.196489
slot update_slots: id  0 | task 2014 | kv cache rm [2048, end)
slot update_slots: id  0 | task 2014 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.392977
slot update_slots: id  0 | task 2014 | kv cache rm [4096, end)
slot update_slots: id  0 | task 2014 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.589466
slot update_slots: id  0 | task 2014 | kv cache rm [6144, end)
slot update_slots: id  0 | task 2014 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.785954
slot update_slots: id  0 | task 2014 | kv cache rm [8192, end)
slot update_slots: id  0 | task 2014 | prompt processing progress, n_past = 10240, n_tokens = 2048, progress = 0.982443
slot update_slots: id  0 | task 2014 | kv cache rm [10240, end)
slot update_slots: id  0 | task 2014 | prompt processing progress, n_past = 10423, n_tokens = 183, progress = 1.000000
slot update_slots: id  0 | task 2014 | prompt done, n_past = 10423, n_tokens = 183
slot      release: id  0 | task 2014 | stop processing: n_past = 10922, truncated = 0
slot print_timing: id  0 | task 2014 |
prompt eval time =    3733.67 ms / 10423 tokens (    0.36 ms per token,  2791.62 tokens per second)
       eval time =    8168.88 ms /   500 tokens (   16.34 ms per token,    61.21 tokens per second)
      total time =   11902.55 ms / 10923 tokens
srv  update_slots: all slots are idle

@jessegross commented on GitHub (Jul 11, 2025):

It's an artifact of sliding window attention, which gemma3 uses but qwen3 does not. To save memory, the model only looks at (and the cache only saves) the last 1024 tokens. If you have partial reuse of the cache and the window of the new sequence is not fully contained in the saved tokens, then it has to recompute the cache.

This works fine if you are continuing a conversation, but if you send the same original prompt again, the window will already have moved on due to the tokens generated in response to the previous prompt.

Llama.cpp's --swa-full disables this optimization and keeps all of the history in memory, but Ollama does not offer an equivalent option.
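
To make the failure mode concrete, here is a rough sketch of the reuse condition this implies, using the numbers from the llama.cpp log above. The function and the exact check are illustrative only, not the actual llama.cpp or Ollama code:

package main

import "fmt"

// canReuseSWACache is an illustrative sketch of the condition behind the
// "forcing full prompt re-processing" message above. With sliding window
// attention the cache only retains roughly the last nSWA positions, so a
// matched prefix of nPast tokens is only reusable if the oldest cached
// position (posMin) still covers the start of the attention window for the
// next token to be processed.
func canReuseSWACache(posMin, nPast, nSWA int) bool {
	windowStart := nPast - nSWA
	if windowStart < 0 {
		windowStart = 0
	}
	return posMin <= windowStart
}

func main() {
	// Numbers from the llama.cpp log above: generating the previous response
	// slid the oldest cached position to 9716, so resending the same
	// 10423-token prompt cannot be served from the cache.
	fmt.Println(canReuseSWACache(9716, 10423, 1024)) // false -> full re-processing

	// Continuing the conversation instead (the new prompt includes the
	// previous response, ~10922 tokens) keeps the window start ahead of
	// posMin, so the cached prefix can be reused.
	fmt.Println(canReuseSWACache(9716, 10922, 1024)) // true
}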


@lgruen commented on GitHub (Jul 12, 2025):

Thanks a lot for explaining, @jessegross! Good to know it's expected behavior.

Reference: github-starred/ollama#7499