[GH-ISSUE #13051] GPT-oss-120b KV cache defragmentation. #8648

Closed
opened 2026-04-12 21:23:39 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @SingularityMan on GitHub (Nov 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13051

What is the issue?

When I chat with gpt-oss-120b on my MaxQ, there is sometimes a delay in TTFT even though there is still ~20-40GB of VRAM available and I'm running the model at a 128K context length. Looking at the logs, Ollama seems to perform KV cache defragmentation on this model about as often as it does with Gemma-3.

It doesn't happen all the time, but it can be a little annoying when it does, because I don't think it should be happening with my current setup. Granted, I don't know how Ollama handles this model, but I don't see why KV cache defragmentation would kick in when there is still leftover VRAM available on the same GPU.
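
For reference, a quick way to put a number on that TTFT gap is to time the first streamed chunk from the local API. This is just a sketch; it assumes the default localhost:11434 endpoint, the streaming /api/chat route, and the gpt-oss:120b tag:

```python
import json
import time
import urllib.request

# Rough TTFT check against a local Ollama instance. Assumptions: default
# localhost:11434 endpoint, streaming /api/chat route, gpt-oss:120b tag.
payload = json.dumps({
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=payload,
    headers={"Content-Type": "application/json"},
)

start = time.monotonic()
with urllib.request.urlopen(req) as resp:
    first_line = resp.readline()  # first streamed JSON chunk ~= first token back
elapsed = time.monotonic() - start

print(f"TTFT: {elapsed:.2f}s")
print(json.loads(first_line).get("message", {}).get("content", ""))
```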

I also made it a point last year to set CUDA_VISIBLE_DEVICES to 0 (the GPU I use for inference), and I disabled System Memory Fallback on Windows for Ollama to prevent accidental RAM blowups. Even so, I've never run into OOM issues with this model; it performs very well in spite of that.

Relevant log output

time=2025-11-11T10:01:32.260-05:00 level=DEBUG source=sched.go:602 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2025-11-11T10:01:40.396-05:00 level=DEBUG source=server.go:1401 msg="completion request" images=0 prompt=78804 format=""
time=2025-11-11T10:01:40.667-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=7358 prompt=17128 used=4117 remaining=13011
time=2025-11-11T10:01:41.520-05:00 level=DEBUG source=causal.go:442 msg="defragmenting kv cache"
time=2025-11-11T10:01:41.864-05:00 level=DEBUG source=causal.go:442 msg="defragmenting kv cache"
time=2025-11-11T10:01:42.060-05:00 level=DEBUG source=causal.go:442 msg="defragmenting kv cache"
time=2025-11-11T10:01:43.110-05:00 level=DEBUG source=causal.go:442 msg="defragmenting kv cache"
time=2025-11-11T10:01:43.506-05:00 level=DEBUG source=causal.go:442 msg="defragmenting kv cache"
time=2025-11-11T10:01:43.731-05:00 level=DEBUG source=causal.go:442 msg="defragmenting kv cache"
time=2025-11-11T10:01:44.816-05:00 level=DEBUG source=causal.go:442 msg="defragmenting kv cache"
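
Reading the numbers in that "loading cache slot" line (my interpretation of the fields, not necessarily how the runner defines them), the arithmetic works out like this:

```python
# Fields from the "loading cache slot" line above. My reading (an assumption,
# not taken from the Ollama source): `used` is the prefix reused from the
# existing cache and `remaining` is what still has to be prefilled.
cache = 7358      # tokens already sitting in the slot
prompt = 17128    # tokens in the incoming prompt (per this log line)
used = 4117       # tokens reused from the cached prefix
remaining = prompt - used
print(remaining)  # 13011, which matches the log, so most of the prompt is re-processed
```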

UPDATE: I think there's some confusion here. No way in hell is the prompt actually that large. I've tried counting the tokens with OpenAI-family models and they estimate ~20K tokens, not 78804. I checked all the sources the text input is assembled from, and none of them come anywhere near the prompt count in the logs. Is Ollama calculating this differently?

UPDATE 2: I think I figured it out. It looks like whitespace and newline characters were the culprit; they get tokenized too. Interesting, honestly. I shaved off a huge chunk of tokens that way.
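
For anyone comparing counts the same way, here's roughly what that check looks like. This is a sketch: it uses tiktoken's o200k_base encoding as a stand-in for the gpt-oss tokenizer and a hypothetical prompt.txt dump, so the numbers won't match Ollama's count exactly:

```python
import re
import tiktoken  # pip install tiktoken

# o200k_base is a stand-in; gpt-oss may tokenize slightly differently, so treat
# the counts as a rough comparison rather than an exact match with Ollama's number.
enc = tiktoken.get_encoding("o200k_base")

with open("prompt.txt", encoding="utf-8") as f:  # hypothetical dump of the full prompt
    text = f.read()

# Whitespace gets tokenized too, so collapse runs of spaces/tabs and long blank stretches.
trimmed = re.sub(r"[ \t]+", " ", text)
trimmed = re.sub(r"\n{3,}", "\n\n", trimmed)

print("original:", len(enc.encode(text)))
print("trimmed: ", len(enc.encode(trimmed)))
```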

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.12.9

GiteaMirror added the bug label 2026-04-12 21:23:39 -05:00