[GH-ISSUE #7965] It seems that the new KV cache quantization feature is incorrectly allocating resources. #5097

Closed
opened 2026-04-12 16:11:50 -05:00 by GiteaMirror · 2 comments

Originally created by @emzaedu on GitHub (Dec 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7965

What is the issue?

For example, with KV cache quantization set to q4_0:

/set parameter num_ctx 88000

`ollama ps` then reports:

Rombos-LLM-V2.6-Qwen-14b-Q4_K_M:latest 81d0d17e9f6a 21 GB 100% GPU 4 minutes from now

However, the actual VRAM usage is only 13,880,772 KiB (about 13.24 GB). There is a significant difference between the actual VRAM usage (13.24 GB) and what Ollama reports (21 GB).
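A rough, back-of-the-envelope sketch of why the gap could be this large: if the scheduler budgets the KV cache at f16 size while the runner actually allocates a q4_0 cache, the estimate overshoots by several GiB at num_ctx 88000. The layer/head dimensions below are assumptions for a 14B Qwen-class model, not values taken from this issue, and the q4_0 block size (18 bytes per 32 values) follows the ggml format.

```python
# Hypothetical KV cache sizing for a Qwen-14B-class model (assumed dims).
layers, kv_heads, head_dim = 48, 8, 128
num_ctx = 88_000

# K and V, per token, per layer.
values = 2 * layers * num_ctx * kv_heads * head_dim

f16_bytes = values * 2            # f16: 2 bytes per value
q4_0_bytes = values * 18 // 32    # q4_0: 18-byte block per 32 values

print(f"f16 KV cache:  {f16_bytes / 1024**3:.2f} GiB")
print(f"q4_0 KV cache: {q4_0_bytes / 1024**3:.2f} GiB")
```

Under these assumed dimensions the f16 estimate is roughly 16 GiB while the q4_0 cache is under 5 GiB, a gap in the same ballpark as the 21 GB reported vs. 13.24 GB observed.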

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.5.0

GiteaMirror added the bug label 2026-04-12 16:11:50 -05:00

@rick-github commented on GitHub (Dec 6, 2024):

https://github.com/ollama/ollama/issues/6160


@pdevine commented on GitHub (Dec 20, 2024):

Going to close this as a dupe of the other issue.


Reference: github-starred/ollama#5097