[GH-ISSUE #10128] Incorrect VRAM estimation #53159

Open
opened 2026-04-29 02:09:15 -05:00 by GiteaMirror · 1 comment

Originally created by @Khawn2u on GitHub (Apr 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10128

What is the issue?

I have a batch file to run Ollama on Windows that looks like the following:

```bat
set OLLAMA_FLASH_ATTENTION=1
set OLLAMA_KV_CACHE_TYPE=q8_0
set CUDA_VISIBLE_DEVICES=1,2
ollama serve
```
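
For reference, here is what that q8_0 setting should buy for the KV cache. This is a back-of-envelope sketch assuming the standard GGML block layouts (f16 stores one 2-byte half-float per element; q8_0 packs 32 int8 values plus a 2-byte fp16 scale into a 34-byte block); it is not read out of Ollama's code:

```python
# Per-element KV-cache storage cost, assuming standard GGML block layouts.
F16_BYTES_PER_ELEMENT = 2.0        # one half-float per element
Q8_0_BYTES_PER_ELEMENT = 34 / 32   # 32 int8 values + 2-byte fp16 scale per block

ratio = Q8_0_BYTES_PER_ELEMENT / F16_BYTES_PER_ELEMENT
print(f"q8_0 cache is {ratio:.1%} the size of an f16 cache")  # ~53.1%
```

So a correctly accounted q8_0 cache should come in at roughly half the size of the default f16 cache.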

But when I load the model "Qwen2.5 7b 1M" with a 262144-token context window, Ollama uses only about 10 GB out of my 48 GB total VRAM (roughly 5 GB on each of my two 24 GB Nvidia P40s).

Here is a more visual explanation:

Tesla P40 (GPU 1): 6097 MiB / 24576 MiB
Tesla P40 (GPU 2): 5419 MiB / 24576 MiB

But ollama reports:

```shell
hf.co/bartowski/Qwen2.5-7B-Instruct-1M-GGUF:Q4_K_M    311926e8160a    52 GB    4%/96% CPU/GPU    4 minutes from now
```

This is really bad, as the model could fit on even one of my GPUs. I assume the VRAM estimate does not account for the quantized KV cache; if so, this should be an easy fix.
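
To put a number on that assumption, here is a rough estimate of the KV-cache size for this model at this context length, reusing the per-element sizes from the sketch above. The model shape (28 layers, 4 KV heads under GQA, head dim 128) is my reading of the published Qwen2.5-7B config, not something taken from Ollama's estimator:

```python
# Rough KV-cache size for Qwen2.5-7B at a 262144-token context.
# Assumed model shape (from the published Qwen2.5-7B config): 28 layers,
# 4 KV heads (grouped-query attention), head dimension 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128
CTX = 262144
GIB = 1024 ** 3

def kv_cache_bytes(bytes_per_element: float) -> float:
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CTX * bytes_per_element

print(f"f16 : {kv_cache_bytes(2.0) / GIB:.1f} GiB")      # ~14.0 GiB
print(f"q8_0: {kv_cache_bytes(34 / 32) / GIB:.1f} GiB")  # ~7.4 GiB
```

Even the unquantized f16 figure is nowhere near the 52 GB estimate in the log (Q4_K_M weights for a 7B model are under 5 GB), so whatever the estimator is budgeting for the cache and compute buffers, it does not appear to be using the q8_0 element size.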

Relevant log output

```shell
hf.co/bartowski/Qwen2.5-7B-Instruct-1M-GGUF:Q4_K_M    311926e8160a    52 GB    4%/96% CPU/GPU    4 minutes from now
```

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.6.3

GiteaMirror added the bug label 2026-04-29 02:09:15 -05:00

@chigkim commented on GitHub (Apr 5, 2025):

If you search, you'll find a lot of issues regarding this, unfortunately.


Reference: github-starred/ollama#53159