[GH-ISSUE #10477] mistral3.1 and qwen3 have the same memory reporting problem #6891

Closed
opened 2026-04-12 18:45:45 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @Fade78 on GitHub (Apr 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10477

Originally assigned to: @jessegross on GitHub.

What is the issue?

In Ollama, qwen3:30b claims to use 100% of VRAM while it actually uses about 80%. If I compensate by increasing the context size, the model is split across GPU/CPU when it could have stayed fully on the GPU. The same happens with the 32b variant and, if memory serves, Mistral-Small3.1.

(edit)
Another data point:

Qwen3:14b Q4KM: Ollama reports 100% VRAM usage with a 64k context while only 75% is actually used. Therefore, I could in fact fit a larger context.

For comparison, Gemma3:27b Q4KM can be run in the same amount of VRAM with an 80K context.
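(For anyone probing the same thing, here is a minimal sketch, not part of the original report, of how one might request a specific context size through the Ollama REST API's `num_ctx` option and see whether the model still fits on the GPU. The model tag, context value, and localhost endpoint are illustrative assumptions.)

```python
# Minimal sketch: ask Ollama to load a model with an explicit context size
# via the documented `num_ctx` option. Model tag, context size, and endpoint
# are illustrative assumptions.
import json
import urllib.request

payload = {
    "model": "qwen3:14b",           # assumed model tag
    "prompt": "hello",
    "stream": False,
    "options": {"num_ctx": 65536},  # request a 64k context window
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"][:200])
```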

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 18:45:45 -05:00
Author
Owner

@jessegross commented on GitHub (Apr 29, 2025):

When Ollama reports `100% GPU` under PROCESSOR in `ollama ps`, that means the model is completely loaded on the GPU, not that it is using 100% of the VRAM. Ollama deliberately does not use 100% of VRAM as doing so tends to cause stuttering and other problems, varying based on the GPU and OS.
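(Editorial sketch, not from the original comment: one way to see this distinction in practice is to compare the offload split Ollama reports with the VRAM the driver actually says is in use. The snippet assumes an NVIDIA GPU with `nvidia-smi` available and Ollama running locally.)

```python
# Sketch: contrast Ollama's "100% GPU" (fully offloaded) with actual VRAM use.
# Assumes an NVIDIA GPU (nvidia-smi on PATH) and a local Ollama install.
import subprocess

# What Ollama reports: the PROCESSOR column is the CPU/GPU split of the model,
# not the fraction of total VRAM consumed.
print(subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout)

# What the driver reports: memory actually allocated on the GPU (first GPU only).
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True,
).stdout
used, total = (int(x) for x in out.splitlines()[0].split(","))
print(f"VRAM in use: {used} / {total} MiB ({100 * used / total:.0f}%)")
```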

Different models have various features that might cause them to require different amounts of VRAM for context. Gemma3 uses sliding window attention, which is more efficient for longer contexts.
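(A rough illustration of why sliding window attention changes the picture, added here as a back-of-envelope sketch with made-up layer counts and head sizes rather than the real Gemma3/Qwen3 configurations: a full-attention layer's KV cache grows with the whole context, while a sliding-window layer only has to cache the window.)

```python
# Back-of-envelope KV-cache sizing: full attention vs. sliding-window attention.
# Every architecture number below is an illustrative assumption, not a real
# Gemma3/Qwen3 configuration.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, window=None, bytes_per=2):
    # Each layer caches K and V: 2 * kv_heads * head_dim * tokens * element size.
    tokens = ctx if window is None else min(ctx, window)
    return n_layers * 2 * n_kv_heads * head_dim * tokens * bytes_per / 2**30

ctx = 64_000
full = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, ctx=ctx)
swa = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, ctx=ctx, window=4096)
print(f"full attention at {ctx} tokens:      ~{full:.1f} GiB of KV cache")
print(f"sliding window (4k) at {ctx} tokens: ~{swa:.1f} GiB of KV cache")
```

Models that use sliding window attention typically interleave windowed and global layers, so the real saving is smaller than this sketch suggests, but the direction is the same.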

Everything looks as expected here, so I'm going to go ahead and close this.

Author
Owner

@Fade78 commented on GitHub (May 2, 2025):

All other, older models use 100% of VRAM until I set a little too much context. In every case this correlates with 100% VRAM used on the GPUs.

Ollama loads those new models as if only 80% of VRAM were available, yet declares that 100% of the model is loaded in VRAM. Now if I increase the context a little, it offloads part of the model to the CPU. You could argue that the unused 20% of VRAM is reserved for some new way of using VRAM at runtime, but at runtime this VRAM is never used, even when the model is 100% GPU loaded.

Please try to reproduce.
