[GH-ISSUE #12138] Qwen3 Fails to Offload on RTX 4080 (12GB VRAM) Due to Incorrect GPU Memory Detection #54580

Open
opened 2026-04-29 06:25:28 -05:00 by GiteaMirror · 4 comments

Originally created by @sunhy0316 on GitHub (Sep 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12138

What is the issue?

When running Qwen3 models sequentially (first Qwen3-14B, then Qwen3-30B) on an RTX 4080 notebook with 12GB VRAM, the model does not offload to system RAM as expected. The log reports an abnormally large amount of available GPU memory, which appears to prevent the offloading mechanism from activating.
I am using the new Ollama engine and the new memory estimates.

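Image: https://github.com/user-attachments/assets/1de4ffa6-4fb6-47ec-8ae4-9f6858a059c2

Image: https://github.com/user-attachments/assets/c7e960de-472a-4d48-ba90-8656a99b33f3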

Relevant log output


OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.11.6

GiteaMirror added the bug, needs more info labels 2026-04-29 06:25:29 -05:00

@dngettler commented on GitHub (Sep 1, 2025):

I would strongly recommend pasting the errors here as text rather than as a photo taken with your phone, because that way someone can paste them into an AI tool, or a bot can pick them up. I don't think anyone is going to try to read that photo. As much as we'd love to help, I can't tell what it says.


@jessegross commented on GitHub (Sep 2, 2025):

It looks like you set num_gpu=100, forcing all layers onto the GPU regardless of the estimate. Leave it unset and it should move some layers to the CPU. The high available memory looks like an underflow in the debug logging but should not affect the layout.

And, yes, please post text logs in the future.
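
To illustrate the kind of underflow mentioned above, here is a minimal sketch of how an unsigned subtraction can wrap around into an implausibly large "available memory" figure. The helper name and the byte values are hypothetical and are not taken from Ollama's code or from the reporter's log; the sketch only shows the general mechanism.

```python
# Illustrative only: how a 64-bit unsigned subtraction wraps on underflow.
# The helper name and the values below are hypothetical, not Ollama's code.
UINT64_MOD = 2**64
GIB = 1024**3


def unsigned_sub(a: int, b: int) -> int:
    """Subtract b from a as a 64-bit unsigned integer would (wraps below zero)."""
    return (a - b) % UINT64_MOD


free_vram = 11 * GIB   # hypothetical free VRAM reported by the driver
reserved = 12 * GIB    # hypothetical amount subtracted before logging

available = unsigned_sub(free_vram, reserved)
print(f"logged as available: {available / GIB:.0f} GiB")
```

A value like this is only a presentation problem if, as the comment says, the layout decision is made from other numbers.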


@sunhy0316 commented on GitHub (Sep 2, 2025):

Apologies for the image-based log. My computer is in an isolated environment, which makes exporting text logs complicated. I will provide the full text logs as soon as I get the opportunity.

My main question is this: Given that the log's reported "available memory" value is just a display bug (as mentioned in the recent fix), why did the offloading mechanism still fail to work in practice?

To clarify the specific behavior: When I start with Qwen3-30B and then run Qwen3-14B, the system correctly unloads the 30B model, and ollama ps only shows the 14B model running. However, when I start with Qwen3-14B first and then run Qwen3-30B, the 14B model is not unloaded. In this case, ollama ps shows both models active, each reported as using ~11GB of VRAM at 100% GPU. The parameter num_gpu=100 is set for both models.

I will provide the complete logs for further investigation, but it will take some time. Thank you.


@jessegross commented on GitHub (Sep 2, 2025):

num_gpu tells it to forcibly load as many layers as you specify onto the GPU, in this case all of them, since 100 is greater than the number of layers in the model. You are overriding the memory estimates and Ollama is doing what you told it to do. Don't set num_gpu if you want it to offload to CPU.
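
The behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not Ollama's actual scheduler: `gpu_layers` is a hypothetical helper, and the layer counts are made-up numbers. The `options.num_gpu` field in the request body does follow the Ollama REST API; the model tag is just an example.

```python
from typing import Optional


def gpu_layers(total_layers: int, estimated_fit: int,
               num_gpu: Optional[int] = None) -> int:
    """Hypothetical model of the layer split described in the comment.

    With num_gpu unset, the memory estimate decides how many layers go to
    the GPU. With num_gpu set, the estimate is ignored and the requested
    count is used, capped at the number of layers the model has.
    """
    if num_gpu is None:
        return min(estimated_fit, total_layers)
    return min(num_gpu, total_layers)


def build_request(model: str, prompt: str,
                  num_gpu: Optional[int] = None) -> dict:
    """Build a generate request body, omitting num_gpu unless it is given."""
    options = {}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu
    return {"model": model, "prompt": prompt, "options": options}


# Made-up numbers: a 48-layer model of which only 30 layers fit in VRAM.
forced = gpu_layers(total_layers=48, estimated_fit=30, num_gpu=100)
automatic = gpu_layers(total_layers=48, estimated_fit=30)
print(forced, automatic)

# Leaving num_gpu out of the options lets the estimate decide the split.
print(build_request("qwen3:30b", "hello"))
```

In this model, setting num_gpu=100 places every layer on the GPU regardless of the estimate, while leaving it unset keeps the remaining layers on the CPU.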


Reference: github-starred/ollama#54580