[GH-ISSUE #11190] Weight x2 after 0.7.2 -> 0.9.1 #33134

Closed
opened 2026-04-22 15:28:35 -05:00 by GiteaMirror · 4 comments

Originally created by @acp1664 on GitHub (Jun 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11190

Hi everyone!

Thank you for your great work.

I’ve been using Ollama with Open WebUI consistently. I always run LLMs that fit under 7GB of VRAM, and everything worked smoothly until I upgraded to Ollama 0.9.

However, I noticed a significant performance drop after the update. Here’s an example:

With Qwen3:8b:

  • On Ollama 0.7, the model weight was below 7GB VRAM, and the GPU could handle it fully. I achieved 20 tokens/second.
  • On Ollama 0.9, the model weight exceeds 16GB, which causes partial loading (55% on CPU, 45% on GPU; see the sketch after this list for how to confirm the split). This results in incredibly slow performance of only 2 tokens/second.
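
For anyone reproducing this, the CPU/GPU split that ollama actually chose can be confirmed with `ollama ps`; the PROCESSOR column reads `100% GPU` when the model fits entirely in VRAM and something like `55%/45% CPU/GPU` when layers spill to system RAM:

```
# List loaded models; the PROCESSOR column shows the CPU/GPU split
# (e.g. "55%/45% CPU/GPU" when layers have spilled to system RAM)
ollama ps
```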

Is this a known issue? It’s very frustrating because I’m unable to use Ollama effectively after the update.

In contrast, LM Studio handles large context and attached files seamlessly, with no performance issues (also achieving 20 tokens/second).

My system specs:

  • OS: Windows 11
  • RAM: 64GB
  • GPU: Nvidia GTX 1080 (8GB VRAM)

![Image](https://github.com/user-attachments/assets/802f6b88-39b7-4067-8010-b615184dbd0a)
![Image](https://github.com/user-attachments/assets/2d3ef3d0-0df9-4ec7-9eb8-bc59053e2479)
![Image](https://github.com/user-attachments/assets/076869aa-190a-4ed6-b8a9-e6b18988b773)

Thank you in advance for your help!

GiteaMirror added the feature request and needs more info labels 2026-04-22 15:28:35 -05:00

@rick-github commented on GitHub (Jun 25, 2025):

The ollama memory estimation has undergone some work to reduce the possibility of OOMing the runner. As a result, the memory estimations are now a little conservative, and so ollama ends up offloading more layers to the CPU than it needs to. There is ongoing work in #11090 to fix the memory management, so this problem may be resolved in the next few releases. In the meantime, you can set `num_gpu` as described [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) to force ollama to load more layers on the GPU.
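
As a minimal sketch of that workaround (`99` here is just a stand-in for "as many layers as the model has"; pick a value that actually fits your VRAM, or the runner may OOM):

```
# Set num_gpu interactively and optionally persist it as a new alias
ollama run qwen3:8b
>>> /set parameter num_gpu 99
>>> /save qwen3-gpu

# Equivalent Modelfile approach:
#   FROM qwen3:8b
#   PARAMETER num_gpu 99
# then: ollama create qwen3-gpu -f Modelfile
```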


@rick-github commented on GitHub (Jun 25, 2025):

Since you are using open-webui, you should be able to use the chat controls in the upper right hand side to set `num_gpu`.
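
Under the hood those chat controls just pass the parameter through as an ollama request option, so the equivalent raw API call (a sketch; `99` again standing in for "all layers") would be:

```
# Ask ollama directly, forcing the layer count via request options
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "hello",
  "options": { "num_gpu": 99 }
}'
```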


@jessegross commented on GitHub (Jun 25, 2025):

This doesn't look like a memory estimation issue to me. This is what I get on 0.9.1:

```
NAME        ID              SIZE      PROCESSOR    UNTIL
qwen3:8b    500a1f067a9f    7.6 GB    100% GPU     4 minutes from now
```

It's actually slightly higher on 0.7.0:

```
NAME        ID              SIZE      PROCESSOR    UNTIL
qwen3:8b    500a1f067a9f    8.0 GB    100% GPU     4 minutes from now
```

And using the actual allocations from #11090:

```
NAME        ID              SIZE      PROCESSOR    UNTIL
qwen3:8b    500a1f067a9f    7.0 GB    100% GPU     4 minutes from now
```

The last one is the source of truth, so the estimate from 0.9.1 is a little high but not drastically so. My guess is that there is a difference in context size, `num_parallel`, or model being used across versions.
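
One way to rule those out is to pin the variables that affect the estimate before comparing versions (a sketch; `OLLAMA_CONTEXT_LENGTH` and `OLLAMA_NUM_PARALLEL` are the server-side knobs on recent releases, and the values here are only examples):

```
# PowerShell: fix context size and parallelism, then restart the server
$env:OLLAMA_CONTEXT_LENGTH = "4096"
$env:OLLAMA_NUM_PARALLEL = "1"
ollama serve

# In another terminal, compare the reported SIZE across versions
ollama ps
```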


@rick-github commented on GitHub (Jun 25, 2025):

In that case, [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
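
On Windows the server log lives under `%LOCALAPPDATA%\Ollama` per that troubleshooting doc; for example, to grab the tail of it in PowerShell:

```
# Show the last 100 lines of the ollama server log (Windows)
Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 100
```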
