[GH-ISSUE #2794] Windows version does not fully utilize GPU #63726

Closed
opened 2026-05-03 14:47:30 -05:00 by GiteaMirror · 8 comments

Originally created by @Kaspur2012 on GitHub (Feb 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2794

Originally assigned to: @mxyng on GitHub.

Hello,

Windows preview version
Model used: mistral:7b-instruct-v0.2-q8_0
GPU: RTX 2070 Super, 8 GB

Issue:
I recently switched from LM Studio to Ollama and noticed that my GPU never gets above 50% usage while my CPU is always over 50%.

I've used the same model in LM Studio without any problems, with GPU usage mostly above 90%. With Ollama, the model seems to load fully into VRAM, taking up almost all of it just like LM Studio does, but GPU usage never goes above 50% while the rest of the work seems to land on the CPU, as if some layers are being offloaded there. I know LM Studio has a setting for how much to offload to the GPU, which I set to max, but I have no idea where that is in Ollama.

I saw a similar post about this, and the final conclusion there was that antivirus software had quarantined parts of Ollama. I disabled my AV completely and still got the same result. I also added the Ollama directory to the AV's exception list.

Thanks

GiteaMirror added the bug, windows labels 2026-05-03 14:47:31 -05:00

@easp commented on GitHub (Feb 28, 2024):

Check the Ollama logs to see how many layers it's offloading to GPU.
What are the GPU buffer allocations it reports?
https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues

Finally, is your GPU card dedicated to inference, or is it also being used for display?
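
On Windows, the server log referenced in that guide typically lives at %LOCALAPPDATA%\Ollama\server.log. A minimal sketch for pulling the offload and buffer lines out of it, assuming that default location:

```python
# Sketch: scan the Ollama server log for layer-offload and buffer-size lines.
# Assumes the default Windows log location from the troubleshooting guide.
import os

log_path = os.path.expandvars(r"%LOCALAPPDATA%\Ollama\server.log")

with open(log_path, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "offload" in line or "buffer size" in line:
            print(line.rstrip())
```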


@Kaspur2012 commented on GitHub (Feb 28, 2024):

Found these in the log:

llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloaded 26/33 layers to GPU
llm_load_tensors: CPU buffer size = 7338.64 MiB
llm_load_tensors: CUDA0 buffer size = 5746.81 MiB

Looks like it's offloading 26/33 layers to the GPU and the rest to the CPU. I guess that's why the GPU isn't running at full speed: the CPU side is the bottleneck. This GPU is used for display as well, though I have no idea why I can offload everything to the GPU in LM Studio and get almost 100% GPU utilization.

At least the log clarifies why the GPU is only partially utilized. I'll try loading a smaller model and see if it fully utilizes the GPU.

What's still odd is that when loading this model, my GPU VRAM fills up by the same amount as when LM Studio loads the same model, yet LM Studio doesn't offload any layers to the CPU.
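
As a rough sanity check on those numbers (a back-of-envelope sketch, assuming layers are roughly uniform in size):

```python
# Back-of-envelope from the log above: why 26/33 layers fit but all 33 might not.
cuda_buffer_mib = 5746.81   # CUDA0 buffer size for the 26 offloaded layers
layers_on_gpu = 26
total_layers = 33

per_layer_mib = cuda_buffer_mib / layers_on_gpu    # ~221 MiB per layer
full_offload_mib = per_layer_mib * total_layers    # ~7294 MiB for all 33
print(f"~{per_layer_mib:.0f} MiB/layer, ~{full_offload_mib:.0f} MiB for a full offload")
# On an 8 GB card that is also driving the display, adding KV cache and CUDA
# overhead on top of ~7.3 GB of weights can plausibly push past what's free.
```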


@dhiltgen commented on GitHub (Mar 11, 2024):

@Kaspur2012 how many layers are you able to successfully load under LM Studio?


@dhiltgen commented on GitHub (Mar 27, 2024):

Most likely this is the result of our memory prediction logic being a bit conservative right now: it leaves some VRAM unallocated, so fewer layers are loaded. We're working on improvements that should make fuller use of VRAM and load more layers; they should be available in a week or two.
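
(As a simplified illustration of that kind of prediction logic, not Ollama's actual code, the estimate boils down to holding back a safety reserve and fitting whole layers into what's left:)

```python
# Simplified sketch of conservative layer-offload estimation (illustrative only,
# not Ollama's implementation).
def layers_that_fit(free_vram_mib: float, per_layer_mib: float,
                    reserve_mib: float, total_layers: int) -> int:
    # Hold back a reserve for KV cache, compute buffers, display use, etc.,
    # then fit as many whole layers as the remainder allows.
    usable = free_vram_mib - reserve_mib
    return max(0, min(total_layers, int(usable // per_layer_mib)))

# Hypothetical numbers in the ballpark of this issue: a conservative reserve
# lands well short of all 33 layers.
print(layers_that_fit(free_vram_mib=7200, per_layer_mib=221,
                      reserve_mib=1500, total_layers=33))  # -> 25
```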


@steak3 commented on GitHub (Apr 9, 2024):

> Most likely this is the result of our memory prediction logic being a bit conservative right now: it leaves some VRAM unallocated, so fewer layers are loaded. We're working on improvements that should make fuller use of VRAM and load more layers; they should be available in a week or two.

I'm really looking forward to this. In addition, wouldn't it be a good idea to allow users to manually set the number of layers? That could be a quicker first step while the prediction improves.

[EDIT] Just found the OLLAMA_MAX_VRAM parameter. Is that the answer? https://github.com/ollama/ollama/issues/835
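
(For anyone wanting to experiment with it, a sketch of launching the server with that variable set, assuming, per the linked issue, that it takes a byte count:)

```python
# Sketch: start `ollama serve` with OLLAMA_MAX_VRAM set.
# Assumption from the linked issue #835: the value is a byte count.
import os
import subprocess

env = dict(os.environ)
env["OLLAMA_MAX_VRAM"] = str(8 * 1024**3)  # 8 GiB

subprocess.run(["ollama", "serve"], env=env)
```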


@Vitorbnc commented on GitHub (Apr 22, 2024):

@steak3 I had the same problem; setting that to exactly my GPU's VRAM size did it for me. It won't show 100% usage, but it gives the best performance. If you set it higher than that, the GPU will run at 100% (I tested setting 12 GB with 6 GB of VRAM), but performance got worse.


@dhiltgen commented on GitHub (May 2, 2024):

Please give 0.1.33 a try. We've continued to improve our prediction logic to try to maximize VRAM usage without OOMs.

@steak3 you can also specify `num_gpu` in the request to force a specific number of layers, but our goal is to get this right automatically for you.
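
For reference, a minimal sketch of such a request against the local API (`num_gpu` is a standard model option; the model name and layer count are taken from this thread):

```python
# Sketch: request all 33 layers on the GPU via the num_gpu option.
import json
import urllib.request

payload = {
    "model": "mistral:7b-instruct-v0.2-q8_0",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"num_gpu": 33},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```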


@jmorganca commented on GitHub (May 10, 2024):

Hi @Kaspur2012 thanks for the issue and sorry for the hiccups. 0.1.33 and 0.1.34 should have improvements here – let us know if it's still happening!


Reference: github-starred/ollama#63726