[GH-ISSUE #8291] disable cpu offload for running llm #5306

Closed
opened 2026-04-12 16:29:47 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @verigle on GitHub (Jan 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8291

The model is automatically offloaded to the CPU even though more than one GPU is free, so I want to be able to disable CPU offload for LLM inference.

> 94%/6% CPU/GPU

GiteaMirror added the feature request label 2026-04-12 16:29:47 -05:00
Author
Owner

@rick-github commented on GitHub (Jan 3, 2025):

ollama should always use as much of the GPU as it can. It offloads to CPU when it doesn't think there are enough resources to host the model. In this case, 6% of the model is running on GPU, which means ollama thinks there aren't enough resources to load more than that onto the GPU. If you think this is incorrect, [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging. As it stands, if CPU offload were disabled, you would not be able to run this model at all.

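For readers who still want to try forcing a full GPU load rather than accepting the automatic split: the per-request `num_gpu` option tells ollama how many layers to place on GPU, and setting it high effectively disables CPU offload, at the risk of the load failing outright if VRAM really is insufficient. The following is a minimal sketch against the local REST API; the model name, layer count, and port are placeholder assumptions, not values from this issue.

```python
import requests

# Minimal sketch: ask ollama to place up to 99 layers on GPU for this request.
# If the model genuinely does not fit in VRAM, expect the load to fail (or
# OOM) instead of silently falling back to a CPU-heavy split.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",           # placeholder model name
        "prompt": "Hello",
        "stream": False,
        "options": {"num_gpu": 99},  # layers to offload to GPU
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```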
Author
Owner

@Chlorek commented on GitHub (Jan 3, 2025):

There is some bug here, as it often splits the memory wrong for me as well. I am pulling my hair out because of this. I have verified with nvidia-smi that the GPUs have free memory, and ollama also reports the GPU memory as available at startup. Yet when loading a model that needs to be split between two GPUs, I often see most of it offloaded to CPU, and of course in such cases the speed becomes terrible.

Relevant logs from my last 20 attempts today:

name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="21.7 GiB"
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloaded 60/81 layers to GPU
llm_load_tensors:        CPU buffer size = 38110.61 MiB
llm_load_tensors:      CUDA0 buffer size = 14230.95 MiB
llm_load_tensors:      CUDA1 buffer size = 13312.82 MiB

What is strange to me is that I can see
`llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)`
so why does the total buffer size end up so much bigger? Am I misunderstanding something here?

// Edit
I found out that one of the connecting applications had been setting a context length of 32k; once I reduced it to 8k, it works as expected. An increase in memory consumption is expected, but this ends up far outside my calculations. While I still need to gather evidence for this, I am convinced it used to work well on the same hardware with the same context size.

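The edit above matches how context length drives memory use: the KV cache grows with `num_ctx`, so a client that silently requests a 32k context can push a model off the GPU even though its weights alone would fit. A rough sketch of pinning the context per request is below; the model name and numbers are illustrative assumptions, not measurements from this hardware.

```python
import requests

def generate(prompt: str, num_ctx: int = 8192) -> str:
    """Call the local ollama server with an explicit context length.

    The KV cache scales with num_ctx, so an oversized context (e.g. 32k)
    can be what forces layers off the GPU.
    """
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",             # placeholder model name
            "prompt": prompt,
            "stream": False,
            "options": {"num_ctx": num_ctx},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("Why is the sky blue?", num_ctx=8192))
```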
Author
Owner

@rick-github commented on GitHub (Jan 3, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).

Author
Owner

@ViolentProphet commented on GitHub (Feb 28, 2025):

Ollama does NOT always use full GPU capability on Windows.
It often defaults to CPU after updates and must be re-run.
It also sometimes gets stuck offloading to CPU.

Author
Owner

@rick-github commented on GitHub (Feb 28, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).

Author
Owner

@tjwebb commented on GitHub (Apr 24, 2025):

Ollama is lying to me.

For smaller models such as gemma3:4b and gemma3:12b, ollama reports using 100% GPU on an L40 (48 GB VRAM) and loads them into GPU RAM, but it is clearly running the inference on the CPU.

top reports all my CPU cores in use and nvtop reports 0% GPU. Because of this, inference on smaller models is often MUCH SLOWER than on larger models such as gemma3:27b.

(Interestingly, I do not have this problem with `granite3.3:8b`.)

Is there some setting in the Modelfile that affects this behavior?

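One way to sanity-check where a loaded model actually sits, rather than trusting a single percentage, is the server's `/api/ps` endpoint, which reports each loaded model's total size and the portion resident in VRAM (roughly what `ollama ps` prints). A small sketch follows; it assumes the default port and the `size`/`size_vram` fields of the ps response.

```python
import requests

# Query the ollama server for currently loaded models and report how much
# of each one is resident in VRAM versus system RAM.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    total = m["size"]              # bytes, whole model footprint
    vram = m.get("size_vram", 0)   # bytes resident on GPU
    pct = 100 * vram / total if total else 0
    print(f"{m['name']}: {pct:.0f}% in VRAM ({vram} / {total} bytes)")
```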
Author
Owner

@rick-github commented on GitHub (Apr 24, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).

Reference: github-starred/ollama#5306