[GH-ISSUE #6914] Work done by CPU instead of GPU #50885

Closed
opened 2026-04-28 17:19:55 -05:00 by GiteaMirror · 7 comments

Originally created by @Iliceth on GitHub (Sep 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6914

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I'm aware this might not be a bug, but I'm trying to understand and figure out if I can and/or should change something. When I try Reflection 70b, I see the following:

CPU: 8 cores 100% utilization
RAM: 23 of 32 GB in use
GPU: on average 5% utilization
VRAM: 23 of 24 GB in use

I assume the model does not fit in VRAM and is therefore spread across VRAM and RAM; I'm fine with that. But the fact that the CPU seems to be doing almost all of the lifting seems odd to me, although it might be normal.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.11

GiteaMirror added the question label 2026-04-28 17:19:55 -05:00

@rick-github commented on GitHub (Sep 23, 2024):

The CPU is much slower at inference than the GPU, so more time is spent in computation on the CPU than on the GPU.


@Iliceth commented on GitHub (Sep 23, 2024):

That I get, but can I force it to only use the GPU, or does it have to do part of the inference on the CPU because system RAM is used in addition to the GPU's VRAM?


@rick-github commented on GitHub (Sep 23, 2024):

Yes, it's because part of the model resides in system RAM. While it's technically possible for the GPU to access system RAM to do inference, it is still more efficient to have the CPU do inference on the weights in system RAM and have the GPU do inference on the weights in VRAM. If you want to experiment, you can set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` in the server environment and then use `num_gpu` in an API call or `/set parameter num_gpu` in the ollama CLI to override the number of layers that ollama offloads to the GPU. I found that on my system (linux/4070) having the GPU do all inference across RAM and VRAM was extremely slow compared to the hybrid approach.
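
For anyone wanting to try that, here is a minimal sketch of the API side, assuming a default Ollama server on localhost:11434 and Python's `requests` library. The model name and layer count below are placeholders, and the sketch assumes the server was started with `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` as described above.

```python
# Sketch only: ask the server to put (up to) all layers on the GPU via num_gpu.
# Assumes the ollama server was launched with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
# so CUDA can spill past physical VRAM instead of refusing to allocate.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "reflection:70b",    # placeholder model name
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_gpu": 999},  # override the number of offloaded layers
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```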


@dhiltgen commented on GitHub (Sep 25, 2024):

@Iliceth if you have further questions, let us know. In short, running a smaller model that fits 100% in your GPU will yield the best performance.
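
As a rough way to check how much of a loaded model actually fits in VRAM, a sketch like the following against the server's `/api/ps` endpoint can help (assuming a default local server; `size` and `size_vram` are the fields reported for each loaded model):

```python
# Sketch only: report what fraction of each loaded model sits in GPU memory.
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()
for m in resp.json().get("models", []):
    size = m.get("size", 0)          # total bytes used by the loaded model
    in_vram = m.get("size_vram", 0)  # bytes resident in VRAM
    pct = 100 * in_vram / size if size else 0
    print(f"{m['name']}: {pct:.0f}% of {size / 1e9:.1f} GB in VRAM")
```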


@sarjil77 commented on GitHub (Sep 25, 2024):

I have an NVIDIA RTX A6000, but when I run my Ollama model and check with the `ps` command, it shows that it is running at 100% CPU; in fact, there is no process running on the system. I had this issue before as well, and at that time uninstalling and reinstalling fixed it, but now I am facing the same issue again and am not able to solve it.

please respond ASAP

Thank you in advance


@rick-github commented on GitHub (Sep 25, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.


@Iliceth commented on GitHub (Sep 25, 2024):

@rick-github @dhiltgen I will try out your suggestion, thanks. I am very happy with how ollama performs with smaller models; it's just tempting to play with the big ones... :-p

Reference: github-starred/ollama#50885