[GH-ISSUE #12063] ollama not using system efficiently #8011

Closed
opened 2026-04-12 20:13:24 -05:00 by GiteaMirror · 9 comments

Originally created by @sbmilab on GitHub (Aug 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12063

I'm running Ollama on my PC and it mainly uses the CPU rather than the other available resources. My computer is pretty powerful, but Ollama only uses some of the resources: e.g., it only uses the dedicated GPU and no other GPU memory or NPU. Any suggestions?
e.g.: only 14% GPU, almost none of the shared GPU memory, 91% CPU, 0% NPU
Image: https://github.com/user-attachments/assets/69a11d0f-431d-47d3-a0e6-cb9ba413e4ed


@rick-github commented on GitHub (Aug 25, 2025):

Ollama doesn't currently support NPUs. Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show how Ollama is using your system, but the likely reason is that you are using a model that is too big for your system.
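
On a default Windows install, a minimal sketch for pulling those logs (the log location and the OLLAMA_DEBUG switch are as described in the troubleshooting doc linked above; adjust paths if your install differs):

    # PowerShell: show the tail of the Ollama server log (default Windows location)
    Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 200

    # Optional: enable more verbose logging, then quit and restart the Ollama app
    $env:OLLAMA_DEBUG = "1"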


@Istar-Eldritch commented on GitHub (Aug 25, 2025):

I don't use Windows, but either you have some issue with the drivers or the model is offloading to the CPU. The model may technically fit in VRAM, but if you use a large context it may still offload.
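
A quick way to check whether this is happening (a sketch; the model tag and context value are only examples): after a prompt has loaded the model, ollama ps reports the CPU/GPU split in its PROCESSOR column, and the interactive CLI lets you retry with a smaller context.

    # Check how the loaded model is split between CPU and GPU
    ollama ps

    # Retry with a smaller context from the interactive prompt
    ollama run llama2
    # >>> /set parameter num_ctx 4096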


@sbmilab commented on GitHub (Aug 25, 2025):

I have 128 GB of RAM, an RTX 5070 with 12 GB of VRAM, and the shared GPU memory is 72.6 GB. It seems like the resources are not handled efficiently: on the llama2 model (7B, 3.8 GB) it uses the GPU, but not the shared GPU memory. On something like gpt-oss:20b (20B, 14 GB), it doesn't use the GPU (the attached image is the usage profile).


@onestardao commented on GitHub (Aug 25, 2025):

ollama on windows has limited resource scheduling right now, so even when the model is loaded into VRAM, a lot of execution still falls back to CPU. this aligns with our ProblemMap No.14 (startup order / resource binding issue). do you want me to share the checklist? it walks through the fixes step by step.


@sbmilab commented on GitHub (Aug 25, 2025):

@onestardao yes please


@onestardao commented on GitHub (Aug 25, 2025):

hi, sharing the resource here as requested: WFGY ProblemMap / Diagnose Guide (https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md)

this is a semantic firewall approach — no infra change needed.
just check Problem No.8 (“debugging is a black box”) and follow the short checklist inside.

let me know if you hit the same parsing issue after applying the patch.


@rick-github commented on GitHub (Aug 25, 2025):

After you've read the word salad that is WFGY, post your server log. That will allow us to see how ollama is allocating memory. The likely problem is that the model is too big to fit in 12 GB of VRAM and layers are spilling to system RAM, where the CPU does inference. Because the CPU is not as quick as the GPU at matrix operations, more time is spent waiting for the CPU than the GPU, leading to high CPU usage and low GPU usage.
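
A rough back-of-envelope for this machine (approximate; exact overhead depends on context length and quantization): the gpt-oss:20b weights alone are about 14 GB, and the KV cache plus runtime overhead add more on top, so a 12 GB card can hold only part of the model and the rest runs from system RAM on the CPU. The server log states the split explicitly; on Windows something like the following will pull the relevant lines (a sketch, and the exact log wording can vary between Ollama versions):

    # PowerShell: find the layer-offload summary in the server log
    Select-String -Path "$env:LOCALAPPDATA\Ollama\server.log" -Pattern "offload" | Select-Object -Last 5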


@onestardao commented on GitHub (Aug 25, 2025):

thanks rick, fair point. let’s make it concrete.

here’s a minimal triage i’ll run and share logs for, so we can verify whether it’s layer spill vs binding:

  1. capture evidence
  • ollama serve -v while running one prompt
  • nvidia-smi -l 1 for 20s during that prompt (see the sketch at the end of this comment)
  • Modelfile or exact tag + params (num_ctx, gpu_layers, quant)

  2. quick containment test
  • switch to llama3.1:8b-instruct-q4_K_M
  • Modelfile:

    PARAMETER num_ctx 4096
    PARAMETER num_gpu 1
    PARAMETER gpu_layers 28

  • close any app that can grab VRAM, rerun, watch VRAM/util.

  3. if util jumps on the small quant, we confirm size/binding.
    on windows this matches the “startup/binding” failure we see a lot. the fix is to keep layers in VRAM with a fitting quant + gpu_layers, or split embeddings from the main model to avoid competing allocations.

i’ll post the two logs and the Modelfile shortly. if you want any extra fields in the server logs, tell me and i’ll include them.
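
for the nvidia-smi capture in step 1, a slightly more targeted variant that is easy to paste into the issue (a sketch; the queried fields are just a suggestion):

    # sample GPU utilization and memory once per second; Ctrl+C after ~20s
    nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 1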


@rick-github commented on GitHub (Aug 25, 2025):

PARAMETER gpu_layers 28

is not a valid parameter.
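
If the intent was to pin how many layers are offloaded, the Ollama parameter for that is num_gpu (the number of layers to send to the GPU, not the number of GPUs), set in a Modelfile or via request options; a minimal sketch, where the layer count and base model tag are only examples:

    FROM llama3.1:8b-instruct-q4_K_M
    PARAMETER num_ctx 4096
    PARAMETER num_gpu 28

Building it with ollama create and then checking ollama ps afterwards shows whether the requested layers actually landed on the GPU.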

Reference: github-starred/ollama#8011