[GH-ISSUE #10327] ollama ps reports wrong values (depending on num_batch?) #68838

Open
opened 2026-05-04 15:23:04 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @ALLMI78 on GitHub (Apr 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10327

What is the issue?

full story: https://github.com/ollama/ollama/issues/10323

I think the problem is that `ollama ps` reports wrong values depending on num_batch.

nvidia-smi and the Win10 Task Manager show the correct values; there is no CPU/GPU splitting...?

![Image](https://github.com/user-attachments/assets/b840419e-3f53-4aad-b272-114f7be6302b)

OS

W10

Ollama version

0.6.5
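To make the comparison reproducible, here is a minimal sketch (assuming a local Ollama server on the default port 11434; the model name is a placeholder, substitute your own) that loads the model with different `num_batch` values via the API options and prints ollama's own estimate from `/api/ps` next to what nvidia-smi actually reports:

```python
import json
import subprocess
import urllib.request

OLLAMA = "http://localhost:11434"   # default local Ollama endpoint (assumption)
MODEL = "llama3.1:8b"               # placeholder model name, substitute your own

def load(num_batch: int) -> None:
    """Load the model with a given num_batch by sending a short prompt."""
    body = json.dumps({
        "model": MODEL,
        "prompt": "hi",
        "stream": False,
        "options": {"num_batch": num_batch},
    }).encode()
    req = urllib.request.Request(f"{OLLAMA}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

def report() -> None:
    """Print ollama's estimate (/api/ps) next to the GPU's measured usage."""
    with urllib.request.urlopen(f"{OLLAMA}/api/ps") as resp:
        for m in json.load(resp).get("models", []):
            size, vram = m["size"], m["size_vram"]   # bytes, per ollama's estimate
            print(f"{m['name']}: estimate {size / 2**30:.2f} GiB total, "
                  f"{vram / 2**30:.2f} GiB in VRAM")
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True).stdout.strip()
    print(f"nvidia-smi: {used} MiB actually allocated")

for nb in (128, 512, 2048):
    print(f"--- num_batch={nb} ---")
    load(nb)
    report()
```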

GiteaMirror added the bug label 2026-05-04 15:23:04 -05:00
Author
Owner

@rick-github commented on GitHub (Apr 17, 2025):

Ollama does an estimation of required resources, but it's up to the backend to actually allocate the memory. If the backend is more efficient, e.g. using flash attention or lazy allocation, the values reported by the GPU will not match the estimation.
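For context, the CPU/GPU split that `ollama ps` prints appears to be derived from those estimates (the `size` and `size_vram` fields returned by `/api/ps`), not from anything the driver reports. A rough sketch of that derivation, assuming a local server on the default port (the rounding is an approximation of the CLI's output, not taken from its source):

```python
import json
import urllib.request

# Running models as reported by the local Ollama server (default port assumed).
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    models = json.load(resp).get("models", [])

for m in models:
    size, size_vram = m["size"], m["size_vram"]   # bytes, ollama's own estimates
    if size_vram == 0:
        split = "100% CPU"
    elif size_vram >= size:
        split = "100% GPU"
    else:
        cpu = round(100 * (size - size_vram) / size)
        split = f"{cpu}%/{100 - cpu}% CPU/GPU"    # roughly what `ollama ps` shows
    print(m["name"], split)
```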

Author
Owner

@ALLMI78 commented on GitHub (Apr 17, 2025):

Ahh great, so I guess I'm chasing ghosts right now ;) And the extreme deviations in my "full story" can be explained by that too?

But that would unfortunately mean the values from `ollama ps` are pretty much useless?

Trying to optimize anything based on them doesn't really work then...?

Dear Rick, since I really appreciate your explanations and love learning from you — if you feel like it (and have the time), you're very welcome to take a look at the full story and share your thoughts. I was especially surprised by the impact of num_batch on prompt processing.

Close as completed?

Author
Owner

@rick-github commented on GitHub (Apr 17, 2025):

I saw #10323 but haven't had a chance to look at it yet; if I get some time this weekend I'll add some comments. The memory estimation is (in my opinion) a part of ollama that needs some attention, because one of the selling points is that ollama takes care of figuring out how many layers can be offloaded. If users have to tune that by setting `num_gpu` then we might as well go back to using llama.cpp. It worked in the early days, but with [flash attention](https://github.com/ollama/ollama/issues/6160), [sliding windows](https://github.com/ollama/ollama/pull/9987), other optimisations and the new [go runner](https://github.com/ollama/ollama/issues/10040), it frequently happens that ollama under-estimates how much VRAM to use, resulting in layer spilling and slower performance. One big hammer to work around this is to [set `num_gpu`](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) to more than the model layer count and force the runner to load everything into GPU, but that can lead to OOMs or to using shared memory, which can have a significant [performance impact](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). It can also lead to [corner cases](https://github.com/ollama/ollama/issues/10167#issuecomment-2784789807) where users are confused as to why they set `num_gpu` but the model still only runs on CPU.

That's not to say that your investigation is chasing ghosts; it could be that you have found a real issue. I will have a look at the ticket and see if there's anything unusual.
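To illustrate the `num_gpu` workaround described above, here is a minimal sketch (placeholder model name; local server on the default port assumed) that asks the runner for full offload by setting `num_gpu` above the model's layer count in the request options. As noted, this can OOM or spill into shared memory:

```python
import json
import urllib.request

body = json.dumps({
    "model": "llama3.1:8b",   # placeholder model name, substitute your own
    "prompt": "hi",
    "stream": False,
    # A num_gpu larger than the model's layer count asks the runner to place
    # every layer on the GPU, bypassing ollama's offload estimate. As noted
    # above, this can OOM or fall back to slow shared memory.
    "options": {"num_gpu": 999},
}).encode()

req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                             headers={"Content-Type": "application/json"})
print(urllib.request.urlopen(req).read().decode())
```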


Reference: github-starred/ollama#68838