[GH-ISSUE #4668] Low GPU / High CPU Utilization ==> Slow Performance #2935

Closed
opened 2026-04-12 13:18:25 -05:00 by GiteaMirror · 2 comments

Originally created by @tarekeldeeb on GitHub (May 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4668

What is the issue?

Ollama on Ubuntu 22.04 detects my CUDA GPU and loads the model into its memory, but processing seems to run mostly on the CPU. Is this normal behavior? Overall performance is unsatisfying, around 1 token per second, which is much slower than human reading speed.
![Screenshot from 2024-05-27 15-04-47](https://github.com/ollama/ollama/assets/90985/bc949a87-ced5-45f8-bea3-a99dfc739e19)

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38

GiteaMirror added the bug label 2026-04-12 13:18:25 -05:00

@easp commented on GitHub (May 27, 2024):

What model and quantization?

If the model doesn't fit entirely on the GPU, part of it will run from system memory on the CPU. GPU utilization will be low because the GPU has to wait for the slower CPU.

More insight is available through `ollama ps`.
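To see whether a model is likely to fit before pulling it, a rough back-of-the-envelope estimate helps: weight memory is roughly parameter count × bits per weight ÷ 8, plus some headroom for the KV cache and activations. The function below is a sketch of that rule of thumb; the 20% overhead factor and the ~4.5 effective bits for a Q4 quantization are assumptions, not figures from this thread.

```python
def approx_model_size_gb(params_billion: float,
                         bits_per_weight: float,
                         overhead: float = 1.2) -> float:
    """Rough memory estimate for a quantized LLM.

    params_billion: parameter count in billions (e.g. 7 for codellama:7b)
    bits_per_weight: effective bits after quantization (~4.5 for Q4, assumed)
    overhead: multiplier for KV cache / activations (20% is a guess)
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


# codellama:7b at ~4.5 effective bits comes out to roughly 4-5 GB of
# weights plus overhead, so a card with only a few GB of free VRAM
# will force a CPU/GPU split like the one shown below.
print(f"{approx_model_size_gb(7, 4.5):.1f} GB")
```

Compare the estimate against free VRAM reported by `nvidia-smi`; if the estimate exceeds it, expect `ollama ps` to show a CPU/GPU split.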


@tarekeldeeb commented on GitHub (May 28, 2024):

That makes sense now: the model does not fit entirely in GPU memory!

```
$ ollama ps
NAME            ID              SIZE    PROCESSOR        UNTIL
codellama:7b    8fdf8f752f6e    6.8 GB  43%/57% CPU/GPU  25 minutes from now
```

Reference: github-starred/ollama#2935