[GH-ISSUE #4531] Is the GPU working? #2842

Closed
opened 2026-04-12 13:11:04 -05:00 by GiteaMirror · 7 comments

Originally created by @15731807423 on GitHub (May 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4531

Originally assigned to: @dhiltgen on GitHub.

![微信截图_20240520112116](https://github.com/ollama/ollama/assets/45228445/ccdbddbb-a16b-4ea3-aa7d-f5094b3ae8c1)

After running `ollama run llama3:70b`, CPU and GPU utilization rose to 100% while the model was loaded into RAM and VRAM, then dropped back to 0%. When I sent a message, the model began to answer, but the GPU only spiked to 100% at the very start and immediately fell back to 0%; only the CPU kept working. Is this normal?

GiteaMirror added the gpu and nvidia labels 2026-04-12 13:11:04 -05:00

@pdevine commented on GitHub (May 20, 2024):

@15731807423 what's the output of `ollama ps`? It should tell you how much of the model is on the GPU and how much is on the CPU.


@15731807423 commented on GitHub (May 20, 2024):

@pdevine
These should be the occupied RAM and VRAM.
GPU utilization stays at 0% the whole time it is answering.

```
NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:70b      be39eb53a197    41 GB   42%/58% CPU/GPU 4 minutes from now

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:latest   a6990ed6be41    5.4 GB  100% GPU        4 minutes from now
```

@pdevine commented on GitHub (May 20, 2024):

@15731807423 looks like 70b is being partially offloaded, and 8b is running fully on the GPU. When you do `/set verbose`, how many tokens/second are you getting? With llama3:latest I would expect about 120-125 tokens/second with a 4090. 70b will be much, much slower, both because almost half of the model is on the CPU and because it's a huge model to begin with. You should be getting around 2-3 tokens/sec, although it will vary depending on your CPU.

Here's my `ollama ps` output on the 4090:

```
$ ollama ps
NAME      	ID          	SIZE 	PROCESSOR      	UNTIL
llama3:70b	be39eb53a197	41 GB	40%/60% CPU/GPU	4 minutes from now
```

@15731807423 commented on GitHub (May 20, 2024):

@pdevine
What I don't understand is that GPU utilization stays at 0%. It spikes to 100% for an instant at the beginning, then drops back to 0% a second later, while the CPU keeps working until the answer is complete. Is that correct?

```
(base) PS C:\Windows\System32> ollama run llama3
>>> /set verbose
Set 'verbose' mode.
>>> 你好
😊 你好!我是 Chatbot,很高兴见到你!如果你需要帮助或想聊天,请随时问我。 😊

total duration:       5.8423836s
load duration:        5.4839949s
prompt eval count:    12 token(s)
prompt eval duration: 17.113ms
prompt eval rate:     701.22 tokens/s
eval count:           34 token(s)
eval duration:        334.75ms
eval rate:            101.57 tokens/s

(base) PS C:\Windows\System32> ollama run llama3:70b
>>> /set verbose
Set 'verbose' mode.
>>> 你好
😊 Ni Hao! (您好) Welcome! How can I help you today? 🤔

total duration:       13.0373642s
load duration:        6.6727ms
prompt eval count:    12 token(s)
prompt eval duration: 2.312915s
prompt eval rate:     5.19 tokens/s
eval count:           22 token(s)
eval duration:        10.71453s
eval rate:            2.05 tokens/s
```

@frederickjjoubert commented on GitHub (May 20, 2024):

I think this might be related to https://github.com/ollama/ollama/issues/1651? It doesn't look like `ollama` is using the GPU on PopOS.


@pdevine commented on GitHub (May 20, 2024):

It is using the GPU, but it's not particularly *efficient* at using it, because the model is split across the CPU and GPU and is limited by the machine itself (e.g. slow memory). You can turn the GPU off entirely in the REPL with:

```
>>> /set parameter num_gpu 0
```

which should show you the difference in performance. You can also load a lower number of layers (e.g. `/set parameter num_gpu 1`), which offloads *most* of the model's layers to the CPU. I *believe* the reason the activity monitor shows the GPU not doing much has to do with the bandwidth to the GPU and the contention between system memory and the GPU itself. That said, it's possible we can eke more speed out of this in the future if we're more clever about how we load the model onto the GPU.

Back to CPU only (using `num_gpu 0`) I get:

```
total duration:       3m25.479681006s
load duration:        4.023984693s
prompt eval count:    208 token(s)
prompt eval duration: 41.733919s
prompt eval rate:     4.98 tokens/s
eval count:           259 token(s)
eval duration:        2m39.571141s
eval rate:            1.62 tokens/s
```

or roughly half the speed of the GPU.
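The same `num_gpu` knob can also be set per request through Ollama's HTTP API rather than the REPL. A minimal sketch, assuming the default local server on port 11434 and the documented `/api/generate` endpoint with an `options` field (only the payload is built and printed here; the actual request is left commented out):

```python
# Sketch: setting num_gpu via Ollama's HTTP API instead of the REPL.
# Assumes the documented /api/generate endpoint and options.num_gpu field.
import json
import urllib.request

def build_generate_request(model: str, prompt: str, num_gpu: int) -> dict:
    """Payload equivalent to `/set parameter num_gpu N` in the REPL."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},  # 0 = CPU only, as in the comment above
    }

payload = build_generate_request("llama3:70b", "你好", num_gpu=0)
print(json.dumps(payload))

# To actually send it to a locally running server:
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```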


@dhiltgen commented on GitHub (May 22, 2024):

To expand on what Patrick mentioned: the 42% of the model loaded in system memory does its inference calculations on the CPU, which is significantly slower than the GPU. The GPU quickly finishes its calculations for each step of the inference and then sits idle waiting for the CPU to catch up. The closer you can get to 100% on GPU, the better the performance will be. If you have further questions, let us know.
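That "GPU sits idle" effect can be illustrated with a toy back-of-the-envelope model (all numbers below are assumptions for illustration, not measurements): each token passes through every layer in sequence, and the CPU layers take far longer per layer, so the GPU's share of wall-clock time is tiny and a utilization monitor rounds it down to ~0%.

```python
# Toy model of a partially offloaded run. Layer counts approximate a
# 42%/58% split of an 80-layer model; per-layer times are assumed.
CPU_LAYERS, GPU_LAYERS = 34, 46
T_CPU_MS, T_GPU_MS = 10.0, 0.5   # assumed ms per layer per token

cpu_time = CPU_LAYERS * T_CPU_MS  # time spent on CPU layers per token
gpu_time = GPU_LAYERS * T_GPU_MS  # time spent on GPU layers per token
busy = gpu_time / (cpu_time + gpu_time)

# The GPU is busy only a few percent of each token's wall-clock time,
# so a utilization monitor shows it as essentially idle.
print(f"GPU busy fraction per token: {busy:.1%}")
```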
