[GH-ISSUE #8365] When I use multiple GPUs, the utilization is very low. How can I configure it to maximize GPU utilization and reduce the reasoning time? #67422

Closed
opened 2026-05-04 10:18:28 -05:00 by GiteaMirror · 5 comments

Originally created by @RoRui on GitHub (Jan 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8365

The graphics card I am using is a Tesla M60 16G and the model I am using is qwen2.5:14b.

When I use only one GPU core, GPU utilization reaches up to 100%, and writing an 800-word article takes about 50 seconds.

Then I set the environment variable OLLAMA_SCHED_SPREAD=1 and used 2 GPU cores, but the utilization of each core was only about 50%. I also tried 6 graphics cards with 12 GPU cores and got only about 7% utilization per core. In both cases there was no reduction in the time it took to write an 800-word article.
How can I configure it to maximize GPU utilization and reduce the reasoning time?
![73f996d24dfeb261a9a597fda1ce2d3](https://github.com/user-attachments/assets/d54d4350-54dd-414d-b2ae-210aedfffc49)
![cc03a23bab08dfe94e0badda5df933c](https://github.com/user-attachments/assets/58c75947-2989-43de-9ff2-8c5139ce16d0)
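
For reference, this is roughly how the configuration above was applied; the prompt text and the CUDA_VISIBLE_DEVICES value are illustrative examples, not my exact settings:

```
# Illustrative sketch of the multi-GPU setup described above (values are examples).
export OLLAMA_SCHED_SPREAD=1          # ask the scheduler to spread the model across all visible GPUs
export CUDA_VISIBLE_DEVICES=0,1       # expose two GPU cores of the M60 (example value)
ollama serve &                        # restart the server so the variables take effect

ollama run qwen2.5:14b "Write an 800-word article about renewable energy."

# Watch per-GPU utilization while the model generates
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```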

GiteaMirror added the feature request label 2026-05-04 10:18:28 -05:00

@rick-github commented on GitHub (Jan 9, 2025):

https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990


@RoRui commented on GitHub (Jan 10, 2025):

> [#7648 (comment)](https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990)

Thank you


@JohnSmithToYou commented on GitHub (Jan 11, 2025):

I am also experiencing this with my 2x 4090s. @jmorganca I think `num_gpu` is not being calculated correctly for multiple graphics cards after KV cache quantization was introduced.
I was computing my context length using this article: https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama
I created model files for `Qwen2.5-Coder-32B-Instruct-Q6_K`, `Dracarys2-72B-Instruct-reason-i1-Q4_K_S`, and `QwQ-32B-Preview-code-Q6_K` and maxed out the context so that it would stay in memory.
When I run them (`ollama run`), Ollama complains they don't fit in VRAM. You can get it to work if you force all of the layers onto the GPU: `/set parameter num_gpu 999`. This wasn't necessary before KV cache support.

Until this is resolved, add this to your model files:

```
PARAMETER num_gpu 999
```

_Note: It doesn't matter if OLLAMA_KV_CACHE_TYPE is set or not._
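
For anyone applying the workaround, here is a minimal sketch of the whole flow; the base model tag and the `num_ctx` value are illustrative, not the exact ones I used:

```
# Hypothetical example: bake num_gpu 999 (and a large context) into a custom model.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:32b-instruct-q6_K
PARAMETER num_ctx 32768
PARAMETER num_gpu 999
EOF

ollama create qwen2.5-coder-32k -f Modelfile
ollama run qwen2.5-coder-32k
```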

Could this be related to #8188?


@rick-github commented on GitHub (Jan 11, 2025):

This is not the same as what you are experiencing on your GPUs. Create a new issue and add full logs.

> Could this be related to #8188?

No.


@smartG666 commented on GitHub (Mar 31, 2025):

I have encountered the same problem. I hope you have solved it. I am looking forward to your answer.

Reference: github-starred/ollama#67422