[GH-ISSUE #6913] On multi-GPU inference speed limited by performance of single CPU core #30133

Closed
opened 2026-04-22 09:36:25 -05:00 by GiteaMirror · 1 comment

Originally created by @jonathankfmn on GitHub (Sep 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6913

What is the issue?

![image](https://github.com/user-attachments/assets/77256c03-4d29-45a7-96d9-2f01901a782c)
![image](https://github.com/user-attachments/assets/f82028c3-e240-47a6-833c-519117caeec3)

I have a server with 4 RTX 4090 GPUs. When I run a model, each GPU sits at only 10-25% utilization, while the CPU shows 100% on a single thread. All layers are loaded onto the GPUs.
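As a sanity check, full GPU offload can be confirmed by comparing `size` against `size_vram` in the output of Ollama's `/api/ps` endpoint. A minimal sketch, assuming a local server on the default port:

```python
import requests

# List currently loaded models. If size_vram equals size, the model is
# fully offloaded to GPU memory; size_vram == 0 would mean CPU-only.
r = requests.get("http://localhost:11434/api/ps", timeout=10)
r.raise_for_status()
for m in r.json().get("models", []):
    frac = m["size_vram"] / m["size"] if m["size"] else 0.0
    print(f'{m["name"]}: {frac:.0%} of model bytes in VRAM')
```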

I already tried the `num_thread` parameter, but it does not change anything; a sketch of setting it through the API follows below.

I believe the single CPU thread is limiting the response speed. Is there a way to improve the speed, perhaps by changing the configuration?
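For reference, a minimal sketch of passing `num_thread` per request through Ollama's REST API; the model name, prompt, and thread count are placeholders, and a local server on the default port is assumed:

```python
import requests

# Generate with an explicit CPU thread count. Note: num_thread governs
# CPU-side compute; when every layer is offloaded to the GPUs, raising
# it is not expected to help, which matches what was observed here.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",            # placeholder model name
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_thread": 8},   # the parameter tried above
    },
    timeout=600,
)
r.raise_for_status()
print(r.json()["response"])
```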

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.11

GiteaMirror added the performance, nvidia, bug, amd labels 2026-04-22 09:36:25 -05:00

@rick-github commented on GitHub (Sep 23, 2024):

https://github.com/ggerganov/llama.cpp/issues/8684
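
For anyone trying to quantify this bottleneck from the client side, decode throughput can be computed from the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields that Ollama returns with a completed generation. A minimal sketch, with the same placeholder model name and default-port assumption as above:

```python
import requests

# Compute decode throughput from the stats attached to a completed,
# non-streaming generation: eval_count is the number of generated
# tokens and eval_duration is in nanoseconds.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",  # placeholder model name
        "prompt": "Write a haiku about GPUs.",
        "stream": False,
    },
    timeout=600,
)
r.raise_for_status()
stats = r.json()
tokens_per_sec = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"decode throughput: {tokens_per_sec:.1f} tokens/s")
```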


Reference: github-starred/ollama#30133