[GH-ISSUE #11917] 2*4090---The highest GPU load rate can only reach 50% #7910

Closed
opened 2026-04-12 20:04:34 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @niboliang on GitHub (Aug 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11917

What is the issue?

Two 4090 graphics cards are configured on an Ubuntu server with load balancing enabled, but each graphics card's GPU load only reaches about 50%.

![Image](https://github.com/user-attachments/assets/c987dc02-55f5-4cfc-8ee9-f8d979c08526)

Relevant log output


OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.9.2

GiteaMirror added the bug label 2026-04-12 20:04:34 -05:00

@rick-github commented on GitHub (Aug 15, 2025):

Model layers are processed sequentially in ollama. GPU-0 gets one half of the layers. GPU-1 gets the other half. When a single inference starts, GPU-0 runs at 100% doing inference with the first half of the model layers, while GPU-1 is idle. When GPU-0 is done, the CPU transfers the intermediate results to GPU-1, which then runs at 100% over the second half of the layers while GPU-0 is idle. This means that on average for a single inference, each GPU is used half of the time. Generally, %usage = 100/number_of_devices.
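A back-of-the-envelope sketch of that arithmetic (not Ollama code; the function and timings are purely illustrative): with sequential layer splitting, each GPU is busy only for its own slice of the layers while the others wait, so per-GPU utilization averages 100/N for a single request.

```python
# Illustrative model of sequential layer-split inference across N GPUs.
# Not taken from Ollama's source; it just reproduces the %usage = 100/N figure.

def average_utilization(num_gpus: int, layer_time_per_gpu: float = 1.0) -> float:
    """Each GPU runs its slice of layers, then idles while the other GPUs run theirs."""
    total_wall_time = num_gpus * layer_time_per_gpu   # GPUs execute one after another
    busy_time_per_gpu = layer_time_per_gpu            # each GPU is busy only for its own slice
    return 100.0 * busy_time_per_gpu / total_wall_time

for n in (1, 2, 4):
    print(f"{n} GPU(s): ~{average_utilization(n):.0f}% average utilization per GPU")
# 1 GPU(s): ~100%
# 2 GPU(s): ~50%   <- matches the 2x4090 observation in this issue
# 4 GPU(s): ~25%
```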


@niboliang commented on GitHub (Aug 16, 2025):

> Model layers are processed sequentially in ollama. GPU-0 gets one half of the layers. GPU-1 gets the other half. When a single inference starts, GPU-0 runs at 100% doing inference with the first half of the model layers, while GPU-1 is idle. When GPU-0 is done, the CPU transfers the intermediate results to GPU-1, which then runs at 100% over the second half of the layers while GPU-0 is idle. This means that on average for a single inference, each GPU is used half of the time. Generally, %usage = 100/number_of_devices.

Does this mean that ollama does not support dual-card inference, so no matter how it is set up, the GPU usage rate will only be 50%? Thank you.


@rick-github commented on GitHub (Sep 1, 2025):

> Does this mean that ollama does not support dual-card inference, so no matter how it is set up, the GPU usage rate will only be 50%?

For a single generation, yes. If the clients send multiple concurrent requests, GPU usage will be higher.
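A minimal client-side sketch of sending concurrent requests (not from the thread): it assumes a local Ollama server on the default port 11434 and a model tag of `llama3` (substitute your own); how many requests the server actually runs in parallel is governed by its scheduling settings such as `OLLAMA_NUM_PARALLEL`.

```python
# Send several /api/generate requests in parallel so both GPUs have work at once.
import concurrent.futures
import json
import urllib.request

URL = "http://localhost:11434/api/generate"  # default Ollama endpoint (assumption)

def generate(prompt: str) -> str:
    payload = json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

prompts = [f"Write a haiku about GPU number {i}" for i in range(4)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(generate, prompts):
        print(answer[:60], "...")
```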

Reference: github-starred/ollama#7910