[GH-ISSUE #12768] Deepseek-r1:70b seems only uses one GPU out of four #8469

Closed
opened 2026-04-12 21:09:43 -05:00 by GiteaMirror · 2 comments

Originally created by @citystrawman on GitHub (Oct 24, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12768

Hi everyone, I am using Ragflow+Ollama to launch a Deepseek-r1:70b LLM, and my machine has 4 RTX 4090 graphics cards.
When I tried to test the model with prompts, I checked the GPU information using nvidia-smi, and the output is as follows:

[Screenshot of nvidia-smi output: https://github.com/user-attachments/assets/866d9b8e-679f-4dac-b7f6-b3cc75d0d26a]

From the above, it seems that the server only uses one GPU out of four, and the answer-generation speed is not satisfactory. I also found a similar post, #7104 (https://github.com/ollama/ollama/issues/7104), which contains an important piece of information:

in general, splitting models across multiple GPUs is typically done for larger models that exceed the VRAM capacity of a single GPU. If a model fits comfortably within the memory of one GPU, distributing it across two GPUs often adds complexity without a significant performance boost. You are correct that GPUs already use parallelism efficiently, but the added data exchange between GPUs can slow things down rather than accelerate them. However, some specialized frameworks may support multi-GPU inference with optimizations to reduce the overhead, though it’s not the default approach.

Does that mean that, in general, using multiple GPUs does little to speed up answering? If not, how can I take advantage of the 4 GPUs to speed up the answering?

@rick-github commented on GitHub (Oct 24, 2025):

Correct, in Ollama adding GPUs does not speed up the answer to an individual request. Multiple GPUs are useful for parallel queries. If you were to parallelize your workload and increase OLLAMA_NUM_PARALLEL (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests), the overall time to answer all of your requests would decrease.

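As an illustrative sketch of the parallel-query pattern described above (not from the original thread): it assumes a default Ollama server on http://localhost:11434 started with OLLAMA_NUM_PARALLEL raised (for example, `OLLAMA_NUM_PARALLEL=4 ollama serve`); the model name, prompts, and worker count are placeholders.

```python
# Sketch: issue several prompts concurrently against a local Ollama server.
# Assumes the server was started with OLLAMA_NUM_PARALLEL raised, e.g.
#   OLLAMA_NUM_PARALLEL=4 ollama serve
# and is listening on the default http://localhost:11434.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "deepseek-r1:70b"  # placeholder model name

def generate(prompt: str) -> str:
    """Send one non-streaming /api/generate request and return the response text."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Explain what nvidia-smi reports.",
    "List three uses of retrieval-augmented generation.",
    "What is tensor parallelism?",
]

# Each request still runs at single-request speed; the benefit of parallel
# queries is that the whole batch finishes sooner.
with ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, answer in zip(prompts, pool.map(generate, prompts)):
        print(prompt, "->", answer[:80])
```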

@pdevine commented on GitHub (Oct 24, 2025):

I'm going to close this as answered (thank you @rick-github !), but feel free to follow up if you have more questions.

Reference: github-starred/ollama#8469