[GH-ISSUE #7416] Optimizing Single Inference Performance on Distributed GPUs with Ollama’s Parallel Inference #4716

Closed
opened 2026-04-12 15:39:35 -05:00 by GiteaMirror · 4 comments

Originally created by @jibinghu on GitHub (Oct 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7416

Hey guys,

I have a server with dual A100 GPUs and a server with a single V100 GPU. Knowing the IP addresses, ports, and passwords of both servers, I want to use Ollama’s parallel inference functionality to perform a single inference request on the llama3.1-70B model. How can I achieve optimal performance for a single request when using Ollama for inference? Do I need to use MPI or other distributed methods? Will this make the model inference faster?

Thanks.

GiteaMirror added the feature request label 2026-04-12 15:39:35 -05:00

@rick-github commented on GitHub (Oct 30, 2024):

https://github.com/ollama/ollama/issues/7104


@jibinghu commented on GitHub (Oct 30, 2024):

In fact, my question is about how to configure load balancing across different servers, rather than setting up multi-GPU on a single server. I'd like to see if Ollama has a similar configuration method for this. Thanks.

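(Editor's note: stock Ollama has no built-in cross-server load balancing, but independent requests can be distributed from the client side. The sketch below is a minimal example, assuming both machines are already running `ollama serve` and are reachable at the hypothetical addresses shown; it uses Ollama's standard `/api/generate` HTTP endpoint and simply round-robins requests, so it can improve throughput across many requests, not the latency of a single one.)

```python
import itertools
import requests  # assumes the `requests` package is installed

# Hypothetical addresses of the two Ollama servers; replace with your own.
SERVERS = ["http://10.0.0.1:11434", "http://10.0.0.2:11434"]
_next_server = itertools.cycle(SERVERS)

def generate(prompt: str, model: str = "llama3.1:70b") -> str:
    """Send one non-streaming generate request to the next server in the rotation."""
    base = next(_next_server)
    resp = requests.post(
        f"{base}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(generate("Why is the sky blue?"))
```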

@rick-github commented on GitHub (Oct 30, 2024):

One of the points of #7104 is that multi-gpu, whether on one server or more, will not significantly speed up a single inference request. If you want to experiment with distributed inference, see https://github.com/ollama/ollama/pull/6729. Other than OLLAMA_NUM_PARALLEL and OLLAMA_SCHED_SPREAD, stock ollama has no configuration options for this sort of setup.

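(Editor's note: for completeness, a minimal sketch of how those two environment variables are typically set when starting the server; the values shown are illustrative, and neither variable speeds up a single inference request. In practice they are usually exported in the shell or systemd unit that launches `ollama serve`; the Python launch below is just one self-contained way to show it.)

```python
import os
import subprocess

# Copy the current environment and add the two scheduler-related variables.
env = os.environ.copy()
env["OLLAMA_NUM_PARALLEL"] = "4"  # how many requests each loaded model serves concurrently
env["OLLAMA_SCHED_SPREAD"] = "1"  # spread a model across all GPUs instead of packing onto as few as possible

# Start the Ollama server with the modified environment.
subprocess.run(["ollama", "serve"], env=env)
```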

@jibinghu commented on GitHub (Oct 30, 2024):

Alright, I will continue working on #6729. Thank you for your help. I will close this issue.

Reference: github-starred/ollama#4716