[GH-ISSUE #2319] Distributed LLM support? #1337

Closed
opened 2026-04-12 11:10:19 -05:00 by GiteaMirror · 11 comments

Originally created by @Donno191 on GitHub (Feb 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2319

Originally assigned to: @bmizerany on GitHub.

I have 3 PCs with a 3090 and 1 PC with a 4090. Currently I am running Ollama on my 4090 and it works great for loading different models on the go, but the bottleneck is loading larger models and bigger context windows into the 24 GB of VRAM. It would be great to have something like Petals or MPI support in llama.cpp.

IDEA:
Maybe have an Ollama worker running on each of my 3 PCs with a 3090, holding part of the distributed LLM; then, if the Ollama server on my 4090 PC needs to load a large model, it could use the 3090s to increase the available VRAM to 96 GB.

This would help relieve the VRAM bottleneck of consumer hardware and also help businesses utilize idle resources for LLMs.
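
To make the idea concrete, here is a rough, purely hypothetical back-of-the-envelope sketch (not something Ollama supports today) of splitting a large model's layers across the pooled GPUs in proportion to each card's VRAM. All numbers are placeholders, and the sketch ignores KV cache and activation overhead.

```python
# Hypothetical sketch of the VRAM-pooling idea: split a model's layers across
# the main 4090 and three 3090 workers in proportion to each card's memory.
# All figures are illustrative; KV cache/activation overhead is ignored.

GPUS = {"4090 (main)": 24, "3090 #1": 24, "3090 #2": 24, "3090 #3": 24}  # GiB
MODEL_GIB = 80    # rough weight size of a ~70B model at a mid-range quantization
NUM_LAYERS = 80   # typical layer count for that size class

total_vram = sum(GPUS.values())  # 96 GiB pooled
assert MODEL_GIB <= total_vram, "model would not fit even when pooled"

gib_per_layer = MODEL_GIB / NUM_LAYERS
for name, vram in GPUS.items():
    layers = round(vram / total_vram * NUM_LAYERS)
    print(f"{name}: ~{layers} layers (~{layers * gib_per_layer:.1f} GiB)")
```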

GiteaMirror added the feature request label 2026-04-12 11:10:19 -05:00

@remy415 commented on GitHub (Feb 2, 2024):

Distributed LLM inference is a very difficult problem. You may be able to hack MPI support into llama.cpp, but AFAIK it's still in development. Check the upstream llama.cpp repo.


@rujaun commented on GitHub (Feb 2, 2024):

I'd be interested as well. How do big corporations run inference on these massive models?


@remy415 commented on GitHub (Feb 2, 2024):

Because big corporations have dozens, if not hundreds, of people on teams dedicated to installing, configuring, and programming the hardware and software needed to run distributed workloads. Their software is also highly customized and specialized. Adding this kind of support to an application across a multitude of hardware and OS configurations is exponentially harder: proper distribution of tasks requires fast interconnects between the hardware, and software tuned to the exact hardware being run so as not to induce OOM errors. Most users who don't have $200k+ (probably more) lying around to run a multi-GPU setup connected via a high-speed bus will be using multiple systems connected via Ethernet, usually at a maximum throughput of 1 Gbps. Additionally, the systems themselves won't have the exact same hardware setup. Hell, even running RAM in dual/quad-channel mode used to be tricky even when buying it in matched pairs; imagine trying to synchronize and coordinate multiple systems of varying configurations in a software stack that relies heavily on low latency and high throughput and is prone to running out of memory.

Distributed ML will come, but when it does it will be supported on the backend (llama_cpp) before the front end (Ollama).


@easp commented on GitHub (Feb 3, 2024):

> You may be able to hack MPI support into llama.cpp but afaik it’s still in development. Check the upstream llamacpp repo

The author of the MPI support in llama.cpp has commented elsewhere (Reddit, I think) that MPI support is currently broken, and I don't think they have much time to fix it.


@liyimeng commented on GitHub (Mar 20, 2024):

Any chance to incorporate this project https://github.com/b4rtaz/distributed-llama into Ollama?


@bmizerany commented on GitHub (Apr 8, 2024):

There are no plans for direct support of distributed-llama.


@bmizerany commented on GitHub (Apr 8, 2024):

We are not currently working on distributed LLM support. Closing for now, but we can reopen/revisit when we look into it.


@sreevatsank1999 commented on GitHub (Aug 6, 2024):

Llama.cpp now supports distributed inference (with some limitations)

https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacpp_now_supports_distributed_inference/
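
For anyone who wants to experiment with that, here is a minimal sketch of how the llama.cpp RPC setup is typically wired together. The worker addresses, port, and model file below are placeholders, and the rpc-server binary and --rpc flag are assumptions based on the upstream llama.cpp RPC example, so double-check them against your build.

```python
# Sketch: print the commands for llama.cpp's RPC-based distributed inference.
# Assumes llama.cpp was built with its RPC backend enabled and that the
# rpc-server / llama-cli binaries and the --rpc flag match your version.

WORKERS = ["192.168.1.11", "192.168.1.12", "192.168.1.13"]  # e.g. the 3090 PCs
RPC_PORT = 50052                    # placeholder port
MODEL = "llama-3-70b-q4_k_m.gguf"   # placeholder model file

def worker_command() -> str:
    # Run on each worker: expose that machine's GPU over the network.
    return f"./rpc-server --host 0.0.0.0 --port {RPC_PORT}"

def main_host_command() -> str:
    # Run on the main host (the 4090 PC): offload layers across all workers.
    endpoints = ",".join(f"{h}:{RPC_PORT}" for h in WORKERS)
    return f'./llama-cli -m {MODEL} -ngl 99 --rpc {endpoints} -p "Hello"'

if __name__ == "__main__":
    for host in WORKERS:
        print(f"# on {host}:")
        print(worker_command())
    print("# on the main host:")
    print(main_host_command())
```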


@sammcj commented on GitHub (Aug 25, 2024):

Would be awesome to see llama.cpp's RPC / distributed inference as an option in Ollama.


@Schaekermann commented on GitHub (Sep 25, 2024):

I'm also searching for a way to do distributed inference on poor hardware. I have found this:
https://github.com/exo-explore/exo
I have managed to get it up and running on a single Dell Precision 3420 with Ubuntu 22.04, but haven't tested it yet. I will come back with results ASAP.


@APiTJLillo commented on GitHub (Oct 2, 2024):

I was just looking at exo too. I'd be willing to help however I can. I am working on building a machine with a few 24 GB M40s, and I have a 4090 laptop (16 GB), a 1070 (8 GB) machine, and a P4000 (8 GB) machine.
