[GH-ISSUE #6439] How to load multiple copies of the same model on different GPUs? #66086

Closed
opened 2026-05-03 23:53:57 -05:00 by GiteaMirror · 13 comments

Originally created by @EGOIST5 on GitHub (Aug 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6439

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

On Linux, I start the Ollama server with the following command:
CUDA_VISIBLE_DEVICES=1,2,3,4,5 OLLAMA_MAX_LOADED_MODELS=5 ./ollama-linux-amd64 serve &
I then run several Python scripts (py files) that all use llama3.1:70b, but they all end up hitting the same loaded model. That is, only one GPU is active. I want each of my five GPUs to load its own copy of llama3.1:70b so the scripts run against different copies.
Is there a way to achieve this? Thank you!
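
Each py file presumably just sends requests to the single Ollama endpoint, so every request boils down to something like the following (the prompt is only a placeholder):

curl http://localhost:11434/api/generate -d '{"model": "llama3.1:70b", "prompt": "..."}'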

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the feature request label 2026-05-03 23:53:57 -05:00

@wrapss commented on GitHub (Aug 20, 2024):

But why load the same model several times? Couldn't you just enable parallel requests? (OLLAMA_NUM_PARALLEL + OLLAMA_SCHED_SPREAD)


@EGOIST5 commented on GitHub (Aug 20, 2024):

> But why load the same model several times? Couldn't you just enable parallel requests? (OLLAMA_NUM_PARALLEL + OLLAMA_SCHED_SPREAD)

Thank you, I think my way would be faster.


@wrapss commented on GitHub (Aug 20, 2024):

But what's the point of loading a model several times? Just explain, it's easy.


@EGOIST5 commented on GitHub (Aug 20, 2024):

> But what's the point of loading a model several times? Just explain, it's easy.

I mean that each of my five GPUs would be loaded with its own llama3.1:70b, so no GPU is idle; that way would be the fastest.


@wrapss commented on GitHub (Aug 20, 2024):

That's what OLLAMA_SCHED_SPREAD is all about.


@wrapss commented on GitHub (Aug 20, 2024):

OLLAMA_SCHED_SPREAD splits the model across all your GPUs, and OLLAMA_NUM_PARALLEL lets you choose how many requests can be served at the same time, which is literally what you want to do, just 100x simpler.


@EGOIST5 commented on GitHub (Aug 20, 2024):

Thank you. I set OLLAMA_SCHED_SPREAD=3 and OLLAMA_NUM_PARALLEL=1, then ran several py files. It still loads a single llama3.1:70b and splits it across three GPUs, even though one GPU can hold the whole model. By the way, if I set OLLAMA_NUM_PARALLEL to the number of my py files, does it affect accuracy?


@wrapss commented on GitHub (Aug 20, 2024):

It's the other way around: OLLAMA_SCHED_SPREAD should be 1 (it's effectively a boolean), and OLLAMA_NUM_PARALLEL should be the number of py files you launch at the same time. This won't affect quality, but it will affect tokens/sec.
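
Put together, the server start line from the issue would then look something like this (the parallel count of 3 is only illustrative; match it to the number of scripts you run concurrently):

CUDA_VISIBLE_DEVICES=1,2,3,4,5 OLLAMA_SCHED_SPREAD=1 OLLAMA_NUM_PARALLEL=3 ./ollama-linux-amd64 serve &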


@EGOIST5 commented on GitHub (Aug 20, 2024):

I tried OLLAMA_SCHED_SPREAD=1, OLLAMA_NUM_PARALLEL=3, but it still splits the one model across 3 GPUs,
and I tried OLLAMA_SCHED_SPREAD=False, OLLAMA_NUM_PARALLEL=3, which behaves the same way as it did initially.


@mxyng commented on GitHub (Aug 21, 2024):

It's not currently possible to have multiple copies of the same model active at the same time. This may change in the future.


@rick-github commented on GitHub (Aug 21, 2024):

Before ollama supported parallelism, I played around with running multiple ollama instances and load balancing across them using litellm. It wasn't deployed to production, so I don't know how reliable it would have been.
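
A rough sketch of that multi-instance approach, assuming each instance is pinned to one GPU and bound to its own port via OLLAMA_HOST (ports and GPU IDs here are illustrative, not a tested configuration):

# one server per GPU, each on its own port
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ./ollama-linux-amd64 serve &
CUDA_VISIBLE_DEVICES=2 OLLAMA_HOST=127.0.0.1:11436 ./ollama-linux-amd64 serve &
# ...repeat for the remaining GPUs, then load-balance requests across the ports (e.g. with litellm)

Each instance sees a single GPU, so each keeps its own copy of llama3.1:70b resident.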


@ovaisq commented on GitHub (Aug 22, 2024):

Are options such as OLLAMA_SCHED_SPREAD and others documented somewhere? Also, if there's another venue where I should be posting this, please let me know. Thanks!
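
One place these variables do show up, assuming a reasonably recent build, is the serve command's own help output:

./ollama-linux-amd64 serve --help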


@dhiltgen commented on GitHub (Oct 22, 2024):

Support for loading the same model multiple times (across multiple GPUs) is tracked via #3902

Reference: github-starred/ollama#66086