[GH-ISSUE #9310] 8 GPUs want to start 8 same models #6075

Closed
opened 2026-04-12 17:24:14 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @AltenLi on GitHub (Feb 24, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9310

I have tried many approaches.

Environment: Windows 10 22H, latest nightly build.

Methods tried:
1. Multiple ollama serve instances: failed to split across the GPUs.
2. Multiple Ollama Docker containers: --gpus all with CUDA_VISIBLE_DEVICES=0, --gpus all with CUDA_VISIBLE_DEVICES=1, and so on, each mounting a separately copied local model path into the container (to save download time). Loading the model is extremely slow (a 32B model took 15 minutes), and the containers are often killed when GPU VRAM runs out. A rough sketch of this setup is below.
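
For reference, the per-GPU container setup in method 2 looks roughly like this (untested sketch; device IDs, ports, container names, and model paths are placeholders):

# one container per GPU, each exposed on its own host port,
# each with its own copied model store to avoid re-downloading
docker run -d --name ollama-gpu0 --gpus device=0 \
  -v /data/ollama-models-0:/root/.ollama \
  -p 11434:11434 ollama/ollama
docker run -d --name ollama-gpu1 --gpus device=1 \
  -v /data/ollama-models-1:/root/.ollama \
  -p 11435:11434 ollama/ollama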

What I need:
multiple servers of the same model, running locally.

Thanks!!!

GiteaMirror added the feature request label 2026-04-12 17:24:14 -05:00

@ShadovvSinger commented on GitHub (Feb 25, 2025):

See https://github.com/ollama/ollama/issues/7206#issuecomment-2413928300.
You can try starting multiple Ollama instances, each on a different GPU and a different port, and use nginx to distribute the requests.

I would also really like this feature (one Ollama instance serving multiple copies across multiple GPUs);
please follow https://github.com/ollama/ollama/issues/3902.
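
For the nginx part, a minimal round-robin proxy could look like this (untested sketch, assuming two instances already listen on 127.0.0.1:11434 and 127.0.0.1:11435; the config path, ports, and listen port are placeholders):

# write a minimal load-balancing config for nginx
sudo tee /etc/nginx/conf.d/ollama-lb.conf >/dev/null <<'EOF'
upstream ollama_backends {
    server 127.0.0.1:11434;
    server 127.0.0.1:11435;
}
server {
    listen 8080;
    location / {
        proxy_pass http://ollama_backends;
        proxy_read_timeout 600s;
    }
}
EOF
sudo nginx -s reload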


@ShadovvSinger commented on GitHub (Feb 25, 2025):

"1.multi ollama serve: failed on gpu split."
I have successfully deployed using this method.
I use screen to start different terminal, set CUDA_VISIBLE_DEVICES=0/1/2/3/4... (the CUDA number)
OLLAMA_HOST= different port
I don't know if windows support this method, but it should work some how.
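
As a concrete example of this approach on Linux (untested on Windows; the GPU count, ports, and session names are placeholders):

# one detached screen session per GPU; each ollama serve is pinned to one GPU and one port
for i in 0 1 2 3; do
  port=$((11434 + i))
  screen -dmS "ollama-gpu$i" env CUDA_VISIBLE_DEVICES=$i OLLAMA_HOST=127.0.0.1:$port ollama serve
done

# sanity check against the second instance
curl http://127.0.0.1:11435/api/tags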


@wenhui-ml commented on GitHub (May 4, 2025):

Hi, is there any progress on running the same LLM model on multiple GPUs on Windows, in order to serve distributed requests?


Reference: github-starred/ollama#6075