[GH-ISSUE #7206] Help, is it possible to copy and deploy a small size model onto multiple GPU cards? #66631

Closed
opened 2026-05-04 07:39:29 -05:00 by GiteaMirror · 2 comments

Originally created by @tigflanker on GitHub (Oct 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7206

The current situation is as follows:

I have 4 T4 GPU cards, each with 16 GB of memory. My model is a quantized version that is only 5 GB.

Considering a multi-concurrency scenario in our production environment, I want to load a copy of this model onto all 4 GPUs simultaneously (for subsequent concurrent invocation).

Is there a way to do this? Thanks.

GiteaMirror added the feature request label 2026-05-04 07:39:29 -05:00

@rick-github commented on GitHub (Oct 15, 2024):

ollama doesn't currently support loading the same model more than once. https://github.com/ollama/ollama/issues/3902 is tracking the work for loading a model multiple times but there's no progress so far.

If you want to run multiple copies of the same model, the easiest way at the moment is to start an ollama server on a different port for each GPU, using `CUDA_VISIBLE_DEVICES` to bind each GPU to its server, and then use a reverse proxy (e.g. [nginx](https://docs.nginx.com/nginx/admin-guide/load-balancer/http-load-balancer/) or [ollama_proxy_server](https://github.com/ParisNeo/ollama_proxy_server)) to distribute the requests to the ollama servers.
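
For illustration, a minimal sketch of that layout (not from the original thread; the port numbers, upstream name, and frontend port are arbitrary choices, while `OLLAMA_HOST` is ollama's standard bind-address variable):

```sh
# Start one ollama server per GPU, each pinned to a single T4
# via CUDA_VISIBLE_DEVICES and listening on its own port.
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=2 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
CUDA_VISIBLE_DEVICES=3 OLLAMA_HOST=127.0.0.1:11437 ollama serve &
```

And an nginx fragment (inside the `http` context) that fans client requests out over the four backends:

```nginx
upstream ollama_pool {
    least_conn;               # send each request to the least-busy backend
    server 127.0.0.1:11434;
    server 127.0.0.1:11435;
    server 127.0.0.1:11436;
    server 127.0.0.1:11437;
}

server {
    listen 11433;             # single endpoint clients talk to
    location / {
        proxy_pass http://ollama_pool;
        proxy_buffering off;            # let streamed tokens through immediately
        proxy_read_timeout 300s;        # generations can take a while
    }
}
```

Clients then point at port 11433, and each single-GPU server holds its own 5 GB copy of the model, well within a T4's 16 GB.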

Another approach would be to set `OLLAMA_SCHED_SPREAD=1` to force the model to be spread across all the GPUs and set `OLLAMA_NUM_PARALLEL=4` to allow concurrent processing. I don't know what the performance of this type of configuration would be.
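
For comparison, that single-server variant is just two environment variables on one instance (same caveat that its performance here is untested):

```sh
# One ollama server: OLLAMA_SCHED_SPREAD forces the model's layers to be
# spread across all visible GPUs instead of packed onto one, and
# OLLAMA_NUM_PARALLEL=4 lets this single instance handle up to 4
# requests concurrently.
OLLAMA_SCHED_SPREAD=1 OLLAMA_NUM_PARALLEL=4 ollama serve
```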


@dhiltgen commented on GitHub (Oct 15, 2024):

Tracking via #3902
