[GH-ISSUE #6654] Multi-instance seems not working #50701

Closed
opened 2026-04-28 16:48:35 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @bigsausage on GitHub (Sep 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6654

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I want to run multiple instances of a model to increase the concurrency of my server.

I start the server with the following command:
CUDA_VISIBLE_DEVICES=3 OLLAMA_NUM_PARALLEL=3 OLLAMA_MAX_LOADED_MODELS=3 /usr/bin/ollama serve

Then I created 3 copies of the model:
ollama create my_lama3_1 -f ./Modelfile1
ollama create my_lama3_2 -f ./Modelfile2
ollama create my_lama3_3 -f ./Modelfile3

Modelfile1, Modelfile2 and Modelfile3 point to llama3_quantize_1.gguf, llama3_quantize_2.gguf and llama3_quantize_3.gguf respectively, which are actually all the same int4 gguf (about 5 GB).
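
(Each Modelfile here presumably boils down to a single FROM line pointing at its gguf, something like the sketch below; the exact contents are an assumption.)

# Modelfile1 (assumed contents)
FROM ./llama3_quantize_1.gguf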

ollama list shows 3 different model names, but they all have the same ID:
(screenshot: https://github.com/user-attachments/assets/ff66289d-cfd0-4b21-9d7e-faa7dabd20ee)

nvidia-smi shows that the GPU is only using about 6 GB:
(screenshot: https://github.com/user-attachments/assets/47bfad3e-e140-4d84-98bb-cbe76e5bcf16)

ollama ps shows that only one instance is running:
(screenshot: https://github.com/user-attachments/assets/fb1c0420-19a8-4982-ba2a-0fa521d88b6a)

I wrote a script that randomly distributes requests across the three models (my_lama3_1 / my_lama3_2 / my_lama3_3) to test the concurrency of my server, but GPU memory usage stays around 6 GB, whereas I expected 6 GB * 3 = 18 GB if all three instances were loaded.

Is there any way to get all 3 instances loaded onto the GPU so I can get better QPS out of my server?
Or, if I am not using the commands correctly, please tell me the correct way.
All I want is to improve the QPS of my server; any advice is welcome!!

thanks ~

OS

Linux

GPU

Intel

CPU

Intel

Ollama version

0.2.5

GiteaMirror added the feature request label 2026-04-28 16:48:35 -05:00
Author
Owner

@rick-github commented on GitHub (Sep 5, 2024):

What you have is the same model with three names, so ollama just loads it once. ollama doesn't currently support loading the same model more than once. I say currently because I'm sure I read something some time ago that the developers were considering this, but I can't find a reference so consider that a possible hallucination.

If you want to run multiple copies of the same model, the easiest way at the moment is to start an ollama server on a different port for each GPU, using CUDA_VISIBLE_DEVICES to bind each GPU to its server, and then use a reverse proxy (e.g. nginx or ollama_proxy_server) to distribute the requests to the ollama servers.
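
A minimal sketch of that setup, assuming two GPUs and illustrative ports (OLLAMA_HOST controls the address each server binds to):

# one ollama server per GPU, each bound to its own port
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
# then point a reverse proxy (nginx upstream, ollama_proxy_server, etc.) at both
# back ends and send client traffic to the proxy instead of to a single server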

Another way would be to use slightly different quants for your custom models, e.g. q4_0, q4_1, q4_K_M.
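
The idea with the different-quant route is that each Modelfile points at a different quantization, so the blobs have different digests and ollama treats them as distinct models. A sketch, with illustrative file names:

# Modelfile1: FROM ./llama3_q4_0.gguf
# Modelfile2: FROM ./llama3_q4_1.gguf
# Modelfile3: FROM ./llama3_q4_K_M.gguf
ollama create my_lama3_1 -f ./Modelfile1
ollama create my_lama3_2 -f ./Modelfile2
ollama create my_lama3_3 -f ./Modelfile3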

Author
Owner

@dhiltgen commented on GitHub (Sep 5, 2024):

We're tracking adding the ability to load the same model multiple times in issue #3902

<!-- gh-comment-id:2332145470 --> @dhiltgen commented on GitHub (Sep 5, 2024): We're tracking adding the ability to load the same model multiple times in issue #3902
Author
Owner

@bigsausage commented on GitHub (Sep 6, 2024):

> What you have is the same model with three names, so ollama just loads it once. ollama doesn't currently support loading the same model more than once. I say currently because I'm sure I read something some time ago that the developers were considering this, but I can't find a reference so consider that a possible hallucination.
>
> If you want to run multiple copies of the same model, the easiest way at the moment is to start an ollama server on a different port for each GPU, using CUDA_VISIBLE_DEVICES to bind each GPU to its server, and then use a reverse proxy (e.g. nginx or ollama_proxy_server) to distribute the requests to the ollama servers.
>
> Another way would be to use slightly different quants for your custom models, e.g. q4_0, q4_1, q4_K_M.

thanks, that helps a lot

Author
Owner

@bigsausage commented on GitHub (Sep 6, 2024):

> We're tracking adding the ability to load the same model multiple times in issue #3902

ok ~ look forward to the new feature~~

Reference: github-starred/ollama#50701