[GH-ISSUE #7768] Model not loaded on all GPUs for load balancing #51473

Closed
opened 2026-04-28 20:18:37 -05:00 by GiteaMirror · 4 comments

Originally created by @brauliobo on GitHub (Nov 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7768

What is the issue?

I expect that on a multi-GPU system, with the Docker container started with --gpus all, Ollama would load the model on all GPUs to balance the request load between them.
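For reference, a minimal sketch of how such a container is typically started; the image name, volume, and port mapping are standard defaults assumed here, not taken from this report:

docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama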

Output of docker logs ollama:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes

Output of docker exec -it ollama nvidia-smi:

Wed Nov 20 20:00:03 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:01:00.0  On |                  N/A |
| 54%   67C    P2             99W /  100W |    7203MiB /  12288MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        On  |   00000000:05:00.0 Off |                  N/A |
| 41%   63C    P2             99W /  100W |    8057MiB /  12288MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    1   N/A  N/A        52      C   ...unners/cuda_v12/ollama_llama_server       3344MiB |
+-----------------------------------------------------------------------------------------+

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.4.2 from docker

GiteaMirror added the bug label 2026-04-28 20:18:37 -05:00

@rick-github commented on GitHub (Nov 20, 2024):

https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990

If you want ollama to spread the model across all GPUs, set OLLAMA_SCHED_SPREAD: https://github.com/ollama/ollama/blob/ecf41eed0595fb031f1addc179f6abb86d8405f8/envconfig/config.go#L249
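For example, a minimal sketch of passing that variable when starting the container; the remaining flags mirror a standard Docker setup and are assumptions here:

docker run -d --gpus all -e OLLAMA_SCHED_SPREAD=1 -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama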


@brauliobo commented on GitHub (Nov 20, 2024):

Thanks! It looks like the model is split across the 2 GPUs:

braulio @ whitebeast ➜  ollama  docker exec -it ollama nvidia-smi
Wed Nov 20 20:29:05 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:01:00.0 Off |                  N/A |
| 52%   65C    P2             99W /  100W |    9298MiB /  12288MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        On  |   00000000:05:00.0 Off |                  N/A |
| 40%   63C    P2             99W /  100W |    6647MiB /  12288MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        58      C   ...unners/cuda_v12/ollama_llama_server       2082MiB |
|    1   N/A  N/A        58      C   ...unners/cuda_v12/ollama_llama_server       1934MiB |
+-----------------------------------------------------------------------------------------+

I expected it instead to load the whole model into each GPU to better load balance multiple requests.

Also, the bottleneck will still be the CPU, due to https://github.com/ollama/ollama/issues/6913

Thanks again; that seems to be what is possible for now.


@rick-github commented on GitHub (Nov 20, 2024):

See https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990 for ways to load balance multiple requests.


@brauliobo commented on GitHub (Nov 20, 2024):

Thanks again. I've set -e OLLAMA_NUM_PARALLEL=2, but I wonder how much slower it is compared to running 2 separate servers, each assigned one GPU via CUDA_VISIBLE_DEVICES, since that setup is much more complicated (a sketch of that two-server setup follows the output below).

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:01:00.0 Off |                  N/A |
| 52%   65C    P2             99W /  100W |    8996MiB /  12288MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        On  |   00000000:05:00.0 Off |                  N/A |
| 40%   63C    P2             99W /  100W |    6213MiB /  12288MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        59      C   ...unners/cuda_v12/ollama_llama_server       1720MiB |
|    1   N/A  N/A        59      C   ...unners/cuda_v12/ollama_llama_server       1500MiB |
+-----------------------------------------------------------------------------------------+
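For comparison, a minimal sketch of the two-server alternative mentioned above, pinning one container to each GPU with CUDA_VISIBLE_DEVICES; the container names, volumes, and host ports are illustrative assumptions:

docker run -d --gpus all -e CUDA_VISIBLE_DEVICES=0 -v ollama0:/root/.ollama -p 11434:11434 --name ollama-gpu0 ollama/ollama
docker run -d --gpus all -e CUDA_VISIBLE_DEVICES=1 -v ollama1:/root/.ollama -p 11435:11434 --name ollama-gpu1 ollama/ollama

Requests then have to be distributed between the two endpoints (for example by a reverse proxy or client-side round-robin), which is the extra setup complexity referred to above.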

Reference: github-starred/ollama#51473