[GH-ISSUE #835] Improve GPU scheduling #26160

Closed
opened 2026-04-22 02:13:08 -05:00 by GiteaMirror · 11 comments

Originally created by @slychief on GitHub (Oct 18, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/835

Originally assigned to: @dhiltgen on GitHub.

Hi,

we have several GPUs in our server and use SLURM to manage the resources. SLURM uses CUDA_VISIBLE_DEVICES to assign GPUs to jobs/processes.

When I run Ollama directly from the command line - within a SLURM-managed context with 1 GPU assigned - it uses all available GPUs in the server and ignores CUDA_VISIBLE_DEVICES.

Is there a parameter or any recommendation for how I can specify which GPUs Ollama may use?

PS: a workaround is to use the Docker container, but is there another solution for this, too?
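
For reference, the behaviour later confirmed in this thread is that `ollama serve` picks up `CUDA_VISIBLE_DEVICES` from its own environment. A minimal sketch (the GPU index is illustrative):

```
# Expose only GPU 0 to the Ollama server process
CUDA_VISIBLE_DEVICES=0 ollama serve

# In another shell, confirm which GPU the ollama process is using
nvidia-smi
```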

GiteaMirror added the nvidia, feature request, amd labels 2026-04-22 02:13:08 -05:00

@jtoy commented on GitHub (Dec 7, 2023):

I think ollama should support CUDA_VISIBLE_DEVICES


@jtoy commented on GitHub (Dec 8, 2023):

Just to give more context:
I have a server with a 4090 and a Titan X in it; they are almost 8 years apart, but both work.
Ollama on that box seems to be pretty slow, and I want to test whether that's because Ollama is using both GPUs and the Titan X is slowing it down.

Most GPU software respects CUDA_VISIBLE_DEVICES when deciding which devices it should use.

How would one test and run ollama on a single GPU?


@jtoy commented on GitHub (Dec 12, 2023):

I've started to look into this. It looks like the code has a parameter for this, opts.MainGPU, but the current code doesn't take this flag from outside. If this is something useful, I can look into adding it.
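
For context, `opts.MainGPU` maps to llama.cpp's `main_gpu` setting, which selects the primary device in a multi-GPU split. If it were plumbed through the request options, usage might look like the sketch below (hedged: whether `main_gpu` is honored end-to-end is exactly what this comment questions; the model name is illustrative):

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "options": {
    "main_gpu": 0
  }
}'
```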


@mxyng commented on GitHub (Jan 17, 2024):

@slychief @jtoy can you confirm this is still an issue? I'm not able to reproduce this with the latest (v0.1.20) ollama.

Testing 2 T4 GPUs, I get the following results:

1. `ollama serve` - `CUDA_VISIBLE_DEVICES` not set uses all available GPUs

```
$ nvidia-smi
Wed Jan 17 17:35:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P0              42W /  70W |   2765MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:05.0 Off |                    0 |
| N/A   54C    P0              37W /  70W |   2351MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1840      C   ollama                                     2760MiB |
|    1   N/A  N/A      1840      C   ollama                                     2346MiB |
+---------------------------------------------------------------------------------------+
```

2. `CUDA_VISIBLE_DEVICES=0 ollama serve` - only expose device 0

```
$ nvidia-smi
Wed Jan 17 17:36:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0              43W /  70W |   4741MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:05.0 Off |                    0 |
| N/A   61C    P0              31W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1889      C   ollama                                     4736MiB |
+---------------------------------------------------------------------------------------+
```

3. `CUDA_VISIBLE_DEVICES=1 ollama serve` - only expose device 1

```
$ nvidia-smi
Wed Jan 17 17:37:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P8              12W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:05.0 Off |                    0 |
| N/A   62C    P0              31W /  70W |   4741MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A      1932      C   ollama                                     4736MiB |
+---------------------------------------------------------------------------------------+
```

@iamashwin99 commented on GitHub (Feb 22, 2024):

In our SLURM configuration, we don't set `CUDA_VISIBLE_DEVICES` but rather `SLURM_GPUS`.
Is there a way to handle that?

```console
❯ echo $CUDA_VISIBLE_DEVICES

❯ echo $SLURM_GPUS
1
```

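For illustration, one way to handle a cluster that does not export `CUDA_VISIBLE_DEVICES` is to derive it from SLURM's own variables before starting the server. A hedged sketch; which variable actually carries the assigned indices (`SLURM_JOB_GPUS`, `SLURM_STEP_GPUS`, ...) depends on the site's gres configuration:

```
#!/bin/bash
# Map SLURM's GPU allocation onto CUDA_VISIBLE_DEVICES (sketch only;
# adjust the source variable to match your cluster's configuration).
export CUDA_VISIBLE_DEVICES="${SLURM_JOB_GPUS:-${SLURM_STEP_GPUS:-0}}"
ollama serve
```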

@dhiltgen commented on GitHub (Mar 12, 2024):

@slychief is this still a problem? It looks like Ollama correctly respects the standard CUDA_VISIBLE_DEVICES environment variable. I'm going to close this as fixed, but if you're still seeing it, can you clarify the problem and what you're asking for?


@slychief commented on GitHub (Mar 12, 2024):

Hi. Yes, going through the thread, it seems to be fixed. I haven't had time to test it myself, but I trust that it is solved and that the ticket can be closed.


@jtoy commented on GitHub (Mar 12, 2024):

How does usage work if we have multiple GPUs? One thing I wanted to do: on a machine with 2 GPUs, load Ollama with Mistral on one GPU and Ollama with Mixtral on the other. Is that possible?


@dhiltgen commented on GitHub (Mar 12, 2024):

@jtoy yes, that's possible. Just run them on different ports. Be aware of #1514.
We do have a workaround for memory predictions in 0.1.29 where you can set `OLLAMA_MAX_VRAM=<bytes>` until we get that issue resolved.
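
As a sketch of the two-instance setup described above (ports and model names are illustrative; `OLLAMA_HOST` selects the bind address for the server and the target address for the client):

```
# Instance 1: GPU 0 on the default port (11434)
CUDA_VISIBLE_DEVICES=0 ollama serve &

# Instance 2: GPU 1 on a second port
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Run one model per instance
ollama run mistral
OLLAMA_HOST=127.0.0.1:11435 ollama run mixtral
```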


@wwjCMP commented on GitHub (Apr 27, 2024):

I want to know how to use SLURM to run the Ollama service.


@thedaffodil commented on GitHub (Jul 3, 2024):

> I want to know how to use SLURM to run the Ollama service.

Could you find a solution for that?
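
For anyone landing here with the same question, a hedged batch-script sketch (time limit and model are illustrative; it assumes the cluster exports `CUDA_VISIBLE_DEVICES` for the allocated GPU, as discussed above):

```
#!/bin/bash
#SBATCH --job-name=ollama
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00

# Start the server on the allocated GPU, give it a moment, then issue a request.
ollama serve &
sleep 5
ollama run mistral "Hello from SLURM"
```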

Reference: github-starred/ollama#26160