[GH-ISSUE #1813] How to run Ollama only on a dedicated GPU? (Instead of all GPUs) #63072

Closed
opened 2026-05-03 11:39:09 -05:00 by GiteaMirror · 42 comments
Owner

Originally created by @sthufnagl on GitHub (Jan 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1813

Originally assigned to: @dhiltgen on GitHub.

Hi,

I have 3x 3090s and I want to run an Ollama instance on a dedicated GPU only. The reason: to have three Ollama instances (on different ports) for use with Autogen.
I also tried the "Docker Ollama" without luck.
Or is there another solution?

Let me know...

Thanks in advance

Steve

GiteaMirror added the gpu label 2026-05-03 11:39:09 -05:00

@Tomatcree01 commented on GitHub (Jan 5, 2024):

You could give me the other two


@sthufnagl commented on GitHub (Jan 6, 2024):

:-)


@sthufnagl commented on GitHub (Jan 6, 2024):

Could it be that the number of GPUs used by Ollama is related to the model?
At https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md they mention a "num_gpu" parameter.
==> Do I have to create a new Modelfile from an existing model and include this parameter?
Still searching....


@tarbard commented on GitHub (Jan 6, 2024):

> Could it be that the numbers of GPUs used with Ollama is related to the model? At the page https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md they mentioned a "num_gpu" parameter.

That's just the number of layers. I don't think there's a way to control GPU affinity but I would also like to do this. Another issue for me is it is automatically splitting a model between 2 GPUs even though it would fit on a single GPU (which would be faster) so I would like to just make it use the one with bigger VRAM.
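For reference, a minimal sketch of what `num_gpu` in a Modelfile actually does: it sets how many layers to offload to GPU, not which device gets used (the model name and layer count below are just example values):

```
# num_gpu controls the layer offload count, not GPU selection.
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_gpu 20
EOF
cat Modelfile
# then register it with: ollama create llama2-20layers -f Modelfile
```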


@tarbard commented on GitHub (Jan 6, 2024):

I tried a bit of research - it seems the relevant llama options are

```
-mg i, --main-gpu i: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.

-ts SPLIT, --tensor-split SPLIT: When using multiple GPUs this option controls how large tensors should be split across all GPUs. SPLIT is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
```
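As a worked example of the `--tensor-split` arithmetic: each value is divided by the sum of all values, so "3,2" yields 3/5 and 2/5. The awk one-liner below is only an illustration of that math:

```
# "3,2" -> GPU 0 gets 3/(3+2) = 60%, GPU 1 gets 2/(3+2) = 40% of the data.
echo "3,2" | awk -F, '{
  total = 0
  for (i = 1; i <= NF; i++) total += $i
  for (i = 1; i <= NF; i++) printf "GPU %d: %d%%\n", i - 1, 100 * $i / total
}'
```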

Checking the https://github.com/jmorganca/ollama/blob/main/docs/api.md docs, we should be able to pass main_gpu to the API, so I tried setting main_gpu to 1:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "num_keep": 5,
    "seed": 42,
    "num_predict": 100,
    "top_k": 20,
    "top_p": 0.9,
    "tfs_z": 0.5,
    "typical_p": 0.7,
    "repeat_last_n": 33,
    "temperature": 0.8,
    "repeat_penalty": 1.2,
    "presence_penalty": 1.5,
    "frequency_penalty": 1.0,
    "mirostat": 1,
    "mirostat_tau": 0.8,
    "mirostat_eta": 0.6,
    "penalize_newline": true,
    "stop": ["\n", "user:"],
    "numa": false,
    "num_ctx": 1024,
    "num_batch": 2,
    "num_gqa": 1,
    "main_gpu": 1,
    "low_vram": false,
    "f16_kv": true,
    "vocab_only": false,
    "use_mmap": true,
    "use_mlock": false,
    "embedding_only": false,
    "rope_frequency_base": 1.1,
    "rope_frequency_scale": 0.8,
    "num_thread": 8
  }
}'
```

This didn't seem to work, as the same memory split took place rather than it using only the second GPU. Maybe the option is not yet passed on to llama from ollama. I had a look at the ollama code, but I'm not familiar with Go, so I'm not sure.


@sthufnagl commented on GitHub (Jan 7, 2024):

Thx tarbard...I will check it.


@houstonhaynes commented on GitHub (Jan 7, 2024):

If you're running in three separate containers via docker you can start up each container to only be "aware" of one GPU.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html

```
docker run --gpus '"device=1,2"' \
    nvidia/cuda nvidia-smi --query-gpu=uuid --format=csv
```

@sthufnagl commented on GitHub (Jan 8, 2024):

@houstonhaynes...I had the same idea, but it doesn't work for me. Ollama, running inside Docker, takes all GPUs no matter how I use the Docker parameter "--gpus" (I also tried the ID of a GPU).
:-(
Does it work for you?

My solution now is to split/distribute the 3090s across different PCs. To my surprise, even with very old PC hardware, Ollama runs fast!
Loading a model into VRAM also takes nearly the same time.


@houstonhaynes commented on GitHub (Jan 8, 2024):

That is wild - I guess I "trust the manual" too much! I have two machines with an RTX3050 on each and haven't moved one over to have two on one machine. I was just doing some spelunking for GPU driven inference with postgresml and spotted that "deep" info from NVidia along the way. I thought it would be useful when I upgrade. I'm sorry it's not more helpful but maybe the controls "under the hood" suggested above will give you the right lever(s). I'd love to know how that turns out in case it comes calling after I put a bunch of cards in a GPU chassis! 😸


@null-dev commented on GitHub (Jan 11, 2024):

BTW you can use `CUDA_VISIBLE_DEVICES` for this, see: https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on

Unfortunately, the name of the environment variable is kinda a lie. It appears the other GPUs are still visible, just not accessible, so when `ollama` calculates the compute capability level of the GPUs, it will take into account the other GPUs. ~~This is bad, because if you have GPU 0 with compute capability X, and GPU 1 with compute capability Y and you set `CUDA_VISIBLE_DEVICES=0`, ollama will detect the compute capability as `min(X, Y)` when instead compute capability `X` is the best value.~~ **EDIT:** Nevermind, this isn't a problem because it looks like Ollama doesn't actually do anything with the detected compute capability information; it's just used to validate whether or not to use GPUs at all.


@cgint commented on GitHub (Jan 21, 2024):

Same challenge here.

`CUDA_VISIBLE_DEVICES` somehow does not work for me as a switch between models that fit onto one GPU and others that need two. I could, though, spin up two instances of `ollama` on two ports, where one has `CUDA_VISIBLE_DEVICES` set to only 'see' one device and the second instance has access to both. Then I would have to decide myself, depending on the model, which instance to connect to.

Would really be awesome if either ...

  • there was a config option for Ollama that changes the behaviour so that it does not try to balance the used VRAM over all available GPUs but, e.g., only uses one GPU if that one already has enough VRAM to hold model + context, or
  • there was an option to specify this on inference calls. `main_gpu` mentioned by @tarbard sounds like that.

Will check out if main_gpu works on my system.

Damn!
Not working with Ollama in Python - although the option is handed over in the HTTP request to the Ollama endpoint. 🤷

What I do get since activating `{'main_gpu': 1}`, though, is a log output when a model is loaded saying
`ollama[1733]: ggml_cuda_set_main_device: using device 1 (NVIDIA GeForce RTX 4060 Ti) as main device`.
But the model is still distributed across my 2 GPUs although it would fit onto one.

With my current solution I spin up another instance of `ollama` with the following command ...

```
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=0.0.0.0:22222 ollama serve
```

... and whenever I know a model fits on one GPU I connect to this port on my local machine.

Thx for the CUDA_VISIBLE_DEVICES @null-dev
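The extra-instance workaround above generalizes to the original three-GPU question: one `ollama serve` per GPU, each pinned via `CUDA_VISIBLE_DEVICES` and given its own port. A sketch (the port scheme is an assumption, and this dry-run only prints the launch commands rather than running them):

```
# One pinned instance per GPU; each would listen on its own port.
for gpu in 0 1 2; do
  port=$((11434 + gpu))
  echo "CUDA_VISIBLE_DEVICES=$gpu OLLAMA_HOST=0.0.0.0:$port ollama serve &"
done
```

A client such as Autogen would then be pointed at ports 11434, 11435, and 11436 respectively.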


@matbeedotcom commented on GitHub (Jan 27, 2024):

~~damn, I was not hoping for this outcome. Has anyone figured out how to restrict it to just one?~~ nvm, using `CUDA_VISIBLE_DEVICES` seemed to have done the trick


@Koesn commented on GitHub (Feb 25, 2024):

Why is this still unsupported? I'm running LM Studio with a tensor split of 0,35 to dedicate a GPU, so I can fully offload Mistral with 32k context to a 3060. I hope there will be a tensor split option in the Ollama Modelfile.


@dhiltgen commented on GitHub (Mar 12, 2024):

CUDA_VISIBLE_DEVICES should work. We do have a defect related to memory prediction calculations in this case tracked via #1514

If you're seeing it load onto unexpected GPUs when this variable is set, please share the server log and some more details about the setup and I'll re-open.


@jeremytregunna commented on GitHub (Mar 14, 2024):

> CUDA_VISIBLE_DEVICES should work. We do have a defect related to memory prediction calculations in this case tracked via #1514
>
> If you're seeing it load onto unexpected GPUs when this variable is set, please share the server log and some more details about the setup and I'll re-open.

![image](https://github.com/ollama/ollama/assets/261615/de90fb66-c672-4d05-9eb4-22895da3137a)

As you can see in the above image, I have 3 GPUs. 2x RTX A6000 and 1x 3070. I use the A6000s for bigger models through Ollama, and the smaller GPU I want to reserve for embedding models. However, when I start the server using the systemd config below:

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/ubuntu/.local/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="CUDA_VISIBLE_DEVICES=0,2"
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=default.target
```

Restart Ollama, and use, say, dolphin-mixtral:8x7b-v2.7-q8_0 (a model that will occupy more GPU memory than I have on any one GPU): it distributes it over devices 0 and 1 instead of 0 and 2. I can wholly confirm I did a `systemctl daemon-reload`, then a `systemctl restart ollama`, before then sending a message to the dolphin-mixtral model and watching nvtop.

So it doesn't seem as though CUDA_VISIBLE_DEVICES is working as intended. For completeness here's the output of nvidia-smi:

```
Thu Mar 14 22:51:19 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:81:00.0 Off |                    0 |
| 30%   57C    P8              22W / 300W |  43657MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3070        Off | 00000000:C1:00.0 Off |                  N/A |
|  0%   47C    P8              22W / 270W |   5246MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:C2:00.0 Off |                  Off |
| 31%   60C    P8              28W / 300W |      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2873      C   /usr/local/bin/ollama                     43650MiB |
|    1   N/A  N/A      2873      C   /usr/local/bin/ollama                      5240MiB |
+---------------------------------------------------------------------------------------+
```

Any help would be appreciated. @dhiltgen
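As an aside on the systemd setup above: the same environment can also be applied via a drop-in override instead of editing the installed unit file. A sketch (the path is the conventional systemd override location; this dry-run only prints the file it would write):

```
# Conventional drop-in location for overriding ollama.service environment.
override=/etc/systemd/system/ollama.service.d/override.conf
cat <<EOF
# would be written to $override, followed by:
#   systemctl daemon-reload && systemctl restart ollama
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,2"
EOF
```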


@dhiltgen commented on GitHub (Mar 15, 2024):

@jeremytregunna it sounds like there might be an ordering/enumeration bug where we're not consistent with other tools. If I had to guess, I'd speculate this is some tools/libraries using PCI bus/slot, and others sorting by capability/performance.

Can you enable OLLAMA_DEBUG=1 and start up the server?

Also try CUDA_VISIBLE_DEVICES=0,1 and from what you describe, that sounds like it might get the GPU assignment right.
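One related knob worth verifying (not mentioned in the thread, so treat it as an assumption): by default CUDA enumerates devices fastest-first, while `nvidia-smi` lists them in PCI bus order, so setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` can make the indices used by `CUDA_VISIBLE_DEVICES` line up with what `nvidia-smi` shows:

```
# Force CUDA's numbering to match nvidia-smi's PCI ordering before pinning.
# (Dry-run: prints the environment ollama would be launched with.)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,2
echo "CUDA_DEVICE_ORDER=$CUDA_DEVICE_ORDER CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES ollama serve"
```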


@jeremytregunna commented on GitHub (Mar 16, 2024):

> @jeremytregunna it sounds like there might be an ordering/enumeration bug where we're not consistent with other tools. If I had to guess, I'd speculate this is some tools/libraries using PCI bus/slot, and others sorting by capability/performance.
>
> Can you enable OLLAMA_DEBUG=1 and start up the server?
>
> Also try CUDA_VISIBLE_DEVICES=0,1 and from what you describe, that sounds like it might get the GPU assignment right.

Hrmm... I've run it with debug logs on a few times, and the ordering never seems to change, it always reports the output below:

```
CUDA driver version: 535.161.07
time=2024-03-15T23:25:09.751Z level=INFO source=gpu.go:82 msg="Nvidia GPU detected"
time=2024-03-15T23:25:09.751Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[0] CUDA device name: NVIDIA RTX A6000
[0] CUDA part number: 900-5G133-0300-000
[0] CUDA S/N: 1651922013945
[0] CUDA vbios version: 94.02.5C.00.06
[0] CUDA brand: 13
[0] CUDA totalMem 48305799168
[0] CUDA usedMem 467599360
[1] CUDA device name: NVIDIA GeForce RTX 3070
[1] CUDA part number: 
nvmlDeviceGetSerial failed: 3
[1] CUDA vbios version: 94.04.67.00.3E
[1] CUDA brand: 5
[1] CUDA totalMem 8589934592
[1] CUDA usedMem 230031360
[2] CUDA device name: NVIDIA RTX A6000
[2] CUDA part number: 900-5G133-1700-000
[2] CUDA S/N: 1320722000285
[2] CUDA vbios version: 94.02.5C.00.02
[2] CUDA brand: 13
[2] CUDA totalMem 51527024640
[2] CUDA usedMem 486866944
time=2024-03-15T23:25:09.769Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-15T23:25:09.769Z level=DEBUG source=gpu.go:180 msg="cuda detected 3 devices with 92043M available memory"
```

I verified they're the same devices by looking at the serial number. I also tried `CUDA_VISIBLE_DEVICES=0,1` and `1,2` as you suggested, with no luck.

The whole log is preserved below; note this is with 0,2, but as I previously mentioned, that made no difference:

```
Mar 15 23:35:20 calgary systemd[1]: Stopping Ollama Service...
Mar 15 23:35:20 calgary systemd[1]: ollama.service: Deactivated successfully.
Mar 15 23:35:20 calgary systemd[1]: Stopped Ollama Service.
Mar 15 23:35:20 calgary systemd[1]: ollama.service: Consumed 5.777s CPU time.
Mar 15 23:35:20 calgary systemd[1]: Started Ollama Service.
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.657Z level=INFO source=images.go:806 msg="total blobs: 48"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.657Z level=INFO source=images.go:813 msg="total unused blobs removed: 0"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.658Z level=INFO source=routes.go:1110 msg="Listening on [::]:11434 (version 0.1.29)"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.658Z level=INFO source=payload_common.go:112 msg="Extracting dynamic libraries to /tmp/ollama4171821284/runners ..."
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=payload_common.go:139 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60000 cpu cpu_avx]"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=gpu.go:77 msg="Detecting GPU type"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=gpu.go:191 msg="Searching for GPU management library libnvidia-ml.so"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.317Z level=INFO source=gpu.go:237 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.07]"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.334Z level=INFO source=gpu.go:82 msg="Nvidia GPU detected"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.334Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.352Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.959Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama4171821284/runners/cuda_v11/libext_server.so"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.959Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: found 2 CUDA devices:
Mar 15 23:36:36 calgary ollama[5122]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Mar 15 23:36:36 calgary ollama[5122]:   Device 1: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from /usr/share/ollama/.ollama/models/blobs/sha256:a03abff90c35c22bb4e10be3fcb0b974525e50c5e65ce1b4db59781fc413dc2e (version GGUF V3 (latest))
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   1:                               general.name str              = cognitivecomputations
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  13:                          general.file_type u32              = 7
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
```
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type  f32:   65 tensors
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type  f16:   32 tensors
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type q8_0:  898 tensors
Mar 15 23:36:37 calgary ollama[5122]: llm_load_vocab: special tokens definition check successful ( 261/32002 ).
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: format           = GGUF V3 (latest)
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: arch             = llama
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: vocab type       = SPM
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_vocab          = 32002
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_merges         = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_ctx_train      = 32768
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd           = 4096
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_head           = 32
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_head_kv        = 8
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_layer          = 32
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_rot            = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_head_k    = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_head_v    = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_gqa            = 4
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_k_gqa     = 1024
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_v_gqa     = 1024
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_ff             = 14336
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_expert         = 8
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_expert_used    = 2
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: pooling type     = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope type        = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope scaling     = linear
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: freq_base_train  = 1000000.0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: freq_scale_train = 1
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope_finetuned   = unknown
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model type       = 7B
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model ftype      = Q8_0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model params     = 46.70 B
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model size       = 46.22 GiB (8.50 BPW)
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: general.name     = cognitivecomputations
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: BOS token        = 1 '<s>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: UNK token        = 0 '<unk>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_tensors: ggml ctx size =    1.14 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloading 32 repeating layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloading non-repeating layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloaded 33/33 layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:        CPU buffer size =   132.82 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:      CUDA0 buffer size = 42647.22 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:      CUDA1 buffer size =  4544.62 MiB
Mar 15 23:36:48 calgary ollama[5122]: ....................................................................................................
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: n_ctx      = 2048
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: freq_base  = 1000000.0
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: freq_scale = 1
Mar 15 23:36:48 calgary ollama[5122]: llama_kv_cache_init:      CUDA0 KV buffer size =   232.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_kv_cache_init:      CUDA1 KV buffer size =    24.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:      CUDA0 compute buffer size =   184.03 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:      CUDA1 compute buffer size =   192.01 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: graph splits (measure): 3
Mar 15 23:36:48 calgary ollama[5122]: loading library /tmp/ollama4171821284/runners/cuda_v11/libext_server.so
Mar 15 23:36:48 calgary ollama[5122]: {"function":"initialize","level":"INFO","line":440,"msg":"initializing slots","n_slots":1,"tid":"137734259725888","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"initialize","level":"INFO","line":449,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"137734259725888","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: time=2024-03-15T23:36:48.328Z level=INFO source=dyn_ext_server.go:162 msg="Starting llama main loop"
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1590,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1821,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":111,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1848,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     781.86 ms /   111 tokens (    7.04 ms per token,   141.97 tokens per second)","n_prompt_tokens_processed":111,"n_tokens_second":141.96842417607155,"slot_id":0,"t_prompt_processing":781.864,"t_token":7.04381981981982,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =   10352.39 ms /   327 runs   (   31.66 ms per token,    31.59 tokens per second)","n_decoded":327,"n_tokens_second":31.586915019027494,"slot_id":0,"t_token":31.65867889908257,"t_token_generation":10352.388,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":281,"msg":"          total time =   11134.25 ms","slot_id":0,"t_prompt_processing":781.864,"t_token_generation":10352.388,"t_total":11134.252,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1652,"msg":"slot released","n_cache_tokens":438,"n_ctx":2048,"n_past":437,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545819,"truncated":false}
Mar 15 23:36:59 calgary ollama[5122]: [GIN] 2024/03/15 - 23:36:59 | 200 | 23.883120028s |      10.7.14.22 | POST     "/api/chat"
Mar 15 23:36:59 calgary ollama[5122]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1821,"msg":"slot progression","n_past":21,"n_past_se":0,"n_prompt_tokens_processed":131,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1848,"msg":"kv cache rm [p0, end)","p0":21,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     836.71 ms /   131 tokens (    6.39 ms per token,   156.57 tokens per second)","n_prompt_tokens_processed":131,"n_tokens_second":156.56578332490747,"slot_id":0,"t_prompt_processing":836.709,"t_token":6.387091603053435,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =     190.45 ms /     7 runs   (   27.21 ms per token,    36.75 tokens per second)","n_decoded":7,"n_tokens_second":36.75486083034482,"slot_id":0,"t_token":27.207285714285714,"t_token_generation":190.451,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":281,"msg":"          total time =    1027.16 ms","slot_id":0,"t_prompt_processing":836.709,"t_token_generation":190.451,"t_total":1027.1599999999999,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1652,"msg":"slot released","n_cache_tokens":159,"n_ctx":2048,"n_past":158,"n_system_tokens":0,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545820,"truncated":false}
Mar 15 23:37:00 calgary ollama[5122]: [GIN] 2024/03/15 - 23:37:00 | 200 |   1.02968349s |      10.7.14.22 | POST     "/api/generate"

@dhiltgen commented on GitHub (Mar 18, 2024):

@jeremytregunna looking back on the screenshot you posted above, I think the problem may be a result of how your cards are plugged into your PCIe slots. I believe you have one of the A6000s and the 3070 in the PCIe gen 4 x16 slots, but the other A6000 is in an older/slower PCIe gen 1 x16 slot. If you put both A6000s into the gen 4 slots and the 3070 into the gen 1 slot, perhaps things will be selected properly.


@jeremytregunna commented on GitHub (Mar 18, 2024):

@jeremytregunna looking back on the screenshot you posted above, I think the problem may be a result of how your cards are plugged into your PCIe slots. I believe you have one of the A6000s and the 3070 in the PCIe gen 4 x16 slots, but the other A6000 is in an older/slower PCIe gen 1 x16 slot. If you put both A6000s into the gen 4 slots and the 3070 into the gen 1 slot, perhaps things will be selected properly.

Nope, that's not it, but you are correct in one respect. The second A6000, since it's not being used, is currently at PCIe gen 1 speeds, but if I select it specifically in some other torch code, it bumps up to PCIe 4 x16 speeds. nvtop right now reports all 3 cards at PCIe gen 1 speeds because nothing is loaded. I can assure you they're all plugged into gen 4 x16 slots.


@dhiltgen commented on GitHub (Mar 19, 2024):

Can you try setting CUDA_DEVICE_ORDER as well? The options are FASTEST_FIRST or PCI_BUS_ID.
It also looks like you can specify device UUIDs for the visible-device setting, which might help: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#id205

Use nvidia-smi -L to get the UUIDs of your GPUs.

Hopefully some combination of these will get things aligned.
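To make the UUID route concrete, here is a minimal sketch; the UUID below is a made-up placeholder (take a real one from the output of `nvidia-smi -L`), and the launch line is printed rather than executed:

```shell
# Placeholder UUID -- substitute the real value reported by `nvidia-smi -L`,
# e.g.  GPU 0: NVIDIA RTX A6000 (UUID: GPU-2f8a1e3c-...)
uuid="GPU-2f8a1e3c-0000-0000-0000-000000000000"

# Pinning by UUID sidesteps index reordering entirely; CUDA_DEVICE_ORDER
# only matters when GPUs are addressed by numeric index.
launch="CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=$uuid ollama serve"
echo "$launch"   # run this line to start Ollama pinned to that one GPU
```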


@jeremytregunna commented on GitHub (Mar 21, 2024):

Can you try setting CUDA_DEVICE_ORDER as well. Options are FASTEST_FIRST or PCI_BUS_ID It looks like you can also specify device UUIDs for the visible device setting which might help. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#id205

Use nvidia-smi -L to get the UUIDs of your GPUs.

Hopefully some combination of these will get things aligned.

Ok, this had an interesting effect. Loading dolphin-mixtral:8x7b-v2.7-q8_0 again, it splits 50%/50% on the A6000s now with FASTEST_FIRST, but it also uses about 1/4 of the memory on the 3070. I can confirm memory usage on all the GPUs is nominal before dolphin-mixtral is loaded. I essentially need to keep the 3070 out of consideration for Ollama entirely, so this won't exactly work since it'll always be in the mix.

Screenshot: https://github.com/ollama/ollama/assets/261615/9f212c31-66e6-469b-bc5a-09788e69de03


@jeremytregunna commented on GitHub (Mar 21, 2024):

@dhiltgen So I tried the explicit UUIDs with CUDA_VISIBLE_DEVICES and that works, but their GPU instance IDs do not. For now, this is resolved, but I am left wondering if Ollama can do better?


@Koesn commented on GitHub (Mar 25, 2024):

@dhiltgen Thank you, CUDA_VISIBLE_DEVICES works. Finally.


@datalee commented on GitHub (Apr 12, 2024):

mark


@datalee commented on GitHub (Apr 12, 2024):

It can also be specified like this:
CUDA_VISIBLE_DEVICES=xx OLLAMA_HOST=0.0.0.0:xxx OLLAMA_MODELS=xxx/ollama_cache ollama serve
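Scaling that one-liner up to the original question (three 3090s, one instance per GPU, each on its own port) could look like the sketch below; the loop only prints the launch commands, and the port base and cache path are assumptions:

```shell
# Hypothetical layout: GPU i serves on port 11434+i, sharing one model cache.
for i in 0 1 2; do
  port=$((11434 + i))
  cmd="CUDA_VISIBLE_DEVICES=$i OLLAMA_HOST=0.0.0.0:$port OLLAMA_MODELS=/srv/ollama_cache ollama serve"
  echo "$cmd"   # append ' &' to actually launch each instance in the background
done
```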


@papandadj commented on GitHub (Apr 19, 2024):

damn. CUDA_VISIBLE_DEVICES is fine for me. thank you.


@charles-cai commented on GitHub (Apr 30, 2024):

@jeremytregunna gpustat --watch looks very cool :)
ah it's actually nvtop!


@pykeras commented on GitHub (May 8, 2024):

Automate/Easy GPU Selection for Ollama

Hi everyone,

I wanted to share a handy script I created for automating GPU selection when running Ollama. You can find the script here: https://gist.github.com/pykeras/0b1e32b92b87cdce1f7195ea3409105c. This script allows you to specify which GPU(s) Ollama should utilize, making it easier to manage resources and optimize performance.

How to Use:

  • Download the ollama_gpu_selector.sh script from the gist.
  • Make it executable: chmod +x ollama_gpu_selector.sh.
  • Run the script with administrative privileges: sudo ./ollama_gpu_selector.sh.
  • Follow the prompts to select the GPU(s) for Ollama.

Additionally, I've included aliases in the gist for easier switching between GPU selections. Feel free to customize these aliases to suit your preferences.

If you encounter any issues or have suggestions for improvement, please let me know! I hope this script helps streamline your Ollama workflow.

Happy coding!


@emourdavid commented on GitHub (May 13, 2024):

Automate/Easy GPU Selection for Ollama

Hi everyone,

I wanted to share a handy script I created for automating GPU selection when running Ollama. You can find the script here. This script allows you to specify which GPU(s) Ollama should utilize, making it easier to manage resources and optimize performance.

How to Use:

  • Download the ollama_gpu_selector.sh script from the gist.
  • Make it executable: chmod +x ollama_gpu_selector.sh.
  • Run the script with administrative privileges: sudo ./ollama_gpu_selector.sh.
  • Follow the prompts to select the GPU(s) for Ollama.

Additionally, I've included aliases in the gist for easier switching between GPU selections. Feel free to customize these aliases to suit your preferences.

If you encounter any issues or have suggestions for improvement, please let me know! I hope this script helps streamline your Ollama workflow.

Happy coding!

Thank you, I was able to run this successfully.


@pccross commented on GitHub (Oct 4, 2024):

Does CUDA_VISIBLE_DEVICES work on AMD ROCm GPUs? I tried setting it to just a single GPU (3, then 2, then 1), and it always loaded my LLMs (4 simultaneous instances of Llama3.1:8b) onto different GPUs in seemingly random fashion, when I just wanted all 4 loaded onto a single GPU (with 192GB VRAM).


@jeremytregunna commented on GitHub (Oct 4, 2024):

Does the CUDA_VISIBLE_DEVICES work on AMD ROCm GPU's? I tried setting it to just a single GPU (3, then 2, then 1), and it always loaded my LLM's (4 simultaneous instances of Llama3.1:8b) to different GPU's in what seemed random fashion, when I just wanted the 4 loaded to a single GPU (with 192GB VRAM).

No, because AMD GPUs don't use CUDA. But you can get the right env var for you here: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html

Though I should note, I'm not sure how this interacts with Ollama because I don't use AMD GPUs, but if it works like the CUDA env vars do, it should "just work".
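For reference, the variables from that AMD page are ROCR_VISIBLE_DEVICES (ROCm runtime level) and HIP_VISIBLE_DEVICES (HIP level); a hedged sketch of a pinned launch follows, printed rather than executed, and the device index is an assumption:

```shell
# Assumption: device index 0 is the ROCm card you want Ollama on.
rocm_launch="HIP_VISIBLE_DEVICES=0 OLLAMA_HOST=0.0.0.0:11434 ollama serve"
echo "$rocm_launch"   # run on the ROCm box; Ollama should then see only that GPU
```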


@AlessandroBorges commented on GitHub (Oct 6, 2024):

Ok this had an interesting effect. Loading dolphin-mixtral:8x7b-v2.7-q8_0 again, it splits 50%/50% on the A6000s now with FASTEST_FIRST, but it also uses about 1/4 of memory on the 3070. I can confirm all memory usage on all the GPUs is nominal before dolphin-mixtral is loaded. I essentially need to keep tho 3070 out of consideration for ollama entirely, so this won't exactly work since it'll always be in the mix.


@jeremytregunna I think the odd one out here is the RTX 3070 8GB, especially when paired with two "800-pound gorillas" like the A6000 48GB. Unless you're in desperate need of that extra 8GB, it's probably better to remove the 3070 and let the pair of A6000s work together seamlessly. You can put this 3070 in another PC and use it to run embeddings.


@jeremytregunna commented on GitHub (Oct 7, 2024):

@jeremytregunna I think the odd one out here is the RTX 3070 8GB, especially when paired with two "800-pound gorillas" like the A6000 48GB. Unless you're in desperate need of that extra 8GB, it's probably better to remove the 3070 and let the pair of A6000s work together seamlessly. You can put this 3070 in another PC and use it to run embeddings.

Even if that's true (and removing that GPU certainly worked around the problem), it highlighted a bug in the NVIDIA drivers. It's an easy assumption that all GPUs will be the same, but that's not always true. In my case, the A6000s were used for LLM inference, and the 3070 was used for embedding models outside of Ollama. I've since moved the embedding work off the A6000 nodes, but the issue stood. Anyway, the UUIDs worked and the indexes didn't.


@PiDevi commented on GitHub (Oct 17, 2024):

I recently faced a similar challenge while managing multiple CUDA GPUs on my Windows machine. After thorough research, I discovered a convenient method for selectively enabling which GPUs are visible to specific programs.

Allow Specific GPU Access for Programs:

For users of Windows machines with NVIDIA CUDA GPUs, the NVIDIA Control Panel offers a graphical interface for program-specific GPU allocation. Open the NVIDIA Control Panel, navigate to 'Manage 3D Settings', switch to the 'Program Settings' tab, and select the desired program. Under the 'CUDA - GPUs' section, choose the GPU or list of GPUs to allocate to that program. Click 'Apply' and restart the program, e.g. Ollama.exe. For image-generation UIs, you need to select the specific python.exe used by that UI's installation (e.g. C:\ForgeUI\system\python\python.exe).

My Configuration:

In my setup, I have a 2060 (8GB) and two older P40s (24GB each). I use Ollama in parallel with two image-generation UIs (Easy Diffusion and ForgeUI). Ollama loads onto one of my P40s, ForgeUI uses the 2060, and Easy Diffusion gets the second P40.

CUDA_VISIBLE_DEVICES Parameter:

It's important to note that CUDA_VISIBLE_DEVICES is a CUDA-level setting, not something specific to Ollama, and it can be set per process or system-wide. In my experience, setting it system-wide to a specific GPU or list of GPUs unfortunately hides all other CUDA GPUs not explicitly listed: those GPUs become unavailable to any program on the machine that relies on CUDA.
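The per-process vs. system-wide distinction is easy to demonstrate in a POSIX shell: a variable set as a command prefix is visible only to that one child process, which is why a machine-wide Windows environment variable hides GPUs from every CUDA program at once while a per-process setting does not:

```shell
# The child shell launched with the prefix sees the variable...
child=$(CUDA_VISIBLE_DEVICES=1 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')
echo "child saw: $child"

# ...while the current shell (and every other program) is unaffected.
echo "parent sees: ${CUDA_VISIBLE_DEVICES:-<unset>}"
```

The same scoping is why a service-level Environment= line for Ollama alone avoids the hiding problem.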


@YouxunYao commented on GitHub (Oct 26, 2024):

(base) PS C:\Users\11648> conda activate OllamaGPU
(OllamaGPU) PS C:\Users\11648> $env:CUDA_VISIBLE_DEVICES = "1"
(OllamaGPU) PS C:\Users\11648> Start-Process "C:\Users\11648\AppData\Local\Programs\Ollama\ollama app.exe"
(OllamaGPU) PS C:\Users\11648>
Setting the variable inside an Anaconda env this way solved the problem for me: now Ollama runs only on the specified GPU, and at the same time it doesn't affect other applications.


@mshakirDr commented on GitHub (Nov 17, 2024):

Two devices: a 4090 and an RTX Ada 2000.

  1. Use CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1 in two terminal windows.
  2. Set OLLAMA_HOST to a different port in each window.
  3. Run ollama serve in each.
  4. Run inference on both models in parallel from Python.

One model runs on the Ada 2000 (the smaller GPU); the other is partially offloaded to CPU (the RTX 4090 is apparently only used for VRAM).
The workaround above was meant to circumvent "mllama doesn't support parallel requests yet" in the Llama 3.2 Vision models, but it does not work either.


@LeeABarron commented on GitHub (Nov 21, 2024):

@dhiltgen worked with your weekend changes! thank you!

I compiled with make CUSTOM_CPU_FLAGS="" -j 5 cuda_v12 CUDA_12_PATH=/usr/local/cuda-12.5


@aviupa commented on GitHub (Feb 1, 2025):

Well, if you still weren't able to do it, here's how I did it.
I switched to Ollama Docker: https://github.com/valiantlynx/ollama-docker
I installed and ran everything per the documentation at that link, using docker-compose, then changed docker-compose-ollama-gpu.yaml to:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities: [gpu]
          device_ids: ["2"]

Ran the containers with docker-compose, which used the 3rd GPU successfully.
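For comparison, here is a hypothetical plain docker run equivalent of that compose stanza, using Docker's --gpus device syntax (printed here rather than executed; the volume name is an assumption):

```shell
# One-off container pinned to GPU index 2, publishing the default Ollama port.
docker_cmd='docker run -d --gpus device=2 -p 11434:11434 -v ollama:/root/.ollama ollama/ollama'
echo "$docker_cmd"
```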


@ohpage commented on GitHub (Jun 18, 2025):

Solved at last.
I have 2 GPUs (cuda 0: RTX 3090/24GB, cuda 1: RTX 3060/12GB) in my PC and wanted to pin Ollama to cuda 1.
The model is gemma3:12b q4 (8.1GB).

  1. $ systemctl stop ollama
  2. Set Environment="CUDA_VISIBLE_DEVICES=1" in ollama.service.
  3. $ systemctl start ollama
  4. $ ollama run gemma3:12b
  5. $ nvidia-smi (check)

If something goes wrong after a reboot, I'll remove this comment.
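On a systemd install, step 2 usually means a drop-in override rather than editing the unit file directly; a sketch of what sudo systemctl edit ollama could contain (the path in the comment is where systemctl edit places overrides by default):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="CUDA_VISIBLE_DEVICES=1"
```

Follow it with sudo systemctl daemon-reload and sudo systemctl restart ollama so the service picks up the variable.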


@akaghzi commented on GitHub (Jun 30, 2025):

Solved at last. I have 2 GPUs (cuda 0: RTX 3090/24GB, cuda 1: RTX 3060/12GB) in my PC and wanted to pin Ollama to cuda 1. The model is gemma3:12b q4 (8.1GB).

  1. $ systemctl stop ollama
  2. Set Environment="CUDA_VISIBLE_DEVICES=1" in ollama.service.
  3. $ systemctl start ollama
  4. $ ollama run gemma3:12b
  5. $ nvidia-smi (check)

If something goes wrong after a reboot, I'll remove this comment.

Worked for me on Ubuntu 24.04.


@Zabadeus commented on GitHub (Aug 7, 2025):

On Windows I fixed it by adding a new user variable (under "Environment Variables"):

Name: LLAMA_CUDA_FORCE
Value: 1

forcing the system to use my main (second) GPU when running llama.cpp.


@xxDoman commented on GitHub (Nov 24, 2025):

Guide: AMD MI50 + RTX 4070 on Ubuntu 24.04 (Ollama dual-GPU)
Hardware requirements:
Motherboard: MSI PRO B760-P WIFI DDR4 (requires kernel-parameter patching in GRUB).

GPU 1 (AI): AMD Radeon Instinct MI50 32GB.

GPU 2 (display): NVIDIA GeForce RTX 4070.

STEP 1: Install the system and initial drivers
Install Ubuntu 24.04 LTS.

KEY POINT: during installation, tick the option:

"Install third-party software for graphics and Wi-Fi hardware".

Why: this installs initial drivers that we will later swap out or disable, but it gives the system a working baseline.

STEP 2: Configure GRUB (mandatory for the MI50)
The B760 board does not handle the server-class MI50 correctly without forced kernel parameters.

Open a terminal and edit the GRUB file:

sudo nano /etc/default/grub

Find the GRUB_CMDLINE_LINUX_DEFAULT line and replace it with exactly this:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ignore_crat=1 amdgpu.exp_hw_support=1 iommu=pt"

Save (Ctrl+O, Enter) and exit (Ctrl+X).

Update GRUB:

sudo update-grub

STEP 3: The NVIDIA trick (switch to X11/Nouveau)
We need to "blind" the system to the NVIDIA card before installing Ollama so that the installer detects only the AMD card. We do not uninstall the drivers, we just switch to the safe ones.

Open the Software & Updates application.

Go to the Additional Drivers tab.

Find the NVIDIA card in the list.

Select the last option:

"Using X.Org X server -- Nouveau display driver (open source)"

Click Apply Changes.

RESTART THE COMPUTER.

After the restart, the NVIDIA card disappears from CUDA's view and the system runs on the basic display driver.

STEP 4: Install Ollama (specific version)
Install version 0.12.3 specifically, which ships a ROCm library stack compatible with this configuration.

In a terminal:

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.3 sh

Expected result: the script downloads the AMD package, detects the card, and prints "AMD GPU ready".

STEP 5: Configure the service (GPU isolation)
So that Ollama always uses the MI50, even once we restore the NVIDIA driver, we need to add the following configuration.

Edit the Ollama service:

sudo systemctl edit ollama

Paste the section below (underneath the comment markers):

[Service]
# 1. Force the ROCm engine
Environment="OLLAMA_LLM_LIBRARY=rocm"
# 2. Explicitly select the AMD card (the MI50 usually has ID 0 in compute mode)
Environment="HIP_VISIBLE_DEVICES=0"
# 3. Hide the NVIDIA card from Ollama (CUDA off)
Environment="CUDA_VISIBLE_DEVICES=-1"

Save and exit (Ctrl+O, Enter, Ctrl+X).

Reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

STEP 6: Restore NVIDIA (for desktop/gaming)
Now that Ollama is locked onto AMD, we can restore the full graphics performance of the RTX 4070.

Open Software & Updates > Additional Drivers again.

For the NVIDIA card, select the latest proprietary driver (e.g. nvidia-driver-535 or 550, the one marked "proprietary").

Click Apply Changes.

RESTART THE COMPUTER.

STEP 7: Final verification
Launch Mission Center (or btop).

Run a model:

ollama run llama3

Observe:

The desktop runs smoothly on the RTX 4070.

The model loads into VRAM on the AMD MI50 (load and memory spike on the AMD GPU).

Done. You have a hybrid AI/gaming system.

Screenshots:
https://github.com/user-attachments/assets/9bdae0a0-6d4d-4b25-a430-fcfc52f50924
https://github.com/user-attachments/assets/9170308f-f09d-4615-88f1-8bb8ac3ab569
https://github.com/user-attachments/assets/7259b988-3316-443f-a5c8-f27902fbd5c1
https://github.com/user-attachments/assets/280871e4-ca49-4315-b3a7-924e0aee4649
<!-- gh-comment-id:3572218061 --> @xxDoman commented on GitHub (Nov 24, 2025):

**Guide: AMD MI50 + RTX 4070 on Ubuntu 24.04 (Ollama dual-GPU)**

Hardware requirements:

- Motherboard: MSI PRO B760-P WIFI DDR4 (requires GRUB patching).
- GPU 1 (AI): AMD Radeon Instinct MI50 32GB.
- GPU 2 (display): NVIDIA GeForce RTX 4070.

**STEP 1: Install the OS and initial drivers**

Install Ubuntu 24.04 LTS. CRITICAL: during installation, tick the option "Install third-party software for graphics and Wi-Fi hardware". Why: this installs initial drivers that we will later swap out or disable, but it gives the system a working baseline.

**STEP 2: Configure GRUB (mandatory for the MI50)**

The B760 board does not handle the server-grade MI50 correctly without forced kernel parameters. Open a terminal and edit the GRUB config:

```bash
sudo nano /etc/default/grub
```

Find the `GRUB_CMDLINE_LINUX_DEFAULT` line and replace it with exactly:

```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ignore_crat=1 amdgpu.exp_hw_support=1 iommu=pt"
```

Save (Ctrl+O, Enter), exit (Ctrl+X), then update GRUB:

```bash
sudo update-grub
```

**STEP 3: The "NVIDIA trick" (switch to X11/Nouveau)**

We need to "blind" the system to the NVIDIA card before installing Ollama, so that the installer detects only the AMD GPU. We do not uninstall the NVIDIA drivers, only switch to a safe one.

1. Open the Software & Updates application.
2. Go to the Additional Drivers tab.
3. Find the NVIDIA card in the list.
4. Select the last option: "Using X.Org X server -- Nouveau display driver (open source)".
5. Click Apply Changes.
6. REBOOT THE COMPUTER.

After the reboot the NVIDIA card disappears from CUDA resources and the system runs on the basic display driver.

**STEP 4: Install Ollama (specific version)**

Install version 0.12.3, which ships a ROCm library stack compatible with this configuration. In a terminal:

```bash
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.3 sh
```

Expected result: the script downloads the AMD package, detects the card, and prints "AMD GPU ready".

**STEP 5: Configure the service (GPU isolation)**

So that Ollama always uses the MI50, even after the NVIDIA driver is restored, add the configuration you tested. Edit the Ollama service:

```bash
sudo systemctl edit ollama
```

Paste the following section (below the comment markers):

```ini
[Service]
# 1. Force the ROCm engine
Environment="OLLAMA_LLM_LIBRARY=rocm"
# 2. Point specifically at the AMD card (the MI50 is usually compute device 0)
Environment="HIP_VISIBLE_DEVICES=0"
# 3. Hide NVIDIA from Ollama (CUDA off)
Environment="CUDA_VISIBLE_DEVICES=-1"
```

Save and exit (Ctrl+O, Enter, Ctrl+X). Reload and restart the service:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

**STEP 6: Restore NVIDIA (for desktop/gaming)**

Now that Ollama is locked onto AMD, we can restore full graphics performance on the RTX 4070:

1. Open Software & Updates > Additional Drivers again.
2. For the NVIDIA card, select the latest proprietary driver (e.g. nvidia-driver-535 or 550, the one marked "(proprietary)").
3. Click Apply Changes.
4. REBOOT THE COMPUTER.

✅ **STEP 7: Final verification**

Launch Mission Center (or btop), then run a model:

```bash
ollama run llama3
```

Observe: the desktop runs smoothly on the RTX 4070, while the model loads into VRAM on the AMD MI50 (load and memory usage spike on the AMD GPU).

Done. You now have a hybrid AI/gaming system.
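The effect of the two visibility variables used in the service override can be sanity-checked per process, without touching systemd. A minimal sketch (HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are the real ROCm/CUDA variables; the `env | grep` pipeline just shows what a child process such as `ollama serve` would inherit):

```shell
# Show what a child process launched with these settings would inherit:
# HIP_VISIBLE_DEVICES=0 exposes only HIP (AMD) device 0 to ROCm apps,
# CUDA_VISIBLE_DEVICES=-1 hides every NVIDIA GPU from CUDA apps.
HIP_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=-1 env | grep VISIBLE_DEVICES
```

Running the same `env | grep` against the live service's environment (e.g. via `systemctl show ollama --property=Environment`) confirms the override actually took effect after the daemon-reload.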
<img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/9bdae0a0-6d4d-4b25-a430-fcfc52f50924" /> <img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/9170308f-f09d-4615-88f1-8bb8ac3ab569" /> <img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/7259b988-3316-443f-a5c8-f27902fbd5c1" /> <img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/280871e4-ca49-4315-b3a7-924e0aee4649" />
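Coming back to the original question (one Ollama instance per GPU on 3x RTX 3090), the same env-var isolation pattern generalizes: run one service per GPU, each with its own device mask and port. A hypothetical sketch of one of three systemd drop-in overrides; the unit name `ollama-gpu1` and the port are illustrative and not from this thread, while `OLLAMA_HOST` and `CUDA_VISIBLE_DEVICES` are real, documented variables:

```ini
# Hypothetical drop-in for a second instance, e.g.
# /etc/systemd/system/ollama-gpu1.service.d/override.conf
# (assumes a copy of the ollama unit named ollama-gpu1; repeat the pattern
# with device/port pairs 0/11434, 1/11435, 2/11436 for the three GPUs)
[Service]
# Expose only the second RTX 3090 to this instance
Environment="CUDA_VISIBLE_DEVICES=1"
# Serve this instance on its own port
Environment="OLLAMA_HOST=127.0.0.1:11435"
```

Clients (such as Autogen) then target each instance by its port.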
Reference: github-starred/ollama#63072