[GH-ISSUE #1813] How to run Ollama only on a dedicated GPU? (Instead of all GPUs) #63072

Closed
opened 2026-05-03 11:39:09 -05:00 by GiteaMirror · 42 comments
Owner

Originally created by @sthufnagl on GitHub (Jan 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1813

Originally assigned to: @dhiltgen on GitHub.

Hi,

I have 3x 3090s and I want to run an Ollama instance on a dedicated GPU only. The reason: to have three Ollama instances (on different ports) for use with Autogen.
I also tried the "Docker Ollama" without luck.
Or is there another solution?

Let me know...

Thanks in advance

Steve

GiteaMirror added the gpu label 2026-05-03 11:39:09 -05:00

@Tomatcree01 commented on GitHub (Jan 5, 2024):

You could give me the other two


@sthufnagl commented on GitHub (Jan 6, 2024):

:-)


@sthufnagl commented on GitHub (Jan 6, 2024):

Could it be that the number of GPUs used by Ollama is related to the model?
At https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md they mention a "num_gpu" parameter.
==> Do I have to create a new Modelfile from an existing model and include this parameter?
Still searching....


@tarbard commented on GitHub (Jan 6, 2024):

> Could it be that the numbers of GPUs used with Ollama is related to the model? At the page https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md they mentioned a "num_gpu" parameter.

That's just the number of layers. I don't think there's a way to control GPU affinity but I would also like to do this. Another issue for me is it is automatically splitting a model between 2 GPUs even though it would fit on a single GPU (which would be faster) so I would like to just make it use the one with bigger VRAM.
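For reference, a minimal sketch of what `num_gpu` in a Modelfile actually does: it sets how many layers to offload to GPU, not which device gets used (the model name and layer count below are just example values):

```
# num_gpu controls the layer offload count, not GPU selection.
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_gpu 20
EOF
cat Modelfile
# then register it with: ollama create llama2-20layers -f Modelfile
```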


@tarbard commented on GitHub (Jan 6, 2024):

I tried a bit of research - it seems the relevant llama options are

```
-mg i, --main-gpu i: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.

-ts SPLIT, --tensor-split SPLIT: When using multiple GPUs this option controls how large tensors should be split across all GPUs. SPLIT is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
```
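As a worked example of the `--tensor-split` arithmetic: each value is divided by the sum of all values, so "3,2" yields 3/5 and 2/5. The awk one-liner below is only an illustration of that math:

```
# "3,2" -> GPU 0 gets 3/(3+2) = 60%, GPU 1 gets 2/(3+2) = 40% of the data.
echo "3,2" | awk -F, '{
  total = 0
  for (i = 1; i <= NF; i++) total += $i
  for (i = 1; i <= NF; i++) printf "GPU %d: %d%%\n", i - 1, 100 * $i / total
}'
```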

Checking the https://github.com/jmorganca/ollama/blob/main/docs/api.md docs, we should be able to pass main_gpu to the API, so I tried setting main_gpu to 1:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "num_keep": 5,
    "seed": 42,
    "num_predict": 100,
    "top_k": 20,
    "top_p": 0.9,
    "tfs_z": 0.5,
    "typical_p": 0.7,
    "repeat_last_n": 33,
    "temperature": 0.8,
    "repeat_penalty": 1.2,
    "presence_penalty": 1.5,
    "frequency_penalty": 1.0,
    "mirostat": 1,
    "mirostat_tau": 0.8,
    "mirostat_eta": 0.6,
    "penalize_newline": true,
    "stop": ["\n", "user:"],
    "numa": false,
    "num_ctx": 1024,
    "num_batch": 2,
    "num_gqa": 1,
    "main_gpu": 1,
    "low_vram": false,
    "f16_kv": true,
    "vocab_only": false,
    "use_mmap": true,
    "use_mlock": false,
    "embedding_only": false,
    "rope_frequency_base": 1.1,
    "rope_frequency_scale": 0.8,
    "num_thread": 8
  }
}'
```

This didn't seem to work, as the same memory split took place rather than it using only the second GPU. Maybe the option is not yet passed on to llama from ollama. I had a look at the ollama code, but I'm not familiar with Go, so I'm not sure.


@sthufnagl commented on GitHub (Jan 7, 2024):

Thx tarbard...I will check it.


@houstonhaynes commented on GitHub (Jan 7, 2024):

If you're running in three separate containers via docker you can start up each container to only be "aware" of one GPU.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html

```
docker run --gpus '"device=1,2"' \
    nvidia/cuda nvidia-smi --query-gpu=uuid --format=csv
```

@sthufnagl commented on GitHub (Jan 8, 2024):

@houstonhaynes...I had the same idea, but it doesn't work for me. Ollama, running inside Docker, takes all GPUs no matter how I use the Docker parameter "--gpus" (I also tried the ID of a GPU).
:-(
Does it work for you?

My solution now is to split/distribute the 3090s across different PCs. To my surprise, even with very old PC hardware, Ollama runs fast!
Loading a model into VRAM also takes nearly the same time.


@houstonhaynes commented on GitHub (Jan 8, 2024):

That is wild - I guess I "trust the manual" too much! I have two machines with an RTX3050 on each and haven't moved one over to have two on one machine. I was just doing some spelunking for GPU driven inference with postgresml and spotted that "deep" info from NVidia along the way. I thought it would be useful when I upgrade. I'm sorry it's not more helpful but maybe the controls "under the hood" suggested above will give you the right lever(s). I'd love to know how that turns out in case it comes calling after I put a bunch of cards in a GPU chassis! 😸


@null-dev commented on GitHub (Jan 11, 2024):

BTW you can use `CUDA_VISIBLE_DEVICES` for this, see: https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on

Unfortunately, the name of the environment variable is kinda a lie. It appears the other GPUs are still visible, just not accessible, so when `ollama` calculates the compute capability level of the GPUs, it will take into account the other GPUs. ~~This is bad, because if you have GPU 0 with compute capability X, and GPU 1 with compute capability Y and you set `CUDA_VISIBLE_DEVICES=0`, ollama will detect the compute capability as `min(X, Y)` when instead compute capability `X` is the best value.~~ **EDIT:** Nevermind, this isn't a problem because it looks like Ollama doesn't actually do anything with the detected compute capability information; it's just used to validate whether or not to use GPUs at all.


@cgint commented on GitHub (Jan 21, 2024):

Same challenge here.

`CUDA_VISIBLE_DEVICES` somehow does not work for me as a switch between models that fit onto one GPU and others that need two. I could, though, spin up two instances of `ollama` on two ports, where one has `CUDA_VISIBLE_DEVICES` set to only 'see' one device and the second instance has access to both. Then I would have to decide myself, depending on the model, which instance to connect to.

Would really be awesome if either ...

  • there was a config option for Ollama that changes the behaviour so that it does not try to balance the used VRAM over all available GPUs but, e.g., only uses one GPU if that one already has enough VRAM to hold model + context, or
  • there was an option to specify this on inference calls. `main_gpu` mentioned by @tarbard sounds like that.

Will check out if main_gpu works on my system.

Damn!
Not working with Ollama in Python - although the option is handed over in the HTTP request to the Ollama endpoint. 🤷

What I do get since activating `{'main_gpu': 1}`, though, is a log output when a model is loaded saying
`ollama[1733]: ggml_cuda_set_main_device: using device 1 (NVIDIA GeForce RTX 4060 Ti) as main device`.
But the model is still distributed across my 2 GPUs although it would fit onto one.

With my current solution I spin up another instance of `ollama` with the following command ...

```
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=0.0.0.0:22222 ollama serve
```

... and whenever I know a model fits on one GPU I connect to this port on my local machine.

Thx for the CUDA_VISIBLE_DEVICES @null-dev
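The extra-instance workaround above generalizes to the original three-GPU question: one `ollama serve` per GPU, each pinned via `CUDA_VISIBLE_DEVICES` and given its own port. A sketch (the port scheme is an assumption, and this dry-run only prints the launch commands rather than running them):

```
# One pinned instance per GPU; each would listen on its own port.
for gpu in 0 1 2; do
  port=$((11434 + gpu))
  echo "CUDA_VISIBLE_DEVICES=$gpu OLLAMA_HOST=0.0.0.0:$port ollama serve &"
done
```

A client such as Autogen would then be pointed at ports 11434, 11435, and 11436 respectively.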


@matbeedotcom commented on GitHub (Jan 27, 2024):

~~damn, I was not hoping for this outcome. Has anyone figured out how to restrict it to just one?~~ nvm, using `CUDA_VISIBLE_DEVICES` seemed to have done the trick


@Koesn commented on GitHub (Feb 25, 2024):

Why is this still unsupported? I'm running LM Studio with a tensor split of 0,35 to dedicate a GPU, so I can fully offload Mistral with 32k context to a 3060. I hope there will be a tensor split option in the Ollama Modelfile.


@dhiltgen commented on GitHub (Mar 12, 2024):

CUDA_VISIBLE_DEVICES should work. We do have a defect related to memory prediction calculations in this case tracked via #1514

If you're seeing it load onto unexpected GPUs when this variable is set, please share the server log and some more details about the setup and I'll re-open.


@jeremytregunna commented on GitHub (Mar 14, 2024):

> CUDA_VISIBLE_DEVICES should work. We do have a defect related to memory prediction calculations in this case tracked via #1514
>
> If you're seeing it load onto unexpected GPUs when this variable is set, please share the server log and some more details about the setup and I'll re-open.

![image](https://github.com/ollama/ollama/assets/261615/de90fb66-c672-4d05-9eb4-22895da3137a)

As you can see in the above image, I have 3 GPUs. 2x RTX A6000 and 1x 3070. I use the A6000s for bigger models through Ollama, and the smaller GPU I want to reserve for embedding models. However, when I start the server using the systemd config below:

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/ubuntu/.local/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="CUDA_VISIBLE_DEVICES=0,2"
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=default.target
```

Restart Ollama, and use, say, dolphin-mixtral:8x7b-v2.7-q8_0 (a model that will occupy more GPU memory than I have on any one GPU): it distributes it over devices 0 and 1 instead of 0 and 2. I can wholly confirm I did a `systemctl daemon-reload`, then a `systemctl restart ollama`, before then sending a message to the dolphin-mixtral model and watching nvtop.

So it doesn't seem as though CUDA_VISIBLE_DEVICES is working as intended. For completeness here's the output of nvidia-smi:

```
Thu Mar 14 22:51:19 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:81:00.0 Off |                    0 |
| 30%   57C    P8              22W / 300W |  43657MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3070        Off | 00000000:C1:00.0 Off |                  N/A |
|  0%   47C    P8              22W / 270W |   5246MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:C2:00.0 Off |                  Off |
| 31%   60C    P8              28W / 300W |      1MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2873      C   /usr/local/bin/ollama                     43650MiB |
|    1   N/A  N/A      2873      C   /usr/local/bin/ollama                      5240MiB |
+---------------------------------------------------------------------------------------+
```

Any help would be appreciated. @dhiltgen
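As an aside on the systemd setup above: the same environment can also be applied via a drop-in override instead of editing the installed unit file. A sketch (the path is the conventional systemd override location; this dry-run only prints the file it would write):

```
# Conventional drop-in location for overriding ollama.service environment.
override=/etc/systemd/system/ollama.service.d/override.conf
cat <<EOF
# would be written to $override, followed by:
#   systemctl daemon-reload && systemctl restart ollama
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,2"
EOF
```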


@dhiltgen commented on GitHub (Mar 15, 2024):

@jeremytregunna it sounds like there might be an ordering/enumeration bug where we're not consistent with other tools. If I had to guess, I'd speculate this is some tools/libraries using PCI bus/slot, and others sorting by capability/performance.

Can you enable OLLAMA_DEBUG=1 and start up the server?

Also try CUDA_VISIBLE_DEVICES=0,1 and from what you describe, that sounds like it might get the GPU assignment right.
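One related knob worth verifying (not mentioned in the thread, so treat it as an assumption): by default CUDA enumerates devices fastest-first, while `nvidia-smi` lists them in PCI bus order, so setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` can make the indices used by `CUDA_VISIBLE_DEVICES` line up with what `nvidia-smi` shows:

```
# Force CUDA's numbering to match nvidia-smi's PCI ordering before pinning.
# (Dry-run: prints the environment ollama would be launched with.)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,2
echo "CUDA_DEVICE_ORDER=$CUDA_DEVICE_ORDER CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES ollama serve"
```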


@jeremytregunna commented on GitHub (Mar 16, 2024):

> @jeremytregunna it sounds like there might be an ordering/enumeration bug where we're not consistent with other tools. If I had to guess, I'd speculate this is some tools/libraries using PCI bus/slot, and others sorting by capability/performance.
>
> Can you enable OLLAMA_DEBUG=1 and start up the server?
>
> Also try CUDA_VISIBLE_DEVICES=0,1 and from what you describe, that sounds like it might get the GPU assignment right.

Hrmm... I've run it with debug logs on a few times, and the ordering never seems to change, it always reports the output below:

```
CUDA driver version: 535.161.07
time=2024-03-15T23:25:09.751Z level=INFO source=gpu.go:82 msg="Nvidia GPU detected"
time=2024-03-15T23:25:09.751Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[0] CUDA device name: NVIDIA RTX A6000
[0] CUDA part number: 900-5G133-0300-000
[0] CUDA S/N: 1651922013945
[0] CUDA vbios version: 94.02.5C.00.06
[0] CUDA brand: 13
[0] CUDA totalMem 48305799168
[0] CUDA usedMem 467599360
[1] CUDA device name: NVIDIA GeForce RTX 3070
[1] CUDA part number: 
nvmlDeviceGetSerial failed: 3
[1] CUDA vbios version: 94.04.67.00.3E
[1] CUDA brand: 5
[1] CUDA totalMem 8589934592
[1] CUDA usedMem 230031360
[2] CUDA device name: NVIDIA RTX A6000
[2] CUDA part number: 900-5G133-1700-000
[2] CUDA S/N: 1320722000285
[2] CUDA vbios version: 94.02.5C.00.02
[2] CUDA brand: 13
[2] CUDA totalMem 51527024640
[2] CUDA usedMem 486866944
time=2024-03-15T23:25:09.769Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-15T23:25:09.769Z level=DEBUG source=gpu.go:180 msg="cuda detected 3 devices with 92043M available memory"
```

I verified they're the same devices by looking at the serial number. I also tried `CUDA_VISIBLE_DEVICES=0,1` and `1,2` as you suggested, with no luck.

The whole log is preserved below; note this is with 0,2, but as I previously mentioned, that made no difference:

```
Mar 15 23:35:20 calgary systemd[1]: Stopping Ollama Service...
Mar 15 23:35:20 calgary systemd[1]: ollama.service: Deactivated successfully.
Mar 15 23:35:20 calgary systemd[1]: Stopped Ollama Service.
Mar 15 23:35:20 calgary systemd[1]: ollama.service: Consumed 5.777s CPU time.
Mar 15 23:35:20 calgary systemd[1]: Started Ollama Service.
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.657Z level=INFO source=images.go:806 msg="total blobs: 48"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.657Z level=INFO source=images.go:813 msg="total unused blobs removed: 0"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.658Z level=INFO source=routes.go:1110 msg="Listening on [::]:11434 (version 0.1.29)"
Mar 15 23:35:20 calgary ollama[5122]: time=2024-03-15T23:35:20.658Z level=INFO source=payload_common.go:112 msg="Extracting dynamic libraries to /tmp/ollama4171821284/runners ..."
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=payload_common.go:139 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60000 cpu cpu_avx]"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=gpu.go:77 msg="Detecting GPU type"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.312Z level=INFO source=gpu.go:191 msg="Searching for GPU management library libnvidia-ml.so"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.317Z level=INFO source=gpu.go:237 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.07]"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.334Z level=INFO source=gpu.go:82 msg="Nvidia GPU detected"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.334Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:35:24 calgary ollama[5122]: time=2024-03-15T23:35:24.352Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=gpu.go:119 msg="CUDA Compute Capability detected: 8.6"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.942Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.959Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama4171821284/runners/cuda_v11/libext_server.so"
Mar 15 23:36:35 calgary ollama[5122]: time=2024-03-15T23:36:35.959Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
Mar 15 23:36:36 calgary ollama[5122]: ggml_init_cublas: found 2 CUDA devices:
Mar 15 23:36:36 calgary ollama[5122]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Mar 15 23:36:36 calgary ollama[5122]:   Device 1: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from /usr/share/ollama/.ollama/models/blobs/sha256:a03abff90c35c22bb4e10be3fcb0b974525e50c5e65ce1b4db59781fc413dc2e (version GGUF V3 (latest))
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   1:                               general.name str              = cognitivecomputations
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  13:                          general.file_type u32              = 7
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
```
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type  f32:   65 tensors
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type  f16:   32 tensors
Mar 15 23:36:37 calgary ollama[5122]: llama_model_loader: - type q8_0:  898 tensors
Mar 15 23:36:37 calgary ollama[5122]: llm_load_vocab: special tokens definition check successful ( 261/32002 ).
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: format           = GGUF V3 (latest)
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: arch             = llama
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: vocab type       = SPM
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_vocab          = 32002
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_merges         = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_ctx_train      = 32768
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd           = 4096
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_head           = 32
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_head_kv        = 8
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_layer          = 32
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_rot            = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_head_k    = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_head_v    = 128
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_gqa            = 4
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_k_gqa     = 1024
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_embd_v_gqa     = 1024
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_ff             = 14336
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_expert         = 8
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_expert_used    = 2
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: pooling type     = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope type        = 0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope scaling     = linear
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: freq_base_train  = 1000000.0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: freq_scale_train = 1
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: rope_finetuned   = unknown
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model type       = 7B
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model ftype      = Q8_0
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model params     = 46.70 B
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: model size       = 46.22 GiB (8.50 BPW)
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: general.name     = cognitivecomputations
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: BOS token        = 1 '<s>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: UNK token        = 0 '<unk>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Mar 15 23:36:37 calgary ollama[5122]: llm_load_tensors: ggml ctx size =    1.14 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloading 32 repeating layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloading non-repeating layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors: offloaded 33/33 layers to GPU
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:        CPU buffer size =   132.82 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:      CUDA0 buffer size = 42647.22 MiB
Mar 15 23:36:42 calgary ollama[5122]: llm_load_tensors:      CUDA1 buffer size =  4544.62 MiB
Mar 15 23:36:48 calgary ollama[5122]: ....................................................................................................
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: n_ctx      = 2048
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: freq_base  = 1000000.0
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: freq_scale = 1
Mar 15 23:36:48 calgary ollama[5122]: llama_kv_cache_init:      CUDA0 KV buffer size =   232.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_kv_cache_init:      CUDA1 KV buffer size =    24.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:  CUDA_Host input buffer size   =    13.02 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:      CUDA0 compute buffer size =   184.03 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:      CUDA1 compute buffer size =   192.01 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model:  CUDA_Host compute buffer size =     8.00 MiB
Mar 15 23:36:48 calgary ollama[5122]: llama_new_context_with_model: graph splits (measure): 3
Mar 15 23:36:48 calgary ollama[5122]: loading library /tmp/ollama4171821284/runners/cuda_v11/libext_server.so
Mar 15 23:36:48 calgary ollama[5122]: {"function":"initialize","level":"INFO","line":440,"msg":"initializing slots","n_slots":1,"tid":"137734259725888","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"initialize","level":"INFO","line":449,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"137734259725888","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: time=2024-03-15T23:36:48.328Z level=INFO source=dyn_ext_server.go:162 msg="Starting llama main loop"
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1590,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1821,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":111,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:48 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1848,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545808}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     781.86 ms /   111 tokens (    7.04 ms per token,   141.97 tokens per second)","n_prompt_tokens_processed":111,"n_tokens_second":141.96842417607155,"slot_id":0,"t_prompt_processing":781.864,"t_token":7.04381981981982,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =   10352.39 ms /   327 runs   (   31.66 ms per token,    31.59 tokens per second)","n_decoded":327,"n_tokens_second":31.586915019027494,"slot_id":0,"t_token":31.65867889908257,"t_token_generation":10352.388,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":281,"msg":"          total time =   11134.25 ms","slot_id":0,"t_prompt_processing":781.864,"t_token_generation":10352.388,"t_total":11134.252,"task_id":0,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1652,"msg":"slot released","n_cache_tokens":438,"n_ctx":2048,"n_past":437,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"137729434187328","timestamp":1710545819,"truncated":false}
Mar 15 23:36:59 calgary ollama[5122]: [GIN] 2024/03/15 - 23:36:59 | 200 | 23.883120028s |      10.7.14.22 | POST     "/api/chat"
Mar 15 23:36:59 calgary ollama[5122]: {"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","ga_i":0,"level":"INFO","line":1821,"msg":"slot progression","n_past":21,"n_past_se":0,"n_prompt_tokens_processed":131,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:36:59 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1848,"msg":"kv cache rm [p0, end)","p0":21,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545819}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =     836.71 ms /   131 tokens (    6.39 ms per token,   156.57 tokens per second)","n_prompt_tokens_processed":131,"n_tokens_second":156.56578332490747,"slot_id":0,"t_prompt_processing":836.709,"t_token":6.387091603053435,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =     190.45 ms /     7 runs   (   27.21 ms per token,    36.75 tokens per second)","n_decoded":7,"n_tokens_second":36.75486083034482,"slot_id":0,"t_token":27.207285714285714,"t_token_generation":190.451,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"print_timings","level":"INFO","line":281,"msg":"          total time =    1027.16 ms","slot_id":0,"t_prompt_processing":836.709,"t_token_generation":190.451,"t_total":1027.1599999999999,"task_id":330,"tid":"137729434187328","timestamp":1710545820}
Mar 15 23:37:00 calgary ollama[5122]: {"function":"update_slots","level":"INFO","line":1652,"msg":"slot released","n_cache_tokens":159,"n_ctx":2048,"n_past":158,"n_system_tokens":0,"slot_id":0,"task_id":330,"tid":"137729434187328","timestamp":1710545820,"truncated":false}
Mar 15 23:37:00 calgary ollama[5122]: [GIN] 2024/03/15 - 23:37:00 | 200 |   1.02968349s |      10.7.14.22 | POST     "/api/generate"

@dhiltgen commented on GitHub (Mar 18, 2024):

@jeremytregunna looking back on the screenshot you posted above, I think the problem may be a result of how your cards are plugged into your PCIe slots. I believe you have one of the A6000s and the 3070 in the PCIe gen 4 x16 slots, but the other A6000 is in an older/slower PCIe gen 1 x16 slot. If you put both A6000s into the gen 4 slots and the 3070 into the gen 1 slot, perhaps things will be selected properly.


@jeremytregunna commented on GitHub (Mar 18, 2024):

@jeremytregunna looking back on the screenshot you posted above, I think the problem may be a result of how your cards are plugged into your PCIe slots. I believe you have one of the A6000s and the 3070 in the PCIe gen 4 x16 slots, but the other A6000 is in an older/slower PCIe gen 1 x16 slot. If you put both A6000s into the gen 4 slots and the 3070 into the gen 1 slot, perhaps things will be selected properly.

Nope, that's not it, but you are correct in one respect. The second A6000, since it's not being used, is currently at PCIe gen 1 speeds, but if I select it specifically in some other torch code, it bumps up to PCIe 4 x16 speeds. nvtop right now reports all 3 cards at PCIe gen 1 speeds because nothing is loaded. I can assure you they're all plugged into gen 4 x16 slots.


@dhiltgen commented on GitHub (Mar 19, 2024):

Can you try setting CUDA_DEVICE_ORDER as well? The options are FASTEST_FIRST or PCI_BUS_ID.
It also looks like you can specify device UUIDs for the visible-device setting, which might help: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#id205

Use nvidia-smi -L to get the UUIDs of your GPUs.

Hopefully some combination of these will get things aligned.
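To make the UUID route concrete, here is a minimal sketch; the UUID below is a made-up placeholder (take a real one from the output of `nvidia-smi -L`), and the launch line is printed rather than executed:

```shell
# Placeholder UUID -- substitute the real value reported by `nvidia-smi -L`,
# e.g.  GPU 0: NVIDIA RTX A6000 (UUID: GPU-2f8a1e3c-...)
uuid="GPU-2f8a1e3c-0000-0000-0000-000000000000"

# Pinning by UUID sidesteps index reordering entirely; CUDA_DEVICE_ORDER
# only matters when GPUs are addressed by numeric index.
launch="CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=$uuid ollama serve"
echo "$launch"   # run this line to start Ollama pinned to that one GPU
```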


@jeremytregunna commented on GitHub (Mar 21, 2024):

Can you try setting CUDA_DEVICE_ORDER as well. Options are FASTEST_FIRST or PCI_BUS_ID It looks like you can also specify device UUIDs for the visible device setting which might help. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#id205

Use nvidia-smi -L to get the UUIDs of your GPUs.

Hopefully some combination of these will get things aligned.

Ok, this had an interesting effect. Loading dolphin-mixtral:8x7b-v2.7-q8_0 again, it splits 50%/50% on the A6000s now with FASTEST_FIRST, but it also uses about 1/4 of the memory on the 3070. I can confirm memory usage on all the GPUs is nominal before dolphin-mixtral is loaded. I essentially need to keep the 3070 out of consideration for Ollama entirely, so this won't exactly work since it'll always be in the mix.

Screenshot: https://github.com/ollama/ollama/assets/261615/9f212c31-66e6-469b-bc5a-09788e69de03


@jeremytregunna commented on GitHub (Mar 21, 2024):

@dhiltgen So I tried the explicit UUIDs with CUDA_VISIBLE_DEVICES and that works, but their GPU instance IDs do not. For now, this is resolved, but I am left wondering if Ollama can do better?


@Koesn commented on GitHub (Mar 25, 2024):

@dhiltgen Thank you, CUDA_VISIBLE_DEVICES works. Finally.


@datalee commented on GitHub (Apr 12, 2024):

mark


@datalee commented on GitHub (Apr 12, 2024):

It can also be specified like this:
CUDA_VISIBLE_DEVICES=xx OLLAMA_HOST=0.0.0.0:xxx OLLAMA_MODELS=xxx/ollama_cache ollama serve
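Scaling that one-liner up to the original question (three 3090s, one instance per GPU, each on its own port) could look like the sketch below; the loop only prints the launch commands, and the port base and cache path are assumptions:

```shell
# Hypothetical layout: GPU i serves on port 11434+i, sharing one model cache.
for i in 0 1 2; do
  port=$((11434 + i))
  cmd="CUDA_VISIBLE_DEVICES=$i OLLAMA_HOST=0.0.0.0:$port OLLAMA_MODELS=/srv/ollama_cache ollama serve"
  echo "$cmd"   # append ' &' to actually launch each instance in the background
done
```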


@papandadj commented on GitHub (Apr 19, 2024):

damn. CUDA_VISIBLE_DEVICES is fine for me. thank you.


@charles-cai commented on GitHub (Apr 30, 2024):

@jeremytregunna gpustat --watch looks very cool :)
ah it's actually nvtop!


@pykeras commented on GitHub (May 8, 2024):

Automate/Easy GPU Selection for Ollama

Hi everyone,

I wanted to share a handy script I created for automating GPU selection when running Ollama. You can find the script here: https://gist.github.com/pykeras/0b1e32b92b87cdce1f7195ea3409105c. This script allows you to specify which GPU(s) Ollama should utilize, making it easier to manage resources and optimize performance.

How to Use:

  • Download the ollama_gpu_selector.sh script from the gist.
  • Make it executable: chmod +x ollama_gpu_selector.sh.
  • Run the script with administrative privileges: sudo ./ollama_gpu_selector.sh.
  • Follow the prompts to select the GPU(s) for Ollama.

Additionally, I've included aliases in the gist for easier switching between GPU selections. Feel free to customize these aliases to suit your preferences.

If you encounter any issues or have suggestions for improvement, please let me know! I hope this script helps streamline your Ollama workflow.

Happy coding!


@emourdavid commented on GitHub (May 13, 2024):

Automate/Easy GPU Selection for Ollama

Hi everyone,

I wanted to share a handy script I created for automating GPU selection when running Ollama. You can find the script here. This script allows you to specify which GPU(s) Ollama should utilize, making it easier to manage resources and optimize performance.

How to Use:

  • Download the ollama_gpu_selector.sh script from the gist.
  • Make it executable: chmod +x ollama_gpu_selector.sh.
  • Run the script with administrative privileges: sudo ./ollama_gpu_selector.sh.
  • Follow the prompts to select the GPU(s) for Ollama.

Additionally, I've included aliases in the gist for easier switching between GPU selections. Feel free to customize these aliases to suit your preferences.

If you encounter any issues or have suggestions for improvement, please let me know! I hope this script helps streamline your Ollama workflow.

Happy coding!

Thank you, I was able to run this successfully.


@pccross commented on GitHub (Oct 4, 2024):

Does CUDA_VISIBLE_DEVICES work on AMD ROCm GPUs? I tried setting it to just a single GPU (3, then 2, then 1), and it always loaded my LLMs (4 simultaneous instances of Llama3.1:8b) onto different GPUs in seemingly random fashion, when I just wanted all 4 loaded onto a single GPU (with 192GB VRAM).


@jeremytregunna commented on GitHub (Oct 4, 2024):

Does the CUDA_VISIBLE_DEVICES work on AMD ROCm GPU's? I tried setting it to just a single GPU (3, then 2, then 1), and it always loaded my LLM's (4 simultaneous instances of Llama3.1:8b) to different GPU's in what seemed random fashion, when I just wanted the 4 loaded to a single GPU (with 192GB VRAM).

No, because AMD GPUs don't use CUDA. But you can get the right env var for you here: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html

Though I should note, I'm not sure how this interacts with Ollama because I don't use AMD GPUs, but if it works like the CUDA env vars do, it should "just work".
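For reference, the variables from that AMD page are ROCR_VISIBLE_DEVICES (ROCm runtime level) and HIP_VISIBLE_DEVICES (HIP level); a hedged sketch of a pinned launch follows, printed rather than executed, and the device index is an assumption:

```shell
# Assumption: device index 0 is the ROCm card you want Ollama on.
rocm_launch="HIP_VISIBLE_DEVICES=0 OLLAMA_HOST=0.0.0.0:11434 ollama serve"
echo "$rocm_launch"   # run on the ROCm box; Ollama should then see only that GPU
```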


@AlessandroBorges commented on GitHub (Oct 6, 2024):

Ok this had an interesting effect. Loading dolphin-mixtral:8x7b-v2.7-q8_0 again, it splits 50%/50% on the A6000s now with FASTEST_FIRST, but it also uses about 1/4 of memory on the 3070. I can confirm all memory usage on all the GPUs is nominal before dolphin-mixtral is loaded. I essentially need to keep tho 3070 out of consideration for ollama entirely, so this won't exactly work since it'll always be in the mix.


@jeremytregunna I think the odd one out here is the RTX 3070 8GB, especially when paired with two "800-pound gorillas" like the A6000 48GB. Unless you're in desperate need of that extra 8GB, it's probably better to remove the 3070 and let the pair of A6000s work together seamlessly. You can put this 3070 in another PC and use it to run embeddings.


@jeremytregunna commented on GitHub (Oct 7, 2024):

@jeremytregunna I think the odd one out here is the RTX 3070 8GB, especially when paired with two "800-pound gorillas" like the A6000 48GB. Unless you're in desperate need of that extra 8GB, it's probably better to remove the 3070 and let the pair of A6000s work together seamlessly. You can put this 3070 in another PC and use it to run embeddings.

Even if that's true (and removing that GPU certainly worked around the problem), it highlighted a bug in the NVIDIA drivers. It's an easy assumption that all GPUs will be the same, but that's not always true. In my case, the A6000s were used for LLM inference, and the 3070 was used for embedding models outside of Ollama. I've since moved the embedding work off the A6000 nodes, but the issue stood. Anyway, the UUIDs worked and the indexes didn't.


@PiDevi commented on GitHub (Oct 17, 2024):

I recently faced a similar challenge while managing multiple CUDA GPUs on my Windows machine. After thorough research, I discovered a convenient method for selectively enabling which GPUs are visible to specific programs.

Allow Specific GPU Access for Programs:

For users of Windows machines with NVIDIA CUDA GPUs, the NVIDIA Control Panel offers a graphical interface for program-specific GPU allocation. Open the NVIDIA Control Panel, navigate to 'Manage 3D Settings', switch to the 'Program Settings' tab, and select the desired program. Under the 'CUDA - GPUs' section, choose the GPU or list of GPUs to allocate to that program. Click 'Apply' and restart the program, e.g. Ollama.exe. For image-generation UIs, you need to select the specific python.exe used by that UI's installation (e.g. C:\ForgeUI\system\python\python.exe).

My Configuration:

In my setup, I have a 2060 (8GB) and two older P40s (24GB each). I use Ollama in parallel with two image-generation UIs (Easy Diffusion and ForgeUI). Ollama loads onto one of my P40s, ForgeUI uses the 2060, and Easy Diffusion gets the second P40.

CUDA_VISIBLE_DEVICES Parameter:

It's important to note that CUDA_VISIBLE_DEVICES is a CUDA-level setting, not something specific to Ollama, and it can be set per process or system-wide. In my experience, setting it system-wide to a specific GPU or list of GPUs unfortunately hides all other CUDA GPUs not explicitly listed: those GPUs become unavailable to any program on the machine that relies on CUDA.
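The per-process vs. system-wide distinction is easy to demonstrate in a POSIX shell: a variable set as a command prefix is visible only to that one child process, which is why a machine-wide Windows environment variable hides GPUs from every CUDA program at once while a per-process setting does not:

```shell
# The child shell launched with the prefix sees the variable...
child=$(CUDA_VISIBLE_DEVICES=1 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')
echo "child saw: $child"

# ...while the current shell (and every other program) is unaffected.
echo "parent sees: ${CUDA_VISIBLE_DEVICES:-<unset>}"
```

The same scoping is why a service-level Environment= line for Ollama alone avoids the hiding problem.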


@YouxunYao commented on GitHub (Oct 26, 2024):

(base) PS C:\Users\11648> conda activate OllamaGPU
(OllamaGPU) PS C:\Users\11648> $env:CUDA_VISIBLE_DEVICES = "1"
(OllamaGPU) PS C:\Users\11648> Start-Process "C:\Users\11648\AppData\Local\Programs\Ollama\ollama app.exe"
(OllamaGPU) PS C:\Users\11648>
Setting the variable inside an Anaconda env this way solved the problem for me: now Ollama runs only on the specified GPU, and at the same time it doesn't affect other applications.


@mshakirDr commented on GitHub (Nov 17, 2024):

Two devices: a 4090 and an RTX Ada 2000.

  1. Use CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1 in two terminal windows.
  2. Set OLLAMA_HOST to a different port in each window.
  3. Run ollama serve in each.
  4. Run inference on both models in parallel from Python.

One model runs on the Ada 2000 (the smaller GPU); the other is partially offloaded to CPU (the RTX 4090 is apparently only used for VRAM).
The workaround above was meant to circumvent "mllama doesn't support parallel requests yet" in the Llama 3.2 Vision models, but it does not work either.


@LeeABarron commented on GitHub (Nov 21, 2024):

@dhiltgen worked with your weekend changes! thank you!

I compiled with make CUSTOM_CPU_FLAGS="" -j 5 cuda_v12 CUDA_12_PATH=/usr/local/cuda-12.5


@aviupa commented on GitHub (Feb 1, 2025):

Well, if you still weren't able to do it, here's how I did it.
I switched to Ollama Docker: https://github.com/valiantlynx/ollama-docker
I installed and ran everything per the documentation at that link, using docker-compose, then changed docker-compose-ollama-gpu.yaml to:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities: [gpu]
          device_ids: ["2"]

Ran the containers with docker-compose, which used the 3rd GPU successfully.
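For comparison, here is a hypothetical plain docker run equivalent of that compose stanza, using Docker's --gpus device syntax (printed here rather than executed; the volume name is an assumption):

```shell
# One-off container pinned to GPU index 2, publishing the default Ollama port.
docker_cmd='docker run -d --gpus device=2 -p 11434:11434 -v ollama:/root/.ollama ollama/ollama'
echo "$docker_cmd"
```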


@ohpage commented on GitHub (Jun 18, 2025):

Solved at last.
I have 2 GPUs (cuda 0: RTX 3090/24GB, cuda 1: RTX 3060/12GB) in my PC and wanted to pin Ollama to cuda 1.
The model is gemma3:12b q4 (8.1GB).

  1. $ systemctl stop ollama
  2. Set Environment="CUDA_VISIBLE_DEVICES=1" in ollama.service.
  3. $ systemctl start ollama
  4. $ ollama run gemma3:12b
  5. $ nvidia-smi (check)

If something goes wrong after a reboot, I'll remove this comment.
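On a systemd install, step 2 usually means a drop-in override rather than editing the unit file directly; a sketch of what sudo systemctl edit ollama could contain (the path in the comment is where systemctl edit places overrides by default):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="CUDA_VISIBLE_DEVICES=1"
```

Follow it with sudo systemctl daemon-reload and sudo systemctl restart ollama so the service picks up the variable.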


@akaghzi commented on GitHub (Jun 30, 2025):

Solved at last. I have 2 GPUs (cuda 0: RTX 3090/24GB, cuda 1: RTX 3060/12GB) in my PC and wanted to pin Ollama to cuda 1. The model is gemma3:12b q4 (8.1GB).

  1. $ systemctl stop ollama
  2. Set Environment="CUDA_VISIBLE_DEVICES=1" in ollama.service.
  3. $ systemctl start ollama
  4. $ ollama run gemma3:12b
  5. $ nvidia-smi (check)

If something goes wrong after a reboot, I'll remove this comment.

Worked for me on Ubuntu 24.04.


@Zabadeus commented on GitHub (Aug 7, 2025):

On Windows I fixed it by adding a new user variable (under "Environment Variables"):

Name: LLAMA_CUDA_FORCE
Value: 1

forcing the system to use my main (second) GPU when running llama.cpp.


@xxDoman commented on GitHub (Nov 24, 2025):

Guide: AMD MI50 + RTX 4070 on Ubuntu 24.04 (Ollama dual-GPU)
Hardware requirements:
Motherboard: MSI PRO B760-P WIFI DDR4 (requires kernel-parameter patching in GRUB).

GPU 1 (AI): AMD Radeon Instinct MI50 32GB.

GPU 2 (display): NVIDIA GeForce RTX 4070.

STEP 1: Install the system and initial drivers
Install Ubuntu 24.04 LTS.

KEY POINT: during installation, tick the option:

"Install third-party software for graphics and Wi-Fi hardware".

Why: this installs initial drivers that we will later swap out or disable, but it gives the system a working baseline.

STEP 2: Configure GRUB (mandatory for the MI50)
The B760 board does not handle the server-class MI50 correctly without forced kernel parameters.

Open a terminal and edit the GRUB file:

sudo nano /etc/default/grub

Find the GRUB_CMDLINE_LINUX_DEFAULT line and replace it with exactly this:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ignore_crat=1 amdgpu.exp_hw_support=1 iommu=pt"

Save (Ctrl+O, Enter) and exit (Ctrl+X).

Update GRUB:

sudo update-grub

STEP 3: The NVIDIA trick (switch to X11/Nouveau)
We need to "blind" the system to the NVIDIA card before installing Ollama so that the installer detects only the AMD card. We do not uninstall the drivers, we just switch to the safe ones.

Open the Software & Updates application.

Go to the Additional Drivers tab.

Find the NVIDIA card in the list.

Select the last option:

"Using X.Org X server -- Nouveau display driver (open source)"

Click Apply Changes.

RESTART THE COMPUTER.

After the restart, the NVIDIA card disappears from CUDA's view and the system runs on the basic display driver.

STEP 4: Install Ollama (specific version)
Install version 0.12.3 specifically, which ships a ROCm library stack compatible with this configuration.

In a terminal:

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.3 sh

Expected result: the script downloads the AMD package, detects the card, and prints "AMD GPU ready".

STEP 5: Configure the service (GPU isolation)
So that Ollama always uses the MI50, even once we restore the NVIDIA driver, we need to add the following configuration.

Edit the Ollama service:

sudo systemctl edit ollama

Paste the section below (underneath the comment markers):

[Service]
# 1. Force the ROCm engine
Environment="OLLAMA_LLM_LIBRARY=rocm"
# 2. Explicitly select the AMD card (the MI50 usually has ID 0 in compute mode)
Environment="HIP_VISIBLE_DEVICES=0"
# 3. Hide the NVIDIA card from Ollama (CUDA off)
Environment="CUDA_VISIBLE_DEVICES=-1"

Save and exit (Ctrl+O, Enter, Ctrl+X).

Reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

STEP 6: Restore NVIDIA (for desktop/gaming)
Now that Ollama is locked onto AMD, we can restore the full graphics performance of the RTX 4070.

Open Software & Updates > Additional Drivers again.

For the NVIDIA card, select the latest proprietary driver (e.g. nvidia-driver-535 or 550, the one marked "proprietary").

Click Apply Changes.

RESTART THE COMPUTER.

STEP 7: Final verification
Launch Mission Center (or btop).

Run a model:

ollama run llama3

Observe:

The desktop runs smoothly on the RTX 4070.

The model loads into VRAM on the AMD MI50 (load and memory spike on the AMD GPU).

Done. You have a hybrid AI/gaming system.

Screenshots:
https://github.com/user-attachments/assets/9bdae0a0-6d4d-4b25-a430-fcfc52f50924
https://github.com/user-attachments/assets/9170308f-f09d-4615-88f1-8bb8ac3ab569
https://github.com/user-attachments/assets/7259b988-3316-443f-a5c8-f27902fbd5c1
https://github.com/user-attachments/assets/280871e4-ca49-4315-b3a7-924e0aee4649
<!-- gh-comment-id:3572218061 --> @xxDoman commented on GitHub (Nov 24, 2025):

**Guide: AMD MI50 + RTX 4070 on Ubuntu 24.04 (Ollama dual-GPU)**

Hardware requirements:

- Motherboard: MSI PRO B760-P WIFI DDR4 (requires GRUB patching).
- GPU 1 (AI): AMD Radeon Instinct MI50 32GB.
- GPU 2 (display): NVIDIA GeForce RTX 4070.

**STEP 1: Install the OS and initial drivers**

Install Ubuntu 24.04 LTS. CRITICAL: during installation, tick the option "Install third-party software for graphics and Wi-Fi hardware". Why: this installs initial drivers that we will later swap out or disable, but it gives the system a working baseline.

**STEP 2: Configure GRUB (mandatory for the MI50)**

The B760 board does not handle the server-grade MI50 correctly without forced kernel parameters. Open a terminal and edit the GRUB config:

```bash
sudo nano /etc/default/grub
```

Find the `GRUB_CMDLINE_LINUX_DEFAULT` line and replace it with exactly:

```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ignore_crat=1 amdgpu.exp_hw_support=1 iommu=pt"
```

Save (Ctrl+O, Enter), exit (Ctrl+X), then update GRUB:

```bash
sudo update-grub
```

**STEP 3: The "NVIDIA trick" (switch to X11/Nouveau)**

We need to "blind" the system to the NVIDIA card before installing Ollama, so that the installer detects only the AMD GPU. We do not uninstall the NVIDIA drivers, only switch to a safe one.

1. Open the Software & Updates application.
2. Go to the Additional Drivers tab.
3. Find the NVIDIA card in the list.
4. Select the last option: "Using X.Org X server -- Nouveau display driver (open source)".
5. Click Apply Changes.
6. REBOOT THE COMPUTER.

After the reboot the NVIDIA card disappears from CUDA resources and the system runs on the basic display driver.

**STEP 4: Install Ollama (specific version)**

Install version 0.12.3, which ships a ROCm library stack compatible with this configuration. In a terminal:

```bash
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.3 sh
```

Expected result: the script downloads the AMD package, detects the card, and prints "AMD GPU ready".

**STEP 5: Configure the service (GPU isolation)**

So that Ollama always uses the MI50, even after the NVIDIA driver is restored, add the configuration you tested. Edit the Ollama service:

```bash
sudo systemctl edit ollama
```

Paste the following section (below the comment markers):

```ini
[Service]
# 1. Force the ROCm engine
Environment="OLLAMA_LLM_LIBRARY=rocm"
# 2. Point specifically at the AMD card (the MI50 is usually compute device 0)
Environment="HIP_VISIBLE_DEVICES=0"
# 3. Hide NVIDIA from Ollama (CUDA off)
Environment="CUDA_VISIBLE_DEVICES=-1"
```

Save and exit (Ctrl+O, Enter, Ctrl+X). Reload and restart the service:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

**STEP 6: Restore NVIDIA (for desktop/gaming)**

Now that Ollama is locked onto AMD, we can restore full graphics performance on the RTX 4070:

1. Open Software & Updates > Additional Drivers again.
2. For the NVIDIA card, select the latest proprietary driver (e.g. nvidia-driver-535 or 550, the one marked "(proprietary)").
3. Click Apply Changes.
4. REBOOT THE COMPUTER.

✅ **STEP 7: Final verification**

Launch Mission Center (or btop), then run a model:

```bash
ollama run llama3
```

Observe: the desktop runs smoothly on the RTX 4070, while the model loads into VRAM on the AMD MI50 (load and memory usage spike on the AMD GPU).

Done. You now have a hybrid AI/gaming system.
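The effect of the two visibility variables used in the service override can be sanity-checked per process, without touching systemd. A minimal sketch (HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are the real ROCm/CUDA variables; the `env | grep` pipeline just shows what a child process such as `ollama serve` would inherit):

```shell
# Show what a child process launched with these settings would inherit:
# HIP_VISIBLE_DEVICES=0 exposes only HIP (AMD) device 0 to ROCm apps,
# CUDA_VISIBLE_DEVICES=-1 hides every NVIDIA GPU from CUDA apps.
HIP_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=-1 env | grep VISIBLE_DEVICES
```

Running the same `env | grep` against the live service's environment (e.g. via `systemctl show ollama --property=Environment`) confirms the override actually took effect after the daemon-reload.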
<img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/9bdae0a0-6d4d-4b25-a430-fcfc52f50924" /> <img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/9170308f-f09d-4615-88f1-8bb8ac3ab569" /> <img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/7259b988-3316-443f-a5c8-f27902fbd5c1" /> <img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/280871e4-ca49-4315-b3a7-924e0aee4649" />
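Coming back to the original question (one Ollama instance per GPU on 3x RTX 3090), the same env-var isolation pattern generalizes: run one service per GPU, each with its own device mask and port. A hypothetical sketch of one of three systemd drop-in overrides; the unit name `ollama-gpu1` and the port are illustrative and not from this thread, while `OLLAMA_HOST` and `CUDA_VISIBLE_DEVICES` are real, documented variables:

```ini
# Hypothetical drop-in for a second instance, e.g.
# /etc/systemd/system/ollama-gpu1.service.d/override.conf
# (assumes a copy of the ollama unit named ollama-gpu1; repeat the pattern
# with device/port pairs 0/11434, 1/11435, 2/11436 for the three GPUs)
[Service]
# Expose only the second RTX 3090 to this instance
Environment="CUDA_VISIBLE_DEVICES=1"
# Serve this instance on its own port
Environment="OLLAMA_HOST=127.0.0.1:11435"
```

Clients (such as Autogen) then target each instance by its port.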
Reference: github-starred/ollama#63072