[GH-ISSUE #8066] ollama 0.5.1 is detecting my NVIDIA Tesla M40, but they are not used. #51668

Closed
opened 2026-04-28 20:43:19 -05:00 by GiteaMirror · 15 comments

Originally created by @bones0 on GitHub (Dec 12, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8066

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

The ollama 0.5.1 binary distribution is recognising the Tesla M40s:

Dec 12 07:11:11 bigrig ollama[362206]: time=2024-12-12T07:11:11.731Z level=INFO source=types.go:123 msg="inference compute" id=GPU-c8f87326-45f6-945a-1a1a-63bd9a7fc262 library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="23.4 GiB"
Dec 12 07:11:11 bigrig ollama[362206]: time=2024-12-12T07:11:11.731Z level=INFO source=types.go:123 msg="inference compute" id=GPU-603a9272-c602-62ea-4090-51223189bb8f library=cuda variant=v12 compute=7.5 driver=12.2 name="Tesla T4" total="15.6 GiB" available="15.5 GiB"
Dec 12 07:11:11 bigrig ollama[362206]: time=2024-12-12T07:11:11.731Z level=INFO source=types.go:123 msg="inference compute" id=GPU-4fe5252f-aa78-f5f8-958a-5a8ae3ffe9e4 library=cuda variant=v12 compute=7.5 driver=12.2 name="Tesla T4" total="14.6 GiB" available="14.5 GiB"
Dec 12 07:11:11 bigrig ollama[362206]: time=2024-12-12T07:11:11.731Z level=INFO source=types.go:123 msg="inference compute" id=GPU-e6b47121-0d1d-8c63-1ab0-14012d5eb87f library=cuda variant=v12 compute=6.1 driver=12.2 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
Dec 12 07:11:11 bigrig ollama[362206]: time=2024-12-12T07:11:11.731Z level=INFO source=types.go:123 msg="inference compute" id=GPU-079cdcf9-556e-2f0c-6e6d-042eec929d92 library=cuda variant=v11 compute=5.2 driver=12.2 name="Tesla M40 24GB" total="23.9 GiB" available="23.8 GiB"
Dec 12 07:11:11 bigrig ollama[362206]: time=2024-12-12T07:11:11.731Z level=INFO source=types.go:123 msg="inference compute" id=GPU-41ac58ae-d8b7-afdb-c25f-ca6f09b57999 library=cuda variant=v11 compute=5.2 driver=12.2 name="Tesla M40 24GB" total="23.9 GiB" available="23.8 GiB"

But later on, only the other GPUs are used:

Dec 12 07:14:01 bigrig ollama[362206]: ggml_cuda_init: found 4 CUDA devices:
Dec 12 07:14:01 bigrig ollama[362206]:   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Dec 12 07:14:01 bigrig ollama[362206]:   Device 1: Tesla T4, compute capability 7.5, VMM: yes
Dec 12 07:14:01 bigrig ollama[362206]:   Device 2: Tesla T4, compute capability 7.5, VMM: yes
Dec 12 07:14:01 bigrig ollama[362206]:   Device 3: Tesla P40, compute capability 6.1, VMM: yes
Dec 12 07:14:01 bigrig ollama[362206]: llm_load_tensors: ggml ctx size =    2.00 MiB
Dec 12 07:14:41 bigrig ollama[362206]: llm_load_tensors: offloading 30 repeating layers to GPU
Dec 12 07:14:41 bigrig ollama[362206]: llm_load_tensors: offloaded 30/61 layers to GPU
Dec 12 07:14:41 bigrig ollama[362206]: llm_load_tensors:  CUDA_Host buffer size = 62745.29 MiB
Dec 12 07:14:41 bigrig ollama[362206]: llm_load_tensors:      CUDA0 buffer size = 21335.35 MiB
Dec 12 07:14:41 bigrig ollama[362206]: llm_load_tensors:      CUDA1 buffer size = 12801.21 MiB
Dec 12 07:14:41 bigrig ollama[362206]: llm_load_tensors:      CUDA2 buffer size = 10667.68 MiB
Dec 12 07:14:41 bigrig ollama[362206]: llm_load_tensors:      CUDA3 buffer size = 19201.82 MiB

Compiling ollama from source, explicitly setting the architectures (export CMAKE_CUDA_ARCHITECTURES="50;52;61;70;75;80;90"), does not change the behaviour.

ollama-logs-M40.txt (https://github.com/user-attachments/files/18107418/ollama-logs-M40.txt)

I am running /usr/local/bin/ollama run deepseek-coder-v2:236b for testing. This model should be big enough to make ollama fill all the GPUs, but it does not.

Image: https://github.com/user-attachments/assets/3eb96e36-2ebb-45de-9404-e7cacd822e4e

I already tried switching between different CUDA versions via update-alternatives, but to no avail.

BTW: export CUDA_VISIBLE_DEVICES=4,5 has no effect either.

I have a self-compiled llama.cpp build 2749 which does use the M40s. Since its output is nearly identical to the one in the ollama log, which only recognizes 4 CUDA devices, I suspect the problem is somehow related to the llama.cpp shipped with ollama 0.5.1. This is the output in question:

Log start
main: build = 2749 (928e0b70)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1733989418
...
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla T4, compute capability 7.5, VMM: yes
  Device 2: Tesla T4, compute capability 7.5, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
  Device 4: Tesla M40 24GB, compute capability 5.2, VMM: yes
  Device 5: Tesla M40 24GB, compute capability 5.2, VMM: yes

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.5.1

GiteaMirror added the bug label 2026-04-28 20:43:19 -05:00

@frenzybiscuit commented on GitHub (Dec 12, 2024):

compute capability 5.2

^ this is doing you no favors.

Does Ollama even support 5.2?


@rick-github commented on GitHub (Dec 12, 2024):

ollama 0.5.2-rc3 (https://github.com/ollama/ollama/releases/tag/v0.5.2-rc3) bumps to a new version of llama.cpp, does that use the M40s?


@bones0 commented on GitHub (Dec 12, 2024):

No.
ollama version is 0.5.2-rc3-0-g581a4a5-dirty

time=2024-12-12T15:42:55.845Z level=INFO source=routes.go:1247 msg="Listening on 127.0.0.1:11434 (version 0.5.2-rc3-0-g581a4a5-dirty)"
time=2024-12-12T15:42:55.845Z level=INFO source=routes.go:1276 msg="Dynamic LLM libraries" runners="[cuda_v11_avx cuda_v12_avx rocm_avx cpu cpu_avx cpu_avx2]"
time=2024-12-12T15:42:55.845Z level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2024-12-12T15:42:57.326Z level=INFO source=types.go:131 msg="inference compute" id=GPU-c8f87326-45f6-945a-1a1a-63bd9a7fc262 library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="23.4 GiB"
time=2024-12-12T15:42:57.326Z level=INFO source=types.go:131 msg="inference compute" id=GPU-603a9272-c602-62ea-4090-51223189bb8f library=cuda variant=v12 compute=7.5 driver=12.2 name="Tesla T4" total="15.6 GiB" available="15.5 GiB"
time=2024-12-12T15:42:57.326Z level=INFO source=types.go:131 msg="inference compute" id=GPU-4fe5252f-aa78-f5f8-958a-5a8ae3ffe9e4 library=cuda variant=v12 compute=7.5 driver=12.2 name="Tesla T4" total="14.6 GiB" available="14.5 GiB"
time=2024-12-12T15:42:57.326Z level=INFO source=types.go:131 msg="inference compute" id=GPU-e6b47121-0d1d-8c63-1ab0-14012d5eb87f library=cuda variant=v12 compute=6.1 driver=12.2 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"
time=2024-12-12T15:42:57.326Z level=INFO source=types.go:131 msg="inference compute" id=GPU-079cdcf9-556e-2f0c-6e6d-042eec929d92 library=cuda variant=v11 compute=5.2 driver=12.2 name="Tesla M40 24GB" total="23.9 GiB" available="23.8 GiB"
time=2024-12-12T15:42:57.326Z level=INFO source=types.go:131 msg="inference compute" id=GPU-41ac58ae-d8b7-afdb-c25f-ca6f09b57999 library=cuda variant=v11 compute=5.2 driver=12.2 name="Tesla M40 24GB" total="23.9 GiB" available="23.8 GiB"

...

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla T4, compute capability 7.5, VMM: yes
  Device 2: Tesla T4, compute capability 7.5, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
time=2024-12-12T15:43:20.589Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=16
time=2024-12-12T15:43:20.590Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:40287"

@bones0 commented on GitHub (Dec 12, 2024):

I cloned the newest version of llama.cpp, compiled it according to the instructions, and it is using the M40. Maybe the less you change, the better it works.


@bones0 commented on GitHub (Dec 12, 2024):

> compute capability 5.2
>
> ^ this is doing you no favors.

48GB VRAM.

> Does Ollama even support 5.2?

According to the documentation, it does, yes.


@rick-github commented on GitHub (Dec 12, 2024):

time=2024-12-12T15:43:20.589Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=16

5.* is not listed in the archs, so I suspect the build process is skipping over it. There's a large amount of churn in the build process at the moment, so possibly something slipped through the cracks.


@dhiltgen commented on GitHub (Dec 13, 2024):

This is likely a dup of #6930 where our current logic doesn't handle mixed compute capability versions that straddle the compatibility matrix for v12 vs v11 well.

I'll try to get the PR (https://github.com/ollama/ollama/pull/6983) rebased and updated...


@rick-github commented on GitHub (Dec 13, 2024):

Does that mean setting OLLAMA_LLM_LIBRARY=cuda_v11 would use all devices?


@dhiltgen commented on GitHub (Dec 13, 2024):

@rick-github I think the existing discovery logic will partition the GPUs by variant, and we sort by the most VRAM, so the newer GPUs will get precedence.
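
For illustration only, here is a minimal Go sketch of the partition-then-pick behaviour described above. It is not the actual ollama scheduler; the type and helper names are made up, and only the Library/Variant/FreeMemory fields loosely mirror discover.GpuInfo. With the GPU mix from this issue, the v12 partition holds the most VRAM, so the v11 M40s never make it into the chosen runner:

package main

import (
	"fmt"
	"sort"
)

// gpuInfo is a stand-in for discover.GpuInfo (hypothetical, illustrative).
type gpuInfo struct {
	Library    string
	Variant    string
	FreeMemory uint64 // bytes
}

// runnerKey groups devices by library and CUDA variant, e.g. "cuda_v11" vs "cuda_v12".
func runnerKey(g gpuInfo) string {
	if g.Variant != "" {
		return g.Library + "_" + g.Variant
	}
	return g.Library
}

func main() {
	// Device list loosely modelled on the log in this issue (sizes approximate).
	gpus := []gpuInfo{
		{"cuda", "v12", 24 << 30}, // RTX 3090
		{"cuda", "v12", 15 << 30}, // Tesla T4
		{"cuda", "v12", 14 << 30}, // Tesla T4
		{"cuda", "v12", 24 << 30}, // Tesla P40
		{"cuda", "v11", 24 << 30}, // Tesla M40 24GB
		{"cuda", "v11", 24 << 30}, // Tesla M40 24GB
	}

	// Partition by library+variant, as the pre-patch discovery logic does.
	groups := map[string][]gpuInfo{}
	for _, g := range gpus {
		groups[runnerKey(g)] = append(groups[runnerKey(g)], g)
	}

	// Pick the partition with the most free VRAM (illustrative heuristic;
	// the real selection logic may differ in detail).
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	total := func(k string) uint64 {
		var sum uint64
		for _, g := range groups[k] {
			sum += g.FreeMemory
		}
		return sum
	}
	sort.Slice(keys, func(i, j int) bool { return total(keys[i]) > total(keys[j]) })

	fmt.Printf("chosen runner: %s with %d device(s); %d device(s) left idle\n",
		keys[0], len(groups[keys[0]]), len(gpus)-len(groups[keys[0]]))
	// Output: chosen runner: cuda_v12 with 4 device(s); 2 device(s) left idle
}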


@bones0 commented on GitHub (Dec 14, 2024):

> This is likely a dup of #6930 where our current logic doesn't handle mixed compute capability versions that straddle the compatibility matrix for v12 vs v11 well.

I noticed the v11/v12 issue and tried to compile with different CUDA versions (via update-alternatives). I also tried adding 5.2 to the v12 config file. But that was just wild black-box guessing and did not work.


@bones0 commented on GitHub (Dec 14, 2024):

> @rick-github I think the existing discovery logic will partition the GPUs by variant, and we sort by the most VRAM, so the newer GPUs will get precedence.

That works. It starts with the 3090 and goes down from there. It's definitely important for the calculations.


@bones0 commented on GitHub (Dec 20, 2024):

I manually created sort of a hybrid between today's HEAD and https://github.com/ollama/ollama/pull/6983, which shows the following behaviour:

  • Starting the runners via ollama: 4 GPUs, no M40

  • Starting the runners manually (v11 or v12, both the same): --> 6 GPUs, including M40
    ./ollama_llama_server -model /usr/share/ollama/.ollama/models/blobs/sha256-e16120252a9b0e49ed8074d11838d8b0227957a09d749d18425e491243e13822

Some parameter in runner.go (or passed to runner.go) seems to change the behaviour.

We will see if this is still the case when the pull request is merged.


@bones0 commented on GitHub (Dec 20, 2024):

> We will see if this is still the case when the pull request is merged.

This also works with the runners from the binary version in /tmp.


@bones0 commented on GitHub (Feb 24, 2025):

I managed to throw together some code which seems to distribute the load to the older cards (M40) only and omit the newer cards. It is based on https://github.com/ollama/ollama/issues/8066 and some improvisations of my own. This is not exactly what we need, but somebody may have a use case or an idea of what I missed. Currently just available as a diff; gpu.go most likely does not need to be changed:

diff --git a/discover/gpu.go b/discover/gpu.go
index ba906a18..37ebbd8a 100644
--- a/discover/gpu.go
+++ b/discover/gpu.go
@@ -66,8 +66,8 @@ var (
 // With our current CUDA compile flags, older than 5.0 will not work properly
 // (string values used to allow ldflags overrides at build time)
 var (
-       CudaComputeMajorMin = "5"
-       CudaComputeMinorMin = "0"
+       CudaComputeMajorMin = "3"
+       CudaComputeMinorMin = "5"
 )

 var RocmComputeMajorMin = "9"
diff --git a/discover/gpu_test.go b/discover/gpu_test.go
index 0c6ef7ba..7a010b9d 100644
--- a/discover/gpu_test.go
+++ b/discover/gpu_test.go
@@ -2,6 +2,7 @@ package discover

 import (
        "runtime"
+       "sort"
        "testing"

        "github.com/stretchr/testify/assert"
@@ -57,4 +58,58 @@ func TestByLibrary(t *testing.T) {
        }
 }

+func TestByVariant(t *testing.T) {
+       type testCase struct {
+               input  []GpuInfo
+               expect []GpuInfo
+       }
+
+       testCases := map[string]*testCase{
+               "empty":                {input: []GpuInfo{}, expect: []GpuInfo{}},
+               "one item, no variant": {input: []GpuInfo{{Library: "cpu"}}, expect: []GpuInfo{{Library: "cpu"}}},
+               "both v11":             {input: []GpuInfo{{Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v11"}}, expect: []GpuInfo{{Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v11"}}},
+               "v11, v12":             {input: []GpuInfo{{Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v12"}}, expect: []GpuInfo{{Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v12"}}},
+               "v12, v11":             {input: []GpuInfo{{Library: "cuda", Variant: "v12"}, {Library: "cuda", Variant: "v11"}}, expect: []GpuInfo{{Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v12"}}},
+       }
+
+       for k, v := range testCases {
+               t.Run(k, func(t *testing.T) {
+                       resp := append(make([]GpuInfo, 0, len(v.input)), v.input...)
+                       sort.Sort(ByVariant(resp))
+                       if len(resp) != len(v.expect) {
+                               t.Fatalf("expected length %d, got %d => %+v", len(v.expect), len(resp), resp)
+                       }
+                       for i := range resp {
+                               if resp[i].Variant != v.expect[i].Variant || resp[i].Library != v.expect[i].Library {
+                                       t.Fatalf("expected index %d, got %v wanted %+v", i, resp[i], v.expect[i])
+                               }
+                       }
+               })
+       }
+}
+
+func TestBestRunnerName(t *testing.T) {
+       type testCase struct {
+               input  GpuInfoList
+               expect string
+       }
+
+       testCases := map[string]*testCase{
+               "empty":                {input: []GpuInfo{}, expect: ""},
+               "one item, no variant": {input: []GpuInfo{{Library: "cpu"}}, expect: "cpu"},
+               "both v11":             {input: []GpuInfo{{Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v11"}}, expect: "cuda_v11"},
+               "v11, v12":             {input: []GpuInfo{{Library: "cuda", Variant: "v11"}, {Library: "cuda", Variant: "v12"}}, expect: "cuda_v11"},
+               "v12, v11":             {input: []GpuInfo{{Library: "cuda", Variant: "v12"}, {Library: "cuda", Variant: "v11"}}, expect: "cuda_v11"},
+       }
+
+       for k, v := range testCases {
+               t.Run(k, func(t *testing.T) {
+                       resp := v.input.BestRunnerName()
+                       if resp != v.expect {
+                               t.Fatalf("got %v wanted %+v", resp, v.expect)
+                       }
+               })
+       }
+}
+
 // TODO - add some logic to figure out card type through other means and actually verify we got back what we expected
diff --git a/discover/types.go b/discover/types.go
index c5212d94..6f8b294d 100644
--- a/discover/types.go
+++ b/discover/types.go
@@ -4,7 +4,11 @@ import (
        "fmt"
        "log/slog"

+       "sort"
+       "strings"
+
        "github.com/ollama/ollama/format"
+//     "github.com/ollama/ollama/runner"
 )

 type memInfo struct {
@@ -99,25 +103,27 @@ type UnsupportedGPUInfo struct {
        Reason string `json:"reason"`
 }

-// Split up the set of gpu info's by Library and variant
+// Split up the set of gpu info's by Library
+// This assumes the oldest version is compatible with the newest card, which may
+// not be the case if the user has a very new and very old GPU
 func (l GpuInfoList) ByLibrary() []GpuInfoList {
        resp := []GpuInfoList{}
        libs := []string{}
        for _, info := range l {
                found := false
-               requested := info.Library
-               if info.Variant != "" {
-                       requested += "_" + info.Variant
-               }
+//             requested := info.Library
+//             if info.Variant != "" {
+//                     requested += "_" + info.Variant
+//             }
                for i, lib := range libs {
-                       if lib == requested {
+                       if lib == info.Library {
                                resp[i] = append(resp[i], info)
                                found = true
                                break
                        }
                }
                if !found {
-                       libs = append(libs, requested)
+                       libs = append(libs, info.Library)
                        resp = append(resp, []GpuInfo{info})
                }
        }
@@ -147,6 +153,13 @@ func (a ByFreeMemory) Len() int           { return len(a) }
 func (a ByFreeMemory) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
 func (a ByFreeMemory) Less(i, j int) bool { return a[i].FreeMemory < a[j].FreeMemory }

+// Sort by Variant
+type ByVariant []GpuInfo
+
+func (a ByVariant) Len() int           { return len(a) }
+func (a ByVariant) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
+func (a ByVariant) Less(i, j int) bool { return strings.Compare(a[i].Variant, a[j].Variant) < 0 } // TODO do better than alpha sort
+
 type SystemInfo struct {
        System          CPUInfo              `json:"system"`
        GPUs            []GpuInfo            `json:"gpus"`
@@ -181,3 +194,20 @@ func (l GpuInfoList) FlashAttentionSupported() bool {
        }
        return true
 }
+
+
+func (l GpuInfoList) BestRunnerName() string {
+       if len(l) == 0 {
+               return ""
+       }
+       // Sort by variant, which will yield the oldest variant first
+       sgl := append(make(GpuInfoList, 0, len(l)), l...)
+       sort.Sort(ByVariant(sgl))
+       info := sgl[0]
+
+       requested := info.Library
+       if info.Variant != "" {
+               requested += "_" + info.Variant
+       }
+       return requested
+}
diff --git a/llm/server.go b/llm/server.go
index fd027a53..43364e96 100644
--- a/llm/server.go
+++ b/llm/server.go
@@ -236,7 +236,7 @@ func NewLlamaServer(gpus discover.GpuInfoList, model string, f *ggml.GGML, adapt
                }
        }

-       lib := gpus[0].RunnerName()
+       lib := gpus.BestRunnerName()
        requested := envconfig.LLMLibrary()
        if libs[requested] != "" {
                slog.Info("using requested gpu library", "requested", requested)

@prusnak commented on GitHub (Feb 25, 2025):

Fixed with https://github.com/ollama/ollama/pull/8567

Reference: github-starred/ollama#51668