[GH-ISSUE #7429] cuda device ordering inconsistent between runtime and management library #66781

Closed
opened 2026-05-04 08:10:34 -05:00 by GiteaMirror · 1 comment

Originally created by @Nepherpitou on GitHub (Oct 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7429

Originally assigned to: @dhiltgen on GitHub.

### What is the issue?

My GPU setup is:

  1. RTX 3090 - in the first PCIe slot (5.0 x16), but the secondary GPU
  2. RTX 4090 - in the second PCIe slot (4.0 x4), but the primary GPU

So, I have a weird bug with memory estimations. There are two calls that fetch device memory usage info:

  1. `C.cudart_bootstrap(*cHandles.cudart, C.int(i), &memInfo)` - here `i` is the device index (and becomes `gpuInfo.index`). In my case it's `0` for the 4090 at this step.
  2. `C.nvml_get_free(*cHandles.nvml, C.int(gpuInfo.index), &memInfo.free, &memInfo.total, &memInfo.used)` - here memory info is fetched for `gpuInfo.index`, but NVML orders the devices differently, and the 4090 is index `1`!

As a result, the estimated memory usage is about 2 GB for the RTX 3090 while nvidia-smi reports only about 300 MB, and about 300 MB for the RTX 4090 while nvidia-smi reports about 2 GB. This leads to a wrong layer split prediction.

It's not a big problem for me personally, since this setup doesn't work well with flash attention anyway, but it's still an issue.
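For illustration, here is a minimal standalone C sketch (not Ollama code, just an example against the plain CUDA runtime and NVML APIs that the wrappers above call into) that prints the device order each library reports. On a system like the one described, the two lists should come back in different orders, so the same numeric index points at different physical cards. The build command is an assumption for a Linux-style toolchain and needs adjusting on Windows.

```c
/*
 * Illustrative only: print the device order reported by the CUDA runtime
 * and by NVML. If the two libraries enumerate GPUs differently, the same
 * numeric index refers to different physical cards in each list.
 *
 * Assumed build (Linux-style; adjust paths/libs on Windows):
 *   gcc order_check.c -o order_check -I/usr/local/cuda/include \
 *       -L/usr/local/cuda/lib64 -lcudart -lnvidia-ml
 */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvml.h>

int main(void) {
    int cuda_count = 0;
    if (cudaGetDeviceCount(&cuda_count) != cudaSuccess) return 1;

    printf("CUDA runtime order:\n");
    for (int i = 0; i < cuda_count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  cudart index %d: %s\n", i, prop.name);
    }

    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    unsigned int nvml_count = 0;
    nvmlDeviceGetCount_v2(&nvml_count);

    printf("NVML order:\n");
    for (unsigned int i = 0; i < nvml_count; i++) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
        nvmlDeviceGetHandleByIndex_v2(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetUUID(dev, uuid, sizeof(uuid));
        printf("  nvml index %u: %s (%s)\n", i, name, uuid);
    }
    nvmlShutdown();
    return 0;
}
```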

### Ollama logs

```
time=2024-10-30T23:40:22.092+03:00 level=DEBUG source=gpu.go:562 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
CUDA driver version: 12.7
time=2024-10-30T23:40:22.139+03:00 level=DEBUG source=gpu.go:129 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
[GPU-4e64b2bc-98b0-d948-a660-7668c70aba4f] CUDA totalMem 24563 mb
[GPU-4e64b2bc-98b0-d948-a660-7668c70aba4f] CUDA freeMem 22994 mb
[GPU-4e64b2bc-98b0-d948-a660-7668c70aba4f] Compute Capability 8.9
time=2024-10-30T23:40:22.271+03:00 level=INFO source=gpu.go:326 msg="detected OS VRAM overhead" id=GPU-4e64b2bc-98b0-d948-a660-7668c70aba4f library=cuda compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" overhead="1.3 GiB"
[GPU-c06ff468-596d-5c2b-52ed-c764302de199] CUDA totalMem 24575 mb
[GPU-c06ff468-596d-5c2b-52ed-c764302de199] CUDA freeMem 23306 mb
[GPU-c06ff468-596d-5c2b-52ed-c764302de199] Compute Capability 8.6
time=2024-10-30T23:40:22.563+03:00 level=DEBUG source=amd_windows.go:35 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
releasing cuda driver library
releasing nvml library
time=2024-10-30T23:40:22.564+03:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-4e64b2bc-98b0-d948-a660-7668c70aba4f library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
time=2024-10-30T23:40:22.565+03:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-c06ff468-596d-5c2b-52ed-c764302de199 library=cuda variant=v12 compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
time=2024-10-30T23:40:35.137+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="127.2 GiB" before.free="95.7 GiB" before.free_swap="119.8 GiB" now.total="127.2 GiB" now.free="95.7 GiB" now.free_swap="119.6 GiB"
time=2024-10-30T23:40:35.152+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4e64b2bc-98b0-d948-a660-7668c70aba4f name="NVIDIA GeForce RTX 4090" overhead="1.3 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="286.3 MiB"
time=2024-10-30T23:40:35.167+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-c06ff468-596d-5c2b-52ed-c764302de199 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="21.9 GiB" now.used="2.1 GiB"
releasing nvml library
time=2024-10-30T23:40:35.187+03:00 level=DEBUG source=sched.go:225 msg="loading first model" model=I:\localai\models\ollama\blobs\sha256-9167b346a6e1f45064e0500cf8539572e5889ba631eecb40a3cab48338b6d7df
time=2024-10-30T23:40:35.187+03:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-10-30T23:40:35.188+03:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[21.9 GiB]"
time=2024-10-30T23:40:35.188+03:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.9 GiB]"
time=2024-10-30T23:40:35.189+03:00 level=DEBUG source=gpu.go:396 msg="updating system memory data" before.total="127.2 GiB" before.free="95.7 GiB" before.free_swap="119.6 GiB" now.total="127.2 GiB" now.free="95.7 GiB" now.free_swap="119.6 GiB"
time=2024-10-30T23:40:35.213+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-4e64b2bc-98b0-d948-a660-7668c70aba4f name="NVIDIA GeForce RTX 4090" overhead="1.3 GiB" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="286.3 MiB"
time=2024-10-30T23:40:35.229+03:00 level=DEBUG source=gpu.go:444 msg="updating cuda memory data" gpu=GPU-c06ff468-596d-5c2b-52ed-c764302de199 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="21.9 GiB" now.total="24.0 GiB" now.free="21.9 GiB" now.used="2.1 GiB"
releasing nvml library
time=2024-10-30T23:40:35.229+03:00 level=INFO source=llama-server.go:72 msg="system memory" total="127.2 GiB" free="95.7 GiB" free_swap="119.6 GiB"
time=2024-10-30T23:40:35.229+03:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=2 available="[22.5 GiB 21.9 GiB]"
time=2024-10-30T23:40:35.230+03:00 level=INFO source=memory.go:346 msg="offload to cuda" layers.requested=999 layers.model=81 layers.offload=55 layers.split=28,27 memory.available="[22.5 GiB 21.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="58.6 GiB" memory.required.partial="43.4 GiB" memory.required.kv="10.0 GiB" memory.required.allocations="[22.0 GiB 21.4 GiB]" memory.weights.total="45.4 GiB" memory.weights.repeating="44.5 GiB" memory.weights.nonrepeating="974.6 MiB" memory.graph.full="5.1 GiB" memory.graph.partial="5.1 GiB"
```

### nvidia-smi output

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.03                 Driver Version: 566.03         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   37C    P8             15W /  370W |      37MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090      WDDM  |   00000000:16:00.0  On |                  Off |
| 30%   34C    P0             61W /  450W |    1517MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```
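A plausible explanation for the differing order (my guess, not something confirmed in this issue): the CUDA runtime defaults to `CUDA_DEVICE_ORDER=FASTEST_FIRST`, which puts the 4090 at CUDA index 0, while NVML and nvidia-smi enumerate by PCI bus ID, which puts the 3090 (bus 01:00.0) first. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` before CUDA initializes makes the runtime use the same order, as in this hypothetical check:

```c
/*
 * Hypothetical check: force the CUDA runtime to enumerate by PCI bus ID so
 * its order matches nvidia-smi/NVML. The variable must be set before the
 * first CUDA call; on Windows use _putenv_s instead of setenv.
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 1);

    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("CUDA index %d: %s (bus %02x:%02x.0)\n",
               i, prop.name, (unsigned)prop.pciBusID, (unsigned)prop.pciDeviceID);
    }
    return 0;
}
```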

### OS

Windows

### GPU

Nvidia

### CPU

AMD

### Ollama version

0.4.0-rc5

GiteaMirror added the bug, nvidia, windows labels 2026-05-04 08:10:39 -05:00

@dhiltgen commented on GitHub (Nov 1, 2024):

It's unfortunate that two different libraries from NVIDIA order devices differently. That wasn't expected.

I'll have to add some additional logic to look up the device UUID and correlate via that instead of the index.
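For reference, here is a rough sketch of that idea (my illustration under stated assumptions, not the actual Ollama change): read the UUID the CUDA runtime reports for each device and resolve the matching NVML handle with `nvmlDeviceGetHandleByUUID`, so memory stats always refer to the same physical GPU regardless of either library's index order. The `format_gpu_uuid` helper is hypothetical, and the assumption that `cudaDeviceProp.uuid` bytes map directly onto NVML's "GPU-..." string would need verifying against the target driver.

```c
/*
 * Sketch only: correlate CUDA runtime devices with NVML devices by UUID
 * instead of by index. format_gpu_uuid() is a hypothetical helper; it
 * assumes cudaDeviceProp.uuid's 16 bytes map directly onto the canonical
 * "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" string NVML expects.
 */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvml.h>

static void format_gpu_uuid(const unsigned char *b, char *out, size_t len) {
    snprintf(out, len,
             "GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
             b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
             b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
}

int main(void) {
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;

    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        char uuid[64];
        format_gpu_uuid((const unsigned char *)prop.uuid.bytes, uuid, sizeof(uuid));

        /* Look up the NVML handle by UUID; the NVML index no longer matters. */
        nvmlDevice_t dev;
        nvmlMemory_t mem;
        if (nvmlDeviceGetHandleByUUID(uuid, &dev) == NVML_SUCCESS &&
            nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS) {
            printf("cudart %d (%s): free=%llu MiB used=%llu MiB\n",
                   i, prop.name,
                   (unsigned long long)(mem.free >> 20),
                   (unsigned long long)(mem.used >> 20));
        }
    }

    nvmlShutdown();
    return 0;
}
```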

Reference: github-starred/ollama#66781