[GH-ISSUE #6080] Incorrect free VRAM reporting when two CUDA cards with different VRAM capacities are installed, preventing Ollama from using GPU inference #3800

Closed
opened 2026-04-12 14:38:03 -05:00 by GiteaMirror · 5 comments

Originally created by @XJTU-WXY on GitHub (Jul 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6080

Originally assigned to: @dhiltgen on GitHub.

### What is the issue?

Dear ollama developer:
First of all, thank you very much for developing and maintaining ollama. Open source leads the world to a brighter future!

I use the _gemma2:27b_ model. My problem is:

- When my device only has a Tesla P40 (with 24 GB of VRAM) installed, ollama automatically uses GPU inference and runs very well.
- When I also install a Quadro K620 (with 2 GB of VRAM) for display output, ollama cannot use the P40 and is forced to use CPU inference.

**The nvidia-smi output is:**
![image](https://github.com/user-attachments/assets/5e3c2b91-7c40-44cc-b0ee-bb614b666032)
**The server.log is:**
[server-2.log](https://github.com/user-attachments/files/16433537/server-2.log)

I set the environment variable `CUDA_VISIBLE_DEVICES` to the UUID of my P40, and the log line `time=2024-07-31T04:44:20.388+08:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-085790c7-bee0-4de1-db17-6685d68470ca library=cuda compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"` suggests that ollama **has found the P40 card.** However, the next log line `time=2024-07-31T04:44:37.793+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=47 layers.offload=0 layers.split="" memory.available="[1.6 GiB]" memory.required.full="15.3 GiB" memory.required.partial="0 B" memory.required.kv="736.0 MiB" memory.required.allocations="[0 B]" memory.weights.total="14.4 GiB" memory.weights.repeating="13.5 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="509.0 MiB" memory.graph.partial="1.4 GiB"` shows that the **available VRAM was just 1.6 GiB**, which was exactly **the free VRAM of my K620**; ollama then used cpu_avx2 to run the inference.

My guess is that Ollama ignored the `CUDA_VISIBLE_DEVICES` environment variable I set, detected the free VRAM of the K620 instead of the P40, and fell back to CPU inference after finding that gemma2:27b could not run with only 1.6 GB of VRAM.
I don't know Golang, so I wonder if this is a bug in ollama's free-VRAM detection. I hope you can look into this problem, thanks!
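For context, `CUDA_VISIBLE_DEVICES` accepts either device indices or the GPU UUIDs printed by `nvidia-smi -L`, and only the listed devices should be visible to CUDA, in the listed order. A minimal Go sketch of the filtering a runtime is expected to perform (hypothetical types and names, not ollama's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// gpu is a hypothetical device record; ollama's real type lives in gpu/types.go.
type gpu struct {
	UUID string
	Name string
}

// visibleGPUs keeps only the devices whose UUID appears in
// CUDA_VISIBLE_DEVICES, in the order the variable lists them (which is
// also the order CUDA itself exposes them in). Unset means all visible.
func visibleGPUs(all []gpu) []gpu {
	env := os.Getenv("CUDA_VISIBLE_DEVICES")
	if env == "" {
		return all
	}
	var out []gpu
	for _, want := range strings.Split(env, ",") {
		for _, g := range all {
			if g.UUID == strings.TrimSpace(want) {
				out = append(out, g)
			}
		}
	}
	return out
}

func main() {
	all := []gpu{
		{UUID: "GPU-481cc3a7-596a-18f2-bdfe-3e2eab3612ba", Name: "Quadro K620"},
		{UUID: "GPU-085790c7-bee0-4de1-db17-6685d68470ca", Name: "Tesla P40"},
	}
	os.Setenv("CUDA_VISIBLE_DEVICES", "GPU-085790c7-bee0-4de1-db17-6685d68470ca")
	fmt.Println(visibleGPUs(all)) // expect only the Tesla P40
}
```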

### OS

Windows

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.3.0

GiteaMirror added the nvidia, bug, needs more info, windows labels 2026-04-12 14:38:03 -05:00

@dhiltgen commented on GitHub (Jul 30, 2024):

This sounds like it's a variation of #5239

Could you try running the server with `OLLAMA_DEBUG=1` set? Setting `CUDA_VISIBLE_DEVICES` should have caused it to ignore the smaller GPU as a workaround, but clearly there's another bug in there somewhere.


@XJTU-WXY commented on GitHub (Jul 31, 2024):

Thanks for your reply!
I've set `OLLAMA_DEBUG=1` and here is the server log: [server.log](https://github.com/user-attachments/files/16436096/server.log)

I noticed this log line: `time=2024-07-31T12:02:33.626+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-085790c7-bee0-4de1-db17-6685d68470ca name="Tesla P40" overhead="0 B" before.total="23.9 GiB" before.free="23.7 GiB" now.total="2.0 GiB" now.free="1.4 GiB" now.used="617.1 MiB"`

At first, Ollama detected the P40 correctly, but in this "updating cuda memory data" step it re-detected the VRAM of the P40 as 2 GB (actually the VRAM of the K620). I think this may be the problem.
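The `before.total="23.9 GiB"` versus `now.total="2.0 GiB"` pair in that line is telling: a card's total VRAM is a hardware constant, so a refresh that changes it must have read a different physical device. A hedged sketch of a sanity check that could catch this (hypothetical Go, not the actual gpu.go logic):

```go
package gpucheck

import "fmt"

// RefreshedFree is a hypothetical guard, not ollama's actual code. Total
// VRAM is fixed hardware, so if a memory refresh reports a different total
// for the same UUID, the reading almost certainly came from a different
// physical device (a device-enumeration mismatch) and should be rejected
// instead of overwriting the earlier, correct data.
func RefreshedFree(uuid string, beforeTotal, nowTotal, nowFree uint64) (uint64, error) {
	if nowTotal != beforeTotal {
		return 0, fmt.Errorf("gpu %s: total VRAM changed from %d to %d bytes, suspected device-order mismatch",
			uuid, beforeTotal, nowTotal)
	}
	return nowFree, nil
}
```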


@ahjavid commented on GitHub (Aug 1, 2024):

Hi @XJTU-WXY, I think I might have found a possible solution to your issue. Have you tried relocating your K620 to a different PCIe slot and moving the P40 to the top slot (primary slot) on your motherboard? This might allow Ollama to correctly detect the free VRAM of the P40 and use it for GPU inference. Let me know if this works for you!


@XJTU-WXY commented on GitHub (Aug 2, 2024):

Hello! In the meantime, I ran some tests and read through the relevant ollama code, and I found some problems.

First of all, I have found a way to make ollama correctly detect the VRAM of the CUDA cards on my device: list both cards' indices in the `CUDA_VISIBLE_DEVICES` environment variable in reversed order (**`CUDA_VISIBLE_DEVICES=1,0`**).

The reason for doing this is:
When I set `CUDA_VISIBLE_DEVICES` to only the P40, the VRAM detection was incorrect, as mentioned before. So I tried leaving it blank (equivalent to **`CUDA_VISIBLE_DEVICES=0,1`**), guessing that this would let Ollama detect both cards and no longer believe the VRAM was insufficient to run the inference, but it still didn't work. The log was like this: [server.log](https://github.com/user-attachments/files/16463743/server.log)

I noticed these logs:
`time=2024-08-02T08:33:01.218+08:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-085790c7-bee0-4de1-db17-6685d68470ca library=cuda compute=6.1 driver=12.4 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"`
`time=2024-08-02T08:33:01.218+08:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-481cc3a7-596a-18f2-bdfe-3e2eab3612ba library=cuda compute=5.0 driver=12.4 name="Quadro K620" total="2.0 GiB" available="1.6 GiB"`
They were printed by https://github.com/ollama/ollama/blob/ce1fb4447efc9958dcf279f7eb2ae6941bec1220/gpu/types.go#L102-L115
Then I noticed these logs:
`time=2024-08-02T08:33:30.963+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-085790c7-bee0-4de1-db17-6685d68470ca name="Tesla P40" overhead="0 B" before.total="23.9 GiB" before.free="23.7 GiB" now.total="2.0 GiB" now.free="1.1 GiB" now.used="881.9 MiB"`
`time=2024-08-02T08:33:30.978+08:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-481cc3a7-596a-18f2-bdfe-3e2eab3612ba name="Quadro K620" overhead="22.3 GiB" before.total="2.0 GiB" before.free="1.6 GiB" now.total="24.0 GiB" now.free="1.6 GiB" now.used="110.4 MiB"`
They were printed by https://github.com/ollama/ollama/blob/ce1fb4447efc9958dcf279f7eb2ae6941bec1220/gpu/gpu.go#L380-L423
As we can see, the order of the CUDA cards in the _"inference compute"_ logs is reversed compared to the actual order (according to the output of nvidia-smi, the K620 should be first and the P40 second), but the "updating cuda memory data" step detected VRAM in the actual order. This caused the P40 and K620 to swap their VRAM capacities, and the part of the K620's supposed VRAM that exceeded 2 GB was attributed to _OS VRAM overhead_ (I don't understand what this means; maybe shared VRAM?). As a result, both the P40 and the K620 were determined to have 2 GB of VRAM, Ollama believed that was not enough to run model inference, and it used CPU inference instead.
When I reversed the order of the two cards' indices in `CUDA_VISIBLE_DEVICES`, ollama could finally use GPU inference. The log was like this: [server-1.log](https://github.com/user-attachments/files/16463877/server-1.log). The order of the CUDA cards in _"inference compute"_ is correct, and therefore the VRAM detection is correct.

I know nothing about Golang and CUDA libraries, so I have no idea why these issues occurred.

@ahjavid Thanks for your suggestion, but the PCIe slot layout of my motherboard does not allow me to swap the positions of the two cards, so I can't verify whether the method you mentioned solves the problem. Sorry.
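A plausible explanation, though not confirmed in this thread: the CUDA runtime enumerates devices "fastest first" by default (the `CUDA_DEVICE_ORDER` environment variable, when set to `PCI_BUS_ID`, forces nvidia-smi's order instead), while the memory refresh evidently walks the cards in nvidia-smi's order, so the two lists can disagree exactly as these logs show. Matching refreshed readings to devices by UUID instead of by list position would make the merge immune to either ordering. A minimal Go sketch under those assumptions (hypothetical names, not ollama's actual code):

```go
package main

import "fmt"

// memInfo is a hypothetical per-device record keyed by the stable GPU UUID.
type memInfo struct {
	UUID  string
	Name  string
	Total uint64 // bytes; fixed per card
	Free  uint64 // bytes
}

// mergeByUUID copies refreshed free-memory readings onto the device list by
// matching UUIDs rather than positions, so it does not matter that the two
// sources enumerate the cards in different orders.
func mergeByUUID(devices []memInfo, refreshed []memInfo) {
	byUUID := make(map[string]memInfo, len(refreshed))
	for _, r := range refreshed {
		byUUID[r.UUID] = r
	}
	for i := range devices {
		if r, ok := byUUID[devices[i].UUID]; ok {
			devices[i].Free = r.Free // Total stays as first detected: it cannot change
		}
	}
}

func main() {
	const gib = uint64(1) << 30
	// Initial detection, CUDA order ("fastest first"): P40 before K620.
	devices := []memInfo{
		{UUID: "GPU-085790c7-bee0-4de1-db17-6685d68470ca", Name: "Tesla P40", Total: 24 * gib, Free: 23 * gib},
		{UUID: "GPU-481cc3a7-596a-18f2-bdfe-3e2eab3612ba", Name: "Quadro K620", Total: 2 * gib, Free: 1 * gib},
	}
	// The refresh arrives in the nvidia-smi order: K620 first (sizes illustrative).
	refreshed := []memInfo{
		{UUID: "GPU-481cc3a7-596a-18f2-bdfe-3e2eab3612ba", Free: 1 * gib},
		{UUID: "GPU-085790c7-bee0-4de1-db17-6685d68470ca", Free: 22 * gib},
	}
	mergeByUUID(devices, refreshed)
	for _, d := range devices {
		fmt.Printf("%s free=%d GiB\n", d.Name, d.Free/gib) // each card keeps its own reading
	}
}
```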


@dhiltgen commented on GitHub (Oct 22, 2024):

Is this still seen on the latest releases? If so, could you share an updated server log with `OLLAMA_DEBUG=1` set?

Reference: github-starred/ollama#3800