[GH-ISSUE #6456] Ollama not using 20GB of VRAM from Tesla P40 card #66098

Closed
opened 2026-05-03 23:57:58 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @Happydragun4now on GitHub (Aug 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6456

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I'm not sure if this is a bug, damaged hardware, or a driver issue, but I thought I would report it just in case.
Ollama sees 23.7 GiB available on each card when it detects them, but only 3.7 GiB on one of them when it tries to allocate memory. From the server logs:

time=2024-08-21T17:49:38.582-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-92d1b3ad-0ab8-2ece-050e-b4f5252f8098 library=cuda compute=6.1 driver=12.6 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"

time=2024-08-21T17:49:38.582-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-7a8dc17e-85e1-5bc8-e230-119d6be5252c library=cuda compute=6.1 driver=12.6 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"

layers.requested=-1 layers.model=81 layers.offload=48 layers.split=3,45 memory.available="[3.7 GiB 23.7 GiB]"
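For anyone triaging a similar report, the bracketed `memory.available` field can be pulled out of a scheduler log line with a few lines of Python. This is a small triage sketch based on the log format shown above; it is not part of Ollama itself:

```python
import re

def parse_available(log_line):
    """Extract the per-GPU free-memory list (in GiB) from an Ollama
    scheduler log line, e.g. memory.available="[3.7 GiB 23.7 GiB]"."""
    m = re.search(r'memory\.available="\[([^\]]+)\]"', log_line)
    if not m:
        return []
    # The field is a sequence of number/unit pairs: "3.7 GiB 23.7 GiB".
    tokens = m.group(1).split()
    return [float(value) for value, unit in zip(tokens[::2], tokens[1::2])]

line = ('layers.requested=-1 layers.model=81 layers.offload=48 '
        'layers.split=3,45 memory.available="[3.7 GiB 23.7 GiB]"')
free = parse_available(line)
print(free)  # [3.7, 23.7]
# A large spread between cards suggests something is holding VRAM on one GPU.
print(max(free) - min(free) > 10)  # True for this report
```

Comparing this list against what `nvidia-smi` reports makes it easy to spot whether the discrepancy is on Ollama's side or the driver's.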

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

GiteaMirror added the bug and needs more info labels 2026-05-03 23:57:58 -05:00

@rick-github commented on GitHub (Aug 22, 2024):

What's the output of nvidia-smi? If you can include a complete log, it may include details that give a better understanding of what's going on.


@Happydragun4now commented on GitHub (Aug 22, 2024):

I'm on Windows, BTW.
I ruled out a hardware issue because when I change the order of the devices in CUDA_VISIBLE_DEVICES, it changes which card loads 20 GB.
I'm still not sure if it's a driver issue, but I have tried CUDA 11.7, 12.4, and 12.6, as well as a few different server drivers for the P40s, and I have tried Ollama 0.3.6 and 0.3.7.

Sorry I can't get a screenshot right now but SMI shows the same as this:
llm_load_tensors: CUDA0 buffer size = 920.12 MiB
llm_load_tensors: CUDA1 buffer size = 21536.62 MiB
One card will load 20GB and the other will load around 1GB.

Here are the full logs:
[server1.log](https://github.com/user-attachments/files/16711549/server1.log)

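For context on the lopsided buffer sizes above: Ollama divides offloaded layers across GPUs based on each card's free memory. The following is a deliberately simplified Python sketch of that idea, not Ollama's actual algorithm — the real scheduler also reserves memory for the KV cache and compute graph per GPU, which is why its split (3,45 in the log) differs from this naive proportion:

```python
def split_layers(n_layers, free_gib):
    """Distribute n_layers across GPUs roughly in proportion to free VRAM.

    Simplified illustration only: a real scheduler subtracts per-GPU
    overhead (KV cache, compute graph) before splitting.
    """
    total = sum(free_gib)
    split = [int(n_layers * f / total) for f in free_gib]
    # Hand any rounding remainder to the GPU with the most free memory.
    split[free_gib.index(max(free_gib))] += n_layers - sum(split)
    return split

# With one card reporting only 3.7 GiB free, almost everything lands
# on the other card — matching the imbalance the reporter observed.
print(split_layers(48, [3.7, 23.7]))
```

The takeaway is that a single card with mysteriously low free VRAM skews the whole split, so the question "what is holding memory on GPU 0?" is the right one to ask.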

@rick-github commented on GitHub (Aug 22, 2024):

The reason I asked for the output of nvidia-smi is that it shows what processes are using the GPU. The log shows that one of the GPUs has only 3.6 GiB free:

time=2024-08-21T16:26:53.736-07:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1
layers.model=81 layers.offload=46 layers.split=2,44 memory.available="[3.6 GiB 23.7 GiB]"
memory.required.full="44.7 GiB" memory.required.partial="26.4 GiB" memory.required.kv="640.0 MiB"
memory.required.allocations="[3.1 GiB 23.3 GiB]" memory.weights.total="38.9 GiB" memory.weights.repeating="38.1 GiB"
memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"

If you can identify what is using 20 GB of VRAM on one of your cards, you might be able to free it up for model loading.

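As an aside for anyone who can't grab a screenshot: nvidia-smi can emit machine-readable CSV via `nvidia-smi --query-compute-apps=pid,used_memory --format=csv`, which lists per-process GPU memory use. A small parsing sketch follows; the sample text and its values are illustrative, and the exact header wording can vary by driver version:

```python
import csv
import io

def parse_compute_apps(csv_text):
    """Parse CSV output from nvidia-smi's --query-compute-apps query
    into (pid, used_memory) tuples."""
    rows = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return [(int(r["pid"]), r["used_memory"]) for r in rows]

# On a live system you would capture the real output, e.g.:
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# Sample output (illustrative values, not from this issue):
sample = "pid, used_memory\n1234, 20480 MiB\n5678, 920 MiB\n"
print(parse_compute_apps(sample))
```

If the per-process totals don't add up to what the summary view shows as used, the gap is driver- or OS-reserved memory rather than an application.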

@Happydragun4now commented on GitHub (Aug 22, 2024):

As far as I could tell, nothing was using it; it showed something like 1600/24000 MiB. Maybe it's getting reserved by something, or Ollama/CUDA isn't reading it properly?

![image](https://github.com/user-attachments/assets/a98854b7-34e5-43ae-b619-579eadc4f5de)
I found this image from when I was attempting to find a fix last night, sorry it doesn't have the processes


@Happydragun4now commented on GitHub (Aug 24, 2024):

This seemed to be due to the Quadro K2200; disabling it in Windows made the model load properly across the two P40s.

I have CUDA_VISIBLE_DEVICES set to the UUIDs of the P40s, so the Quadro shouldn't be detected, but maybe CUDA was confusing the available VRAM between the two cards?

Not sure if you want to keep the ticket open for investigation or would like to close it, but thanks for taking the time to look at this.

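For readers unfamiliar with UUID filtering: CUDA_VISIBLE_DEVICES restricts and reorders which devices the CUDA runtime exposes to a process. The following is a conceptual Python sketch of that behavior only — the real filtering happens inside the CUDA driver (which also accepts UUID prefixes), and the dict shape and third UUID here are illustrative:

```python
def visible_devices(all_gpus, env):
    """Mimic CUDA_VISIBLE_DEVICES filtering by UUID (conceptual sketch).

    The order of entries in the env var determines the enumeration
    order of the visible devices; unmatched entries are dropped.
    """
    spec = env.get("CUDA_VISIBLE_DEVICES")
    if spec is None:
        return all_gpus  # no filtering: every device is visible
    wanted = [s.strip() for s in spec.split(",")]
    by_uuid = {g["uuid"]: g for g in all_gpus}
    return [by_uuid[u] for u in wanted if u in by_uuid]

gpus = [
    {"uuid": "GPU-92d1b3ad", "name": "Tesla P40"},
    {"uuid": "GPU-7a8dc17e", "name": "Tesla P40"},
    {"uuid": "GPU-aaaaaaaa", "name": "Quadro K2200"},  # illustrative UUID
]
env = {"CUDA_VISIBLE_DEVICES": "GPU-92d1b3ad,GPU-7a8dc17e"}
print([g["name"] for g in visible_devices(gpus, env)])
```

In this model the Quadro is excluded cleanly, which is why the symptom here — a filtered-out card still influencing memory accounting — pointed at a discovery bug rather than user error.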

@dhiltgen commented on GitHub (Oct 17, 2024):

Please give the latest release a try (0.3.13), as we've fixed a number of bugs in GPU discovery that may resolve this. We should be sorting by VRAM now (largest available first) and specifying the order based on UUIDs to avoid potential misidentification of filtered out GPUs.

If you still see us incorrectly landing on the K2200 and crashing, please set OLLAMA_DEBUG=1 to increase log verbosity and share an updated full server log showing the startup and crash.

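The "sorting by VRAM (largest available first)" behavior described in the comment above can be sketched in a few lines of Python. The record shape here is hypothetical, not Ollama's actual Go struct; the point is only the ordering rule:

```python
def order_by_free_vram(gpus):
    """Return GPUs sorted so the card with the most free memory comes
    first, so scheduling prefers it (sketch of the described behavior)."""
    return sorted(gpus, key=lambda g: g["free_mib"], reverse=True)

gpus = [
    {"uuid": "GPU-aaaa", "name": "Quadro K2200", "free_mib": 3700},
    {"uuid": "GPU-bbbb", "name": "Tesla P40", "free_mib": 23700},
]
print([g["name"] for g in order_by_free_vram(gpus)])
```

Combined with addressing devices by UUID instead of index, this avoids the failure mode in this issue, where a filtered-out card could be conflated with a P40.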
Reference: github-starred/ollama#66098