[GH-ISSUE #8011] Underflow error when using GPU memory overhead #51639

Closed
opened 2026-04-28 20:40:53 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @ProjectMoon on GitHub (Dec 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8011

What is the issue?

GPUs:

  • AMD RX 6800 XT (16 GB VRAM)
  • Nvidia GTX 970 (4 GB VRAM)

I have discovered a very odd and very dangerous problem in ollama. I am running OpenWebUI on a machine that has a ROCm device (main GPU; 16 GB VRAM) and a CUDA device (an ancient Nvidia GPU). The Nvidia GPU has 4 GB of VRAM and I use it as a secondary GPU for small models such as embeddings.

I'm using the CUDA variant of the OpenWebUI Docker image, which allows it to run vector-search re-ranking models on CUDA. This loads the reranking model (BGE reranker in my case) onto the Nvidia GPU.

All of this is fine and dandy. But the problem comes when ollama tries to run the actual main LLM I'm using (Qwen2.5 14b q5_K_M in this case).

For some reason, it seems to completely skip choosing the ROCm GPU as the GPU to load the model on, and tries to load it on the CUDA device, which promptly fails. It doesn't even fall back to CPU. This persists through restarts of ollama. No matter what, it will not consider the AMD GPU or CPU for loading any LLM, and essentially renders OpenWebUI non-functional.

I've narrowed it down to OpenWebUI loading the reranker model on the CUDA device. If OpenWebUI is shut down, everything starts working fine again in ollama. Everything also works fine in OpenWebUI until it loads the reranker model.

  • But presumably this isn't specifically because of OpenWebUI. I imagine it would happen with anything external to ollama that takes over the CUDA device.

I can provide debug logs if necessary.

OS

Linux, Docker

GPU

Nvidia, AMD

CPU

AMD

Ollama version

0.5.1, 0.4.7

GiteaMirror added the bug label 2026-04-28 20:40:53 -05:00
Author
Owner

@rick-github commented on GitHub (Dec 9, 2024):

Logs are necessary.

Author
Owner

@ProjectMoon commented on GitHub (Dec 9, 2024):

[ollama-debug-log.txt](https://github.com/user-attachments/files/18060536/ollama-debug-log.txt)

Logs are attached. At first you can see it running the BGE embedding model on the Nvidia GPU, and Qwen2.5 (Arcee SuperNova, to be more specific) on the AMD GPU. But then it starts trying to load Qwen2.5 on the CUDA device repeatedly.

Author
Owner

@ProjectMoon commented on GitHub (Dec 9, 2024):

A few more details on this: `rocm-smi` reports that VRAM is emptied after the model stops, so it's not like the main GPU is full. At least not on the hardware itself. Dunno what ollama thinks.

Edit: although from the logs, it seems like ollama knows there's plenty of space available on the AMD card.

Author
Owner

@rick-github commented on GitHub (Dec 9, 2024):

It's a bug in the memory estimation logic. You have `OLLAMA_GPU_OVERHEAD=1G`, and at the point that ollama is trying to find space to fit the model, the CUDA device is down to 33M:

```
time=2024-12-09T11:46:49.712+01:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-64fa45ff-fe00-d712-1796-ed74da57bfa7 library=cuda total="3.9 GiB" available="33.6 MiB"
```

The code checks to see how much VRAM is available with [`if (gpus[i].FreeMemory - overhead) < stuff...`](https://github.com/ollama/ollama/blob/da09488fbfc437c55a94bc5374b0850d935ea09f/llm/memory.go#L185), and because `FreeMemory` is 33M, there's an underflow and wraparound and ollama thinks it has 16.384 exabytes to play with.
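
The arithmetic is easy to demonstrate. Below is a minimal, self-contained Go sketch, separate from the ollama source, using made-up values taken from the log above (~33 MiB free, 1 GiB overhead), showing both the wrapping subtraction and a guarded alternative:

```go
package main

import "fmt"

func main() {
	// Hypothetical values mirroring the log above: ~33 MiB free on the
	// CUDA device, with OLLAMA_GPU_OVERHEAD set to 1 GiB.
	var freeMemory uint64 = 33 * 1024 * 1024
	var overhead uint64 = 1024 * 1024 * 1024

	// Buggy pattern: both operands are uint64, so when overhead exceeds
	// freeMemory the subtraction wraps around to a huge value instead of
	// going negative.
	wrapped := freeMemory - overhead
	fmt.Printf("wrapped result: %d bytes (~%.0f EiB)\n", wrapped, float64(wrapped)/(1<<60))

	// Guarded pattern: verify the overhead fits before subtracting.
	if freeMemory < overhead {
		fmt.Println("overhead exceeds free VRAM; treat this GPU as full")
	} else {
		fmt.Printf("usable VRAM: %d bytes\n", freeMemory-overhead)
	}
}
```

Run as written, the wrapped value prints as roughly 16 EiB, the same absurd capacity the scheduler believes it has, which is why it keeps picking the full CUDA device instead of falling back to the AMD GPU or the CPU.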

Author
Owner

@ProjectMoon commented on GitHub (Dec 10, 2024):

Well, 16 exabytes sounds like it would definitely be enough space to load a language model. 🤔

I updated the title of the bug for easier searching. Thanks for figuring it out! For now I've disabled the memory overhead and things work much better (at least until I run into context errors, which is why I had the overhead set in the first place).
