[GH-ISSUE #5809] error loading model: unable to allocate backend buffer when model size is > VRAM with multiple GPUs #65657

Closed
opened 2026-05-03 22:06:22 -05:00 by GiteaMirror · 3 comments

Originally created by @rafal11ck on GitHub (Jul 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5809

What is the issue?

ollama run llava:34b write me a poem

Error: llama runner process has terminated: signal: aborted (core dumped) error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model

Hardware

System has 2 discrete GPUs:

  • AMD RX 7600 XT (16 GB)
  • NVIDIA GTX 1050 Ti (4 GB)

RAM: 48 GB
CPU: AMD 7600X

struggle

I tried manipulating the CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES environment variables. Setting either to -1 makes Ollama run on the remaining GPU.
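
A minimal sketch of that workaround, assuming the variables are set for the process that runs the Ollama server (for example the shell or systemd unit running ollama serve); device visibility behavior may differ across driver versions:

# Hide all CUDA (NVIDIA) devices so only the AMD GPU is visible to Ollama:
CUDA_VISIBLE_DEVICES=-1 ollama serve

# Or hide all HIP (AMD) devices so only the NVIDIA GPU is visible:
HIP_VISIBLE_DEVICES=-1 ollama serve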

Logs:
  • both.txt: https://github.com/user-attachments/files/16319474/both.txt
  • amd_only.txt: https://github.com/user-attachments/files/16319475/amd_only.txt
  • nvidia_only.txt: https://github.com/user-attachments/files/16319476/nvidia_only.txt

Excerpt from both.txt:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15863.15 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
  what():  unable to allocate backend buffer
time=2024-07-20T10:45:15.144+02:00 level=ERROR source=sched.go:480 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error loading model: unable to allocate backend buffer"

This happens with every model I tried whose size is > 16 GB, whenever both GPUs are available.

OS

Linux

GPU

Nvidia, AMD

CPU

AMD

Ollama version

ollama version is 0.2.1

GiteaMirror added the memory, bug labels 2026-05-03 22:06:22 -05:00

@rafal11ck commented on GitHub (Jul 20, 2024):

Updated to 0.2.7, same result.


@julioarruda commented on GitHub (Jul 25, 2024):

same here


@rafal11ck commented on GitHub (Feb 4, 2025):

Closing as stale, but it seems to be kinda resolved.

ollama version is 0.5.7

  • It now uses only one GPU, the AMD RX 7600 XT.
  • If the model doesn't fit, it offloads to the CPU.
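
A minimal sketch for confirming that behavior, assuming an Ollama build where ollama ps reports model placement (exact output columns may differ between versions):

# Load a model larger than the 16 GB of VRAM, then check placement;
# the PROCESSOR column shows how much of the model sits on CPU vs. GPU.
ollama run llava:34b "write me a poem"
ollama ps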
Reference: github-starred/ollama#65657