[GH-ISSUE #8206] MultiGPU ROCm #5239

Open
opened 2026-04-12 16:22:51 -05:00 by GiteaMirror · 13 comments

Originally created by @Schwenn2002 on GitHub (Dec 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8206

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

System:
CPU AMD Ryzen 9950X
RAM 128 GB DDR5
GPU0 AMD Radeon PRO W7900
GPU1 AMD Radeon RX7900XTX
ROCM: 6.3.1
Ubuntu 24.04 LTS (currently patched)

ERROR:
I start a large LLM (e.g. Llama-3.3-70B-Instruct-Q4_K_L) with Open WebUI and a context window of 32678, and get the following error in ollama:
Dec 22 03:52:04 ollama ollama[6345]: time=2024-12-22T03:52:04.990Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
Dec 22 03:52:41 ollama ollama[6345]: ROCm error: out of memory
Dec 22 03:52:41 ollama ollama[6345]: llama/ggml-cuda/ggml-cuda.cu:96: ROCm error

======================================= ROCm System Management Interface =======================================
============================================= Concise Info =============================================
Device  Node  IDs (DID, GUID)  Temp (Edge)  Power (Avg)  Partitions (Mem, Compute, ID)  SCLK     MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
0       1     0x7448, 54057    59.0°C       56.0W        N/A, N/A, 0                    651Mhz   96Mhz   20.0%  auto  241.0W  0%     82%
1       2     0x744c, 53541    40.0°C       75.0W        N/A, N/A, 0                    1301Mhz  456Mhz  0%     auto  327.0W  0%     39%
============================================ End of ROCm SMI Log ============================================

The VRAM on both cards is never fully utilized and the normal RAM is almost completely free. SWAP is not used.
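(For reference, the snapshot above is rocm-smi output; VRAM and GPU utilization can be watched live while reproducing the error, assuming the standard ROCm tools are installed:)

watch -n 1 rocm-smi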

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.5.4

GiteaMirror added the gpu, amd, bug labels 2026-04-12 16:22:51 -05:00

@rick-github commented on GitHub (Dec 21, 2024):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
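For a systemd-managed install like the one in the original report, the logs can typically be pulled from the journal (a sketch; the unit name ollama matches the standard Linux install and the journal prefix in the error above):

# follow the ollama service log while reproducing the error
journalctl -u ollama -f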

@Schwenn2002 commented on GitHub (Dec 21, 2024):

in the LOG:
Dez 21 21:32:22 ollama ollama[29734]: llama_model_load: vocab only - skipping tensors
Dez 21 21:32:22 ollama ollama[29734]: ggml_cuda_compute_forward: ROPE failed
Dez 21 21:32:22 ollama ollama[29734]: ROCm error: no kernel image is available for execution on the device
Dez 21 21:32:22 ollama ollama[29734]: current device: 1, in function ggml_cuda_compute_forward at llama/ggml-cuda/ggml-cuda.cu:2218
Dez 21 21:32:22 ollama ollama[29734]: err
Dez 21 21:32:22 ollama ollama[29734]: llama/ggml-cuda/ggml-cuda.cu:96: ROCm error

@Schwenn2002 commented on GitHub (Dec 21, 2024):

another log:
Dez 21 22:17:51 ollama ollama[50591]: ROCm error: out of memory
Dez 21 22:17:51 ollama ollama[50591]: current device: 1, in function alloc at llama/ggml-cuda/ggml-cuda.cu:301
Dez 21 22:17:51 ollama ollama[50591]: ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Dez 21 22:17:51 ollama ollama[50591]: llama/ggml-cuda/ggml-cuda.cu:96: ROCm error

@rick-github commented on GitHub (Dec 21, 2024):

Full logs.

@Schwenn2002 commented on GitHub (Dec 21, 2024):

The Llama-3.3-70B-Instruct-Q4_K_L model runs via Open WebUI (Docker) and uses ollama on the same host. In the chat I only get the answer "Oops! No text generated from Ollama, Please try again.". The same chat works with llama 3.1 8b fp16. The RAM is largely unused, over 100 GB still free.

Here is the log:

ollama.log (https://github.com/user-attachments/files/18220123/ollama.log)

@rick-github commented on GitHub (Dec 22, 2024):

Does it improve if you restrict the number of parallel completions to one by setting OLLAMA_NUM_PARALLEL=1 in the server environment?
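For a systemd-managed install, a minimal drop-in sketch (path and service name assume the standard Linux install; only this one variable is shown):

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"

Then reload systemd and restart the ollama service for the change to take effect.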

@Schwenn2002 commented on GitHub (Dec 22, 2024):

Unfortunately the same result :-(

ollama_2.log (https://github.com/user-attachments/files/18221593/ollama_2.log)

@Schwenn2002 commented on GitHub (Jan 2, 2025):

I have now set the following parameters in the file /etc/systemd/system/ollama.service.d/override.conf:

[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
Environment="OLLAMA_MAX_QUEUE=256"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_FLASH_ATTENTION=1"
#Environment="OLLAMA_LLM_LIBRARY=[rocm_v6 cpu_avx2]"
Environment="OLLAMA_DEBUG=0"

Since then, multi-GPU usage has worked. I understood that OLLAMA_FLASH_ATTENTION=1 is meant for NVIDIA, but I have the impression that it also influences the RAM allocation on ROCm.
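For reference, a drop-in like this only takes effect after reloading systemd and restarting the service (standard systemd workflow, nothing ollama-specific):

sudo systemctl daemon-reload
sudo systemctl restart ollama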

@rick-github commented on GitHub (Jan 2, 2025):

OLLAMA_LLM_LIBRARY="[rocm_v6 cpu_avx2]" is invalid; the variable needs to be a string prefix. It would be interesting to see the logs of successful completions.
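Purely illustrative, i.e. a single prefix string rather than a list (the exact runner names depend on the build, so treat this value as a placeholder):

Environment="OLLAMA_LLM_LIBRARY=rocm"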

@Schwenn2002 commented on GitHub (Jan 7, 2025):

I simply deleted the OLLAMA_LLM_LIBRARY parameter and now use the default value.

@realGWM commented on GitHub (Mar 25, 2026):

Hi!

I've bought a second GPU just for running ollama, and it seems that I've encountered the same bug :(

System specs:

EndeavourOS Linux x86_64
AMD Ryzen 7 7800X3D
Radeon RX 6800 XT
Radeon RX 9060 XT
ollama (installed from package ollama-rocm) version: 0.18.2

rocminfo: rocminfo.txt (https://github.com/user-attachments/files/26240609/rocminfo.txt)

I can run ollama on either single GPU and it works:
ollama-serve-rx9060xt.txt (https://github.com/user-attachments/files/26240624/ollama-serve-rx9060xt.txt)
ollama-serve-rx6800xt.txt (https://github.com/user-attachments/files/26240625/ollama-serve-rx6800xt.txt)

but whenever I try to run it on both, it fails with a "no kernel image is available for execution on the device" ROCm error :(
ollama-serve-both.txt (https://github.com/user-attachments/files/26240654/ollama-serve-both.txt)

I don't have a ~/.ollama/logs/server.log file, probably because I'm running ollama serve manually?

Please let me know if I can provide any other logs / system info to help debug this! I'm also willing to test custom commits/builds or do anything else that might help in debugging this!

OLLAMA_NUM_PARALLEL=1 does not solve the issue, unfortunately.
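(For reference, a manual run can still produce a full debug log, assuming stdout/stderr carry the server output:)

OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama-serve.log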

@rick-github commented on GitHub (Mar 25, 2026):

I don't know if this will work, but try using the Vulkan driver. Install ollama-vulkan (https://archlinux.org/packages/extra/x86_64/ollama-vulkan/) and set OLLAMA_VULKAN=1 in the server environment.
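On an Arch-based system that would look roughly like this (a sketch; the package name comes from the link above, and the server is started manually as in the previous comment):

sudo pacman -S ollama-vulkan
OLLAMA_VULKAN=1 ollama serve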

@realGWM commented on GitHub (Mar 25, 2026):

Thank you!

The Vulkan version seems to be working and properly uses both GPUs!
