Multi-GPU (AMD) Performance Regression in Ollama with ROCm 6.3.1 #5646

Open
opened 2025-11-12 13:05:29 -06:00 by GiteaMirror · 10 comments

Originally created by @konian71 on GitHub (Jan 30, 2025).

What is the issue?

Description

For the past two weeks, my second and third GPUs have been freezing during inference when running Ollama. After the crash, the VRAM remains full, but the GPUs stop processing. However, the system itself remains stable, and a soft reboot is no longer possible—only a hard reboot restores functionality.

Additionally, GPU performance has drastically decreased. Previously, Qwen2.5-Coder 32B ran at ~17 tokens/sec, but now it barely reaches 4 tokens/sec. Even models that fully fit into VRAM are underperforming.

System Information

CPU: AMD Ryzen Threadripper 3960X
RAM: 256 GB DDR4-3200
GPUs: 3 x AMD Radeon RX 7900 XTX
OS: Ubuntu 24.04 LTS Server
Ollama Version: (latest, post-update)
ROCm Version: 6.3.1 (with ROCm-SMI 5.7.0)

Key Findings & Symptoms

GPU Workload Distribution is Broken
GPUs are recognized but remain idle during inference.
Power consumption stays under 100W per GPU, which is far too low.
GPU clock speeds (sclk) remain at 0 MHz, meaning no actual computation occurs.
CPU utilization is very high (~80-95%), even when models should be running entirely in VRAM.
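
(For reference, a simple way to capture these symptoms while a prompt is in flight; a minimal sketch, assuming the default systemd install and a working rocm-smi:)

```
# Watch per-GPU clocks, power and load once per second during inference.
# The plain `rocm-smi` table already includes SCLK/MCLK, AvgPwr and GPU%.
watch -n 1 rocm-smi

# More targeted output, if these flags exist in your rocm-smi build
# (check `rocm-smi --help` first):
# rocm-smi --showclocks --showpower --showuse
```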

Specific Models Trigger Freezes
Mistral-Large frequently crashes the GPUs, requiring a hard reboot.
Llama3.3-70B is extremely slow but at least remains stable.
DeepSeek R1 32B only uses ~50% of VRAM, yet the GPUs remain idle.

Multi-GPU Scaling is Failing
Performance does not improve with multiple GPUs.
Even when a model fully fits into VRAM, Ollama does not utilize GPU compute units.
Restricting Ollama to a single GPU (CUDA_VISIBLE_DEVICES=0) sometimes improves stability (see the note below on the AMD equivalents of this variable).
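
A note on the variable above: ROCm builds of Ollama typically control GPU visibility through HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES rather than CUDA_VISIBLE_DEVICES. A minimal sketch for limiting the systemd-managed service to the first GPU (assuming the default ollama.service unit from the install script):

```
# Add a systemd drop-in so the ollama service only sees GPU 0.
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="ROCR_VISIBLE_DEVICES=0"
sudo systemctl restart ollama
```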

Possible ROCm Regression
Previously (ROCm 5.7.1 or earlier), everything worked fine.
After Ollama updated, ROCm was also upgraded to 6.3.1 automatically.
It is unclear whether this issue is caused by Ollama’s inference engine or a ROCm 6.3.1 regression.
Downgrading ROCm is not trivial, as Ollama depends on its installed version.

Troubleshooting Attempts
Setting the performance level to compute mode (rocm-smi --setperflevel compute) → No effect (see the syntax note after this list)
Manually setting GPU clocks (rocm-smi --setclk OD) → No effect
Checking GPU activity with rocm-smi -a → Compute Units remain inactive
Running Ollama with one GPU only (CUDA_VISIBLE_DEVICES=0) → Minor improvement, but still slow
Testing alternative models (DeepSeek R1 32B, Llama 3.3-70B, Qwen2.5-Coder 32B) → All underperform significantly
Checking alternative inference engines (vLLM, llama.cpp) → Pending tests
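
(Note: the perf-level and clock flags quoted above may not match rocm-smi's actual syntax; a hedged sketch of the more common invocation, worth verifying against `rocm-smi --help` on your release:)

```
# Pin the DPM/performance level high on all GPUs (revert with --setperflevel auto).
sudo rocm-smi --setperflevel high
```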

Next Steps & Questions
Is this a known issue with ROCm 6.3.1?
Has Ollama introduced a regression in its ROCm backend?
Would a downgrade to ROCm 5.7.1 restore previous performance?
Would switching to vLLM resolve the Multi-GPU scaling issues?
Are there workarounds to force GPU utilization properly under Ollama?
Any insights or suggestions would be greatly appreciated!

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.5.5 up to latest

GiteaMirror added the bug, gpu, amd labels 2025-11-12 13:05:29 -06:00

@rick-github commented on GitHub (Jan 30, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
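
On a default Linux install (systemd service), the server log referenced above can be captured roughly like this (a sketch; OLLAMA_DEBUG just adds verbosity):

```
# Dump the ollama service log for attachment to the issue.
journalctl -u ollama --no-pager > ollama.log

# Optional: enable debug logging, then reproduce and capture again.
#   sudo systemctl edit ollama   ->  Environment="OLLAMA_DEBUG=1"
#   sudo systemctl restart ollama
```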


@konian71 commented on GitHub (Jan 30, 2025):

Here it is:
[ollama.log](https://github.com/user-attachments/files/18609359/ollama.log)

```
root@ki: ollama ps
NAME                               ID              SIZE     PROCESSOR    UNTIL
qwen2.5-coder:32b-instruct-q8_0    f37bbf27ec01    54 GB    100% GPU     8 seconds from now

root@ki: rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    45.0c           80.0W   527Mhz  1249Mhz  0%   auto  283.0W   68%   28%
1    41.0c           68.0W   27Mhz   1249Mhz  0%   auto  283.0W   70%   0%
2    41.0c           68.0W   26Mhz   1249Mhz  0%   auto  283.0W   70%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================
```

Speed:
response_token/s = 3.84

Funny detail:
The fans on all GPUs stopped during inference.


@konian71 commented on GitHub (Feb 2, 2025):

I removed the second and third GPU from my system and it now runs stable. With only one GPU, qwen2.5-coder:32b-instruct-q8_0 produces about 4 tokens per second, even though 50% of the processing is done by the CPU. After reinstalling, I did not install ROCm; instead, Ollama uses its own bundled AMD library. With three GPUs I previously got 17 tokens per second on qwen2.5-coder:32b-instruct-q8_0, though not without errors in the log file. If I reinstall the AMD drivers, performance will probably drop back to 4 tokens per second, despite 100% GPU utilization. Right now, with one GPU, the split is about 50% GPU / 50% CPU, which gives 4 tokens/second.

I tested mistral-large with one GPU and it ran slowly but stably. With two GPUs it already fails: inference terminates unexpectedly, and one GPU (the second) gets stuck on shutdown with the following errors: `[drm] evicting device resources failed` and `amdgpu 000:23:00:0: amdgpu: Failed to disallow df cstate`.
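
(When a GPU wedges like this, the kernel log usually carries more context around the eviction failure; a quick, hedged check:)

```
# Look for amdgpu/drm errors (ring timeouts, reset attempts, eviction failures)
# around the time of the hang.
sudo dmesg -T | grep -iE 'amdgpu|drm' | tail -n 50
# or via journald:
journalctl -k -b | grep -iE 'amdgpu|drm'
```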


@konian71 commented on GitHub (Feb 2, 2025):

Hi all,

Just wanted to give a quick update: I reconnected all three AMD 7900XTX GPUs on my Ubuntu 24.04 minimal setup (just ran the Ollama install script, no additional ROCm) and inference has been running for four hours without crashes. I also noticed that the output is now consistently ~10-17 tokens/sec for an LLM with ~70 GB in VRAM and 0% CPU.

The following LLMs are now running stable:

  • qwen2.5-coder:32b-instruct.q8_0
  • llama3.3:70b-instruct_q8_0
  • deepseek-R1:70b-llama-distill_q8_0
  • deepseek-R1:32b-qwen-distill_q8_0
  • phi4:14b-q8_0

mistral-large:123b-instruct_q8_0 was 100% stable with one GPU, but not with two or three, so it's probably an issue with the AMD multi-GPU driver. Since my desired LLMs are now running fine, I'll leave the setup as is.

The problem is not completely solved, but I wanted to share these findings with you. Maybe it will help someone.
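
For anyone who wants to reproduce the "minimal setup" described above (Ollama with only its bundled ROCm libraries, no system-wide ROCm), a rough sketch; the install script URL is the standard one, and whether the system ROCm packages need to be removed first is an assumption that depends on your box:

```
# 1. (Optional, assumption) remove or hold back the system-wide ROCm packages
#    so they cannot be picked up instead of Ollama's bundled libraries.
# 2. Install (or reinstall) Ollama with the official script:
curl -fsSL https://ollama.com/install.sh | sh
# 3. Restart the service and re-test:
sudo systemctl restart ollama
```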


@melroy89 commented on GitHub (Feb 3, 2025):

A bit off-topic, but shouldn't Ollama update to [ROCm 6.3.2](https://rocm.docs.amd.com/en/docs-6.3.2/about/release-notes.html) directly?


@QyInvoLing commented on GitHub (Feb 12, 2025):

> Hi all,
>
> Just wanted to give a quick update: I reconnected all three AMD 7900XTX GPUs on my Ubuntu 24.04 minimal setup (just ran the Ollama install script, no additional ROCm) and inference has been running for four hours without crashes. I also noticed that the output is now consistently ~10-17 tokens/sec for an LLM with ~70 GB in VRAM and 0% CPU.
>
> The following LLMs are now running stable:
>
>   • qwen2.5-coder:32b-instruct.q8_0
>   • llama3.3:70b-instruct_q8_0
>   • deepseek-R1:70b-llama-distill_q8_0
>   • deepseek-R1:32b-qwen-distill_q8_0
>   • phi4:14b-q8_0
>
> mistral-large:123b-instruct_q8_0 was 100% stable with one GPU, but not with two or three, so it's probably an issue with the AMD multi-GPU driver. Since my desired LLMs are now running fine, I'll leave the setup as is.
>
> The problem is not completely solved, but I wanted to share these findings with you. Maybe it will help someone.

How many tokens can you get when running deepseek-R1:70b-llama-distill_q8_0 using three 7900xtx?


@konian71 commented on GitHub (Feb 13, 2025):

> How many tokens can you get when running deepseek-R1:70b-llama-distill_q8_0 using three 7900xtx?

70 billion parameters are too much for 72GB of VRAM, which makes it slow, with CPU/GPU utilization at 7%/93%. I estimate around 4 tokens per second. I can't test it properly at the moment because my setup currently has only two GPUs.
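
(A back-of-envelope check on that claim; q8_0 is roughly 8.5 bits per weight, and this ignores the KV cache and runtime overhead, so treat it as a rough lower bound:)

```
# ~70e9 parameters at ~8.5 bits each, expressed in GB of weights alone:
echo '70 * 10^9 * 8.5 / 8 / 10^9' | bc -l    # ≈ 74.4 GB, already above 72 GB of VRAM
```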


@JoshuaBowerman commented on GitHub (Feb 19, 2025):

I'm also seeing similar issues after updating, with a 7900 XTX: massively reduced performance, including with PyTorch. I'm guessing this is a problem with a newer version of ROCm or maybe an amdgpu driver change. I'm not seeing any reduction in graphics performance, so I think the hardware is fine. I'm also seeing high CPU utilization, even for extremely small models that fit 100% in VRAM.


@JoshuaBowerman commented on GitHub (Feb 20, 2025):

It seems that for me, Ollama was claiming to use the GPU when it was actually using the CPU. `ollama ps` showed 100% GPU even though the model was clearly loaded into RAM and running on the CPU; `rocm-smi` showed no VRAM or GPU usage during inference, while Ollama was consuming RAM and CPU.

I'm on Arch; uninstalling the `ollama` package and installing `ollama-rocm` instead fixed the issue for me. I'm not sure why Ollama claimed to be using the GPU when it definitely was not.

You're on Ubuntu, so my issue is likely unrelated, but I figured I'd add this comment anyway in case someone comes across this thread with the same problem.


@rick-github commented on GitHub (Feb 20, 2025):

The %GPU displayed by `ollama ps` is determined from the GPUs detected before the runner is started. When Ollama starts a runner and finds the GPU isn't usable, it falls back to the CPU runner but doesn't update the %GPU value.

Reference: github-starred/ollama-ollama#5646