Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details #7519

Closed
opened 2025-11-12 14:09:28 -06:00 by GiteaMirror · 2 comments
Owner

Originally created by @chandujr on GitHub (Jul 11, 2025).

What is the issue?

OS: Nobara Linux 42
RAM: 16GB, VRAM: 12GB
CPU: AMD Ryzen 9 5980HX
GPU: AMD Radeon RX 6800M (dGPU)
Ollama: v0.9.5, installed using the manual method along with the ROCm package.

I get this error as soon as I send a query to any model (deepseek-r1:1.5b in this case):

Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details

This is the ollama ps output after the crash:

NAME                ID              SIZE      PROCESSOR    UNTIL
deepseek-r1:1.5b    e0979632db5a    2.0 GB    100% GPU     2 minutes from now

This is the service file at /etc/systemd/system/ollama.service:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.1"

[Install]
WantedBy=multi-user.target
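
The HSA_OVERRIDE_GFX_VERSION setting above is keyed to the gfx target the GPU reports, which ROCm's rocminfo tool prints for each agent. A minimal sketch of pulling the first target out of rocminfo-style output (a sample string stands in for a live GPU here; on a real system the pipeline would read from rocminfo directly):

```shell
# Sketch: extract the first gfx target from rocminfo-style output.
# On a live system this would be:
#   rocminfo | grep -o 'gfx[0-9a-f]*' | head -n1
# A sample of the relevant lines is used here in place of real output.
sample='  Name:                    gfx1031
  Name:                    gfx90c'
printf '%s\n' "$sample" | grep -o 'gfx[0-9a-f]*' | head -n1   # first listed (dGPU) target
```

The first match is typically the discrete GPU; the iGPU (Vega, gfx90c here) shows up as a second agent.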

This is the rocm-smi -a output:

============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.15.5-200.nobara.fc42.x86_64
==========================================================================================
=========================================== ID ===========================================
GPU[0]		: Device Name: 		Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT]
GPU[0]		: Device ID: 		0x73df
GPU[0]		: Device Rev: 		0xc3
GPU[0]		: Subsystem ID: 	Radeon RX 6800M
GPU[0]		: GUID: 		45052
GPU[1]		: Device Name: 		Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
GPU[1]		: Device ID: 		0x1638
GPU[1]		: Device Rev: 		0xc7
GPU[1]		: Subsystem ID: 	Radeon Vega 8
GPU[1]		: GUID: 		8510
==========================================================================================
======================================= Unique ID ========================================
GPU[0]		: Unique ID: N/A
GPU[1]		: Unique ID: N/A
==========================================================================================
========================================= VBIOS ==========================================
GPU[0]		: VBIOS version: SWBRT86018.001
GPU[1]		: VBIOS version: 113-CEZANNE-018
==========================================================================================
====================================== Temperature =======================================
GPU[0]		: Temperature (Sensor edge) (C): 49.0
GPU[0]		: Temperature (Sensor junction) (C): 50.0
GPU[0]		: Temperature (Sensor memory) (C): 48.0
GPU[1]		: Temperature (Sensor edge) (C): 56.0
==========================================================================================
=============================== Current clock frequencies ================================
GPU[0]		: dcefclk clock level: 0: (417Mhz)
GPU[0]		: fclk clock level: 1: (840Mhz)
GPU[0]		: mclk clock level: 0: (96Mhz)
GPU[0]		: sclk clock level: 0: (0Mhz)
GPU[0]		: socclk clock level: 1: (533Mhz)
GPU[0]		: pcie clock level: 1 (8.0GT/s x8)
Exception caught: map::at
GPU[1]		: fclk clock level: 1: (1600Mhz)
GPU[1]		: mclk clock level: 1: (1600Mhz)
GPU[1]		: sclk clock level: 1: (400Mhz)
GPU[1]		: socclk clock level: 0: (400Mhz)
==========================================================================================
=================================== Current Fan Metric ===================================
GPU[0]		: Fan Level: 122 (48%)
GPU[0]		: Fan RPM: 0
GPU[1]		: Not supported
==========================================================================================
================================= Show Performance Level =================================
GPU[0]		: Performance Level: auto
GPU[1]		: Performance Level: auto
==========================================================================================
==================================== OverDrive Level =====================================
GPU[0]		: get_overdrive_level_sclk, Not supported on the given system
GPU[1]		: get_overdrive_level_sclk, Not supported on the given system
==========================================================================================
==================================== OverDrive Level =====================================
GPU[0]		: get_mem_overdrive_level_mclk, Not supported on the given system
GPU[1]		: get_mem_overdrive_level_mclk, Not supported on the given system
==========================================================================================
======================================= Power Cap ========================================
GPU[0]		: Max Graphics Package Power (W): 130.0
GPU[1]		: get_power_cap, Not supported on the given system
GPU[1]		: Max Graphics Package Power Unsupported
==========================================================================================
================================== Show Power Profiles ===================================
GPU[0]		: 1. Available power profile (#1 of 7): CUSTOM
GPU[0]		: 2. Available power profile (#2 of 7): VIDEO
GPU[0]		: 3. Available power profile (#3 of 7): POWER SAVING
GPU[0]		: 4. Available power profile (#4 of 7): COMPUTE
GPU[0]		: 5. Available power profile (#5 of 7): VR
GPU[0]		: 6. Available power profile (#6 of 7): 3D FULL SCREEN
GPU[0]		: 7. Available power profile (#7 of 7): BOOTUP DEFAULT*
python3: /builddir/build/BUILD/rocm-smi-6.3.1-build/rocm_smi_lib-rocm-6.3.1/src/rocm_smi.cc:1226: rsmi_status_t get_power_profiles(uint32_t, rsmi_power_profile_status_t*, std::map<rsmi_power_profile_preset_masks_t, unsigned int>*): Assertion `p->current != RSMI_PWR_PROF_PRST_INVALID' failed.
fish: Job 1, 'rocm-smi -a' terminated by signal SIGABRT (Abort)

These are the server logs captured during the model run and crash (journalctl -u ollama --no-pager --follow --pager-end):
https://gist.github.com/chandujr/b38687a786ae88f00559dabb6ba4cab1 (gist because of character limit here)

Relevant log output


OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.9.5

GiteaMirror added the bug label 2025-11-12 14:09:28 -06:00
Author
Owner

@rick-github commented on GitHub (Jul 11, 2025):

Jul 11 20:43:51 chandupc ollama[1839]: ggml_cuda_compute_forward: RMS_NORM failed
Jul 11 20:43:51 chandupc ollama[1839]: ROCm error: invalid device function
Jul 11 20:43:51 chandupc ollama[1839]:   current device: 0, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2366
Jul 11 20:43:51 chandupc ollama[1839]:   err
Jul 11 20:43:51 chandupc ollama[1839]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:76: ROCm error
Jul 11 20:43:51 chandupc ollama[1839]: Memory critical error by agent node-0 (Agent handle: 0x5614cbbc6260) on address 0x7f5534200000. Reason: Memory in use.
Jul 11 20:43:51 chandupc ollama[1839]: SIGABRT: abort

Could be the same as #11123.

Author
Owner

@chandujr commented on GitHub (Jul 11, 2025):

I solved the problem. I did two things:

First, in the /etc/systemd/system/ollama.service file, HSA_OVERRIDE_GFX_VERSION should have been 10.3.0 instead of 10.3.1. After updating the file, I restarted ollama:

sudo systemctl daemon-reload
sudo systemctl restart ollama
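
The corrected value lines up with the card's gfx target: Navi 22 reports gfx1031, and ROCm ships RDNA2 kernels built for gfx1030, so the override zeroes the last digit to 10.3.0. A hypothetical helper sketching that mapping (gfx_to_override is not a real tool, and the zero-the-patch-digit heuristic is specific to RDNA2/gfx103x cards):

```shell
# Hypothetical helper (not part of Ollama or ROCm): derive an
# HSA_OVERRIDE_GFX_VERSION value from a gfx target name. For RDNA2
# (gfx103x) the working override is 10.3.0, because ROCm ships
# kernels built for gfx1030 -- hence gfx1031 -> 10.3.0.
gfx_to_override() {
  local t="${1#gfx}"                          # gfx1031 -> 1031
  printf '%s.%s.0\n' "${t:0:2}" "${t:2:1}"    # major.minor, patch zeroed
}

gfx_to_override gfx1031
```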

But in parallel I was also fixing an unrelated Docker error on the system. It turned out my user was not in the kvm group, which Docker requires for virtualization support; I'm not sure whether that matters for Ollama. In any case, after adding the user to kvm and rebooting, everything works properly.
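
As a side note, the HSA_OVERRIDE_GFX_VERSION change above could also be applied as a systemd drop-in rather than by editing the unit file in place, so it survives a reinstall of the ollama package. A sketch, using the standard systemd drop-in convention:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (created with: sudo systemctl edit ollama)
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
```

After saving the drop-in, restart the service with sudo systemctl restart ollama.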


Reference: github-starred/ollama-ollama#7519