inconsistent CUDA error on codellama on an AMD iGPU (gfx1103, unsupported, with override) #3207

Closed
opened 2025-11-12 11:28:26 -06:00 by GiteaMirror · 12 comments
Owner

Originally created by @myyc on GitHub (Jun 16, 2024).

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

i'm trying to use codellama. i get an inconsistent experience depending on the question and i have no idea what is causing it. if i ask simple questions like "what is the capital of luxembourg" i get an answer right away. even questions with longer answers (e.g. "where is germany") seem fine. when i ask coding questions though, such as "give me a shell command to find all files in /path created in the past five minutes", i get this:

```
time=2024-06-16T15:00:22.313+02:00 level=WARN source=types.go:395 msg="invalid option provided" option=""
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_compute_forward at /build/ollama/src/ollama-rocm/llm/llama.cpp/ggml-cuda.cu:2360
  err
GGML_ASSERT: /build/ollama/src/ollama-rocm/llm/llama.cpp/ggml-cuda.cu:100: !"CUDA error"
```

i'm running ollama 0.1.44 with `HSA_OVERRIDE_GFX_VERSION=11.0.2` (`11.0.0` seems ok too). all on arch linux, so everything is up to date (as of 16 june 2024)
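(for reference: when ollama runs as a systemd service rather than in the foreground, the override can be made persistent with a drop-in file. a sketch, assuming the standard `ollama.service` unit installed by the install script:)

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.2"
```

then reload and restart with `systemctl daemon-reload && systemctl restart ollama`.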

**edit**: implicitly mentioned, but of course it's a bit strange to me that the error refers to CUDA, since i don't have an nvidia card or a CUDA launcher in the temp files. maybe this is just how the ollama codebase works? no idea

any clue?

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.1.44

GiteaMirror added the bug and amd labels 2025-11-12 11:28:26 -06:00
@Speedway1 commented on GitHub (Jun 16, 2024):

This is the Ollama forum.

You are using llamacpp. You should be on a different forum. However, I can tell you that Ollama doesn't seem to work on LLMs that are spread across more than 1 AMD GPU.

Llamacpp has no problem with this, though; it works well.

Make sure that you set the visible devices:

`ROCR_VISIBLE_DEVICES=0,1`

In my case I want to use the 2 Radeon 7900 XTX cards in my machine.

It's important to get the balance between the VRAM and the number of layers offloaded. For example, on my setup (see below) I can run a context window of only 6000 tokens and offload only 60 layers onto the AMD cards. This is my run command for the "Rhea" 70b model that is presently at the #1 spot on Huggingface's leaderboard:

```
llama.cpp/main -m /home/tmp/Rhea-72b-v0.5_q4_k_m.gguf -n 6000 -c 6000 -ngl 60 -i
```

You can see that I use the `-i` flag to drop it into interactive mode. Note that this takes up all the GPUs' VRAM even though I am using the quantised model. AMD is really missing a trick here. They can't catch Nvidia on chip advancement, but they could really take the lead in a massive way by simply offering today's GPUs (the 7900 range) with triple or more the VRAM. Suddenly everyone who wants to run LLMs locally would be wanting their GPUs, and that's a heck of a lot of people.

Here is my setup:

```
llamacpp@TH-AI2:~$ /opt/rocm/bin/rocm-smi

========================================== ROCm System Management Interface ==========================================
==================================================== Concise Info ====================================================
Device  [Model : Revision]  Temp    Power   Partitions      SCLK    MCLK     Fan    Perf  PwrCap       VRAM%  GPU%
        Name (20 chars)     (Edge)  (Avg)   (Mem, Compute)
======================================================================================================================
0       [0x5304 : 0xc8]     47.0°C  79.0W   N/A, N/A        189Mhz  1249Mhz  20.0%  auto  327.0W       86%    9%
        0x744c
1       [0x5304 : 0xc8]     48.0°C  81.0W   N/A, N/A        228Mhz  1249Mhz  20.0%  auto  327.0W       85%    9%
        0x744c
2       [0x8877 : 0xc3]     36.0°C  9.155W  N/A, N/A        None    1800Mhz  0%     auto  Unsupported  15%    0%
        0x164e
======================================================================================================================
```

@myyc commented on GitHub (Jun 17, 2024):

> This is the Ollama forum.
>
> You are using llamacpp. You should be on a different forum. However, I can tell you that Ollama doesn't seem to work on LLMs that are spread across more than 1 AMD GPU.

for the record, i am running ollama with `ollama serve`; i am of course not in control of what that command runs

@dhiltgen commented on GitHub (Jun 18, 2024):

Searching around online, I see some reports of others hitting this in other contexts on llama.cpp and solving the problem by upgrading their amdgpu driver to the latest version. The root cause is likely an interaction between llama.cpp's ggml layer, ROCm, and the amdgpu driver.

The reason "cuda" is mentioned in the logs is how llama.cpp leverages an adapter library in ROCm that mimics the cuda API to make it easier to port GPU code from nvidia to amd.
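That renaming can be illustrated with a toy sketch (function names hypothetical, mirroring the pattern rather than llama.cpp's actual macro list): a CUDA-named entry point is just an alias for a HIP-named one, so "CUDA" appears in error messages even though the HIP/ROCm path is doing the work.

```shell
# toy sketch of the aliasing pattern (names hypothetical):
# the CUDA-named wrapper simply delegates to the HIP-named implementation.
hipMemGetInfo() { echo "hip: querying device memory"; }
cudaMemGetInfo() { hipMemGetInfo "$@"; }

cudaMemGetInfo   # prints "hip: querying device memory"
```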

The latest release (0.1.45) bumps our bundled ROCm to a newer version, so it's possible that might resolve this.

@Speedway1 commented on GitHub (Jun 18, 2024):

> Searching around online, I see some reports of others hitting this in other contexts on llama.cpp and solving the problem by upgrading their amdgpu driver to the latest version. The root cause is likely an interaction between llama.cpp's ggml layer, ROCm, and the amdgpu driver.
>
> The reason "cuda" is mentioned in the logs is how llama.cpp leverages an adapter library in ROCm that mimics the cuda API to make it easier to port GPU code from nvidia to amd.
>
> The latest release (0.1.45) bumps our bundled ROCm to a newer version, so it's possible that might resolve this.

Thank you for this, will give it a go.

@Speedway1 commented on GitHub (Jun 18, 2024):

Doing an update, I only get 0.1.44, not 0.1.45. Is this not yet released?

@dhiltgen commented on GitHub (Jun 19, 2024):

0.1.45 is in pre-release right now. Once we fix a few more bugs we'll finalize it and it will be promoted to the latest release. Until then, you can target a specific version on linux with https://github.com/ollama/ollama/blob/main/docs/linux.md#installing-specific-versions or download the Mac/Windows artifacts from https://github.com/ollama/ollama/releases
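Per the linked doc, pinning a version on linux looks roughly like this (the version string here is the pre-release in question; check the linked page for the current form of the command):

```shell
# install a specific ollama release instead of the latest stable one
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.1.45 sh
```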

@myyc commented on GitHub (Jun 21, 2024):

@dhiltgen thanks for the info. i tried compiling a bunch of rocm packages (took way more time than i thought it would) and i'm still hitting the issue. either way, as you said, digging further i realised that the issue is likely fully on amd's side (either rocm or the drivers). for the record, rocm 6.1.2 still doesn't have `gfx1103`. a mystery as to why.

@Gingeropolous commented on GitHub (Jul 2, 2024):

Hi, I just installed 0.1.45 and still got the same error:

```
user@gpu1:~$ ollama run codellama
Error: llama runner process has terminated: signal: aborted (core dumped) CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_get_device_memory at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:3018
  hipMemGetInfo(free, total)
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:100: !"CUDA error"
```

though given the comment about ollama having issues with multiple GPUs, I think I may switch to getting llama.cpp to run (i intend to throw multiple of these vega64s i have from my mining days onto this thing).

@Gingeropolous commented on GitHub (Jul 3, 2024):

BTW, I had to manually install ROCm using this:

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick

(and the post install instructions).

I was under the assumption that ollama's install.sh would do all of this, but that doesn't seem to be the case.

OS: Ubuntu 22
GPU: AMD Vega 64
CPU: AMD Ryzen 3900X
Ollama version: 0.1.45

@darwinvelez58 commented on GitHub (Jul 12, 2024):

any update for this issue?

@myyc commented on GitHub (Jul 12, 2024):

> any update for this issue?

as far as i'm concerned it looks like the issue is on AMD's side, as the error logs are from ROCm. until it gets updated to properly support newer GPUs, i'd think this sort of behaviour is bound to happen.

@myyc commented on GitHub (Jul 26, 2024):

update.

as expected with these sorts of issues, it turns out this wasn't an ollama bug at all. the bios setting that increases VRAM on my laptop got wiped. i ticked it back on and have not experienced the issue again. this was my solution; hopefully it helps others too.

Reference: github-starred/ollama-ollama#3207