[GH-ISSUE #6685] AMD 7900XTX fails with "Could not initialize Tensile host: No devices found" #29967

Closed
opened 2026-04-22 09:20:34 -05:00 by GiteaMirror · 51 comments
Owner

Originally created by @svaningelgem on GitHub (Sep 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6685

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I installed the AMD drivers with https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/native-install/ubuntu.html ✔️

OS: Ubuntu 24.04.1 LTS
ROCm: 6.2.0
CPU: AMD Ryzen 9 7950X3D
GPU: Radeon RX 7900 XTX
model: llama3.1

Started with:
docker run --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm

Then tried to start llama3.1 with (I pulled it first successfully):
OLLAMA_DEBUG=1 ollama run llama3.1

Log file:
ollama.log

It looks like it is detecting the GPU correctly at the start of the container, but somehow fails to use it?

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.3.9

GiteaMirror added the docker and bug labels 2026-04-22 09:20:34 -05:00

@svaningelgem commented on GitHub (Sep 7, 2024):

Other (closed) issues I found concerning similar behavior: #4798, #6165 (but no real responses with a solution I could try there)


@rick-github commented on GitHub (Sep 7, 2024):

What's the output of ls -la /dev/kfd* /dev/dri*? Have you tried running the container with --device as mentioned in the docs?


@svaningelgem commented on GitHub (Sep 7, 2024):

Ok, I tried it:

Output of the ls:

$ find /dev/kfd* /dev/dri* | xargs ls -ld
drwxr-xr-x  3 root root        100 sep  7 16:43 /dev/dri
drwxr-xr-x  2 root root         80 sep  7 16:43 /dev/dri/by-path
lrwxrwxrwx  1 root root          8 sep  7 16:43 /dev/dri/by-path/pci-0000:03:00.0-card -> ../card1
lrwxrwxrwx  1 root root         13 sep  7 16:43 /dev/dri/by-path/pci-0000:03:00.0-render -> ../renderD128
crw-rw----+ 1 root video  226,   1 sep  7 16:43 /dev/dri/card1
crw-rw----+ 1 root render 226, 128 sep  7 16:43 /dev/dri/renderD128
crw-rw----  1 root video  235,   0 sep  7 16:43 /dev/kfd

Output of the updated command:

docker run --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --replace --name ollama ollama/ollama:rocm |& tee -a "/var/log/ollama/$(date +%Y-%m-%d).log"

2024-09-07.log

It looks the same to me as before at first glance, though.


@Froggy232 commented on GitHub (Sep 7, 2024):

Hi,
I'm in a very similar situation I think, except I use podman and Fedora Silverblue.
I have the same error messages, and also this one: error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
I run Silverblue 41 beta, but I had the exact same problem a few weeks ago on Fedora Silverblue 40. Running the CPU image works well.
Thanks for your help!

Edit: Sorry, in my case it was SELinux; I had to set container_use_devices to on.
Thanks for your software!
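
For anyone hitting the same SELinux case, that boolean can be flipped persistently like so (a sketch, assuming the stock container-selinux policy provides the boolean under this name):

# persistently allow containers to use host devices under SELinux
sudo setsebool -P container_use_devices on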


@dhiltgen commented on GitHub (Sep 9, 2024):

@svaningelgem does following these instructions resolve the issue, or is there something else preventing GPU access on your system?


@svaningelgem commented on GitHub (Sep 9, 2024):

Hi @dhiltgen , it seems to me that I don't need to force anything as my GPU is supported by default:

time=2024-09-07T14:58:25.149Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-09-07T14:58:25.150Z level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
time=2024-09-07T14:58:25.163Z level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=0 gpu_type=gfx1100
time=2024-09-07T14:58:25.163Z level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx1100 driver=6.8 name=1002:744c total="24.0 GiB" available="23.3 GiB"

I also have only 1 GPU in my system (unless it sees my CPU as a GPU as well, but I don't see that appearing in the logs).

rocminfo:

$ /opt/rocm/bin/rocminfo | grep gfx
  Name:                    gfx1100 

Full rocminfo output: rocminfo.log


@dhiltgen commented on GitHub (Sep 9, 2024):

@svaningelgem to clarify, the startup messages you're seeing are based on Ollama code looking in sysfs to discover the GPUs. This is different from performing inference, where C++ code is using ROCm libraries to access the device directly. I don't have a system handy to test at the moment, but it's plausible SELinux may only be involved in the device access via ROCm, and not the sysfs discovery at startup. If you haven't tried those steps, I would give them a try so we can rule them out as a possible root cause.
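
A rough way to see those two paths side by side (illustrative only; the exact sysfs layout can vary by kernel version):

# discovery path: sysfs is world-readable, no special group membership needed
ls /sys/class/kfd/kfd/topology/nodes/
# inference path: ROCm needs read-write access to the device nodes
for d in /dev/kfd /dev/dri/renderD128; do
  [ -r "$d" ] && [ -w "$d" ] && echo "$d: rw ok" || echo "$d: no rw access"
done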


@svaningelgem commented on GitHub (Sep 10, 2024):

SELinux is (afaik) a feature of Red Hat-based systems; on Ubuntu it's AppArmor. That might interfere too, but I couldn't immediately find anything there. If it were blocked, I'd expect the GPU simply not to be reported; however, it does get reported as an AMD GPU, so that leads me to conclude it's not blocked.

Could you maybe tell me what I can try and iterate on, and I'll do this when I get back home? Maybe also tell me how to enable debug logging, as I don't see anything wrong until it just crashes with my initial message...

Because as far as I can see, the gfx1100 should be fine. [FYI: Linux kernel patches]

Thanks!


@svaningelgem commented on GitHub (Sep 10, 2024):

Hmmm, I got this list from ChatGPT (so take it with a grain of salt as I couldn't verify the contents):

RDNA 3 (GFX1100 series)

  • GFX1102 - Radeon RX 7900 XTX, RX 7900 XT
  • GFX1103 - Radeon RX 7800 XT, RX 7700 XT

RDNA 2 (GFX1030 series)

  • GFX1030 - Radeon RX 6900 XT, RX 6800 XT, RX 6800
  • GFX1031 - Radeon RX 6700 XT
  • GFX1032 - Radeon RX 6600, RX 6600 XT

RDNA 1 (GFX1010 series)

  • GFX1010 - Radeon RX 5700 XT, RX 5700
  • GFX1011 - Radeon RX 5600 XT
  • GFX1012 - Radeon RX 5500 XT

Vega (GFX900 series)

  • GFX906 - Radeon VII, Radeon Instinct MI50, MI60
  • GFX900 - Radeon RX Vega 64, Vega 56

Polaris (GFX803/804 series)

  • GFX804 - Radeon RX 590, RX 580, RX 570 (Polaris 20)
  • GFX803 - Radeon RX 480, RX 470 (Polaris 10)

Navi 1X (GFX1010/1011)

  • GFX1010 - Radeon RX 5700 Series
  • GFX1011 - Radeon RX 5500 Series

Navi 2X (GFX1030/1031)

  • GFX1030 - Radeon RX 6800 Series, RX 6900 Series
  • GFX1031 - Radeon RX 6700 Series

Navi 3X (GFX1100)

  • GFX1102 - Radeon RX 7900 Series

@svaningelgem commented on GitHub (Sep 10, 2024):

Ok, my current command line:

docker run -e HSA_OVERRIDE_GFX_VERSION=gfx1102 -e OLLAMA_DEBUG=true --gpus=all --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --replace --name ollama ollama/ollama:rocm

Failure log: 2024-09-10.log

I also tried with "11.0.2" instead of "gfx1102", but that also got the same result.
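
Worth noting: HSA_OVERRIDE_GFX_VERSION is parsed as a numeric major.minor.stepping triple, so a gfx name like gfx1102 isn't a valid value at all. Assuming that parsing, the numeric spellings would be:

# gfx1100 (the 7900 XTX's native target, so no override should be needed)
-e HSA_OVERRIDE_GFX_VERSION=11.0.0
# gfx1102
-e HSA_OVERRIDE_GFX_VERSION=11.0.2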


@svaningelgem commented on GitHub (Sep 10, 2024):

Ok, I used this auto-detection script:

#!/bin/bash

# Comprehensive list of GFX versions to try for AMD 7900 XTX
# Starting with GFX11 (RDNA 3) and including earlier versions for compatibility
GFX_VERSIONS=(
    # GFX11 (RDNA 3) - Primary target for 7900 XTX
    "11.0.0" "11.0.1" "11.0.2" "11.0.3" "11.1.0" "11.1.1" "11.1.2"
    # GFX10 (RDNA 2 and 1) - For potential backwards compatibility
    "10.3.0" "10.3.1" "10.3.2" "10.3.3" "10.3.4"
    "10.1.0" "10.1.1" "10.1.2"
    "10.0.0" "10.0.1" "10.0.2" "10.0.3"
    # GFX9 (Vega) - Included for extended backwards compatibility testing
    "9.0.0" "9.0.1" "9.0.2" "9.0.3" "9.0.4" "9.0.5" "9.0.6" "9.0.7" "9.0.8" "9.0.9"
    # Earlier versions included for thoroughness, though less likely to be optimal
    "8.1.0" "8.0.0" "7.0.0"
)

# Function to check if ollama is ready
check_ollama() {
  curl -s http://localhost:11434/api/version > /dev/null
  return $?
}

# Function to run ollama and test it
run_ollama() {
  # Wait for ollama to be ready
  while ! check_ollama; do
    echo "Waiting for ollama to be ready..."
    sleep 5
  done

  # Run ollama and test it
  if docker exec -it ollama ollama run llama3.1 "Hello, how are you?"; then
    echo "Ollama run successful"
    touch /tmp/ollama_success
  else
    echo "Ollama run failed"
    rm -f /tmp/ollama_success
  fi
}

for GFX_VERSION in "${GFX_VERSIONS[@]}"; do
    echo "Trying GFX version: $GFX_VERSION"

    # Stop and remove existing container if it exists
    docker stop ollama 2>/dev/null
    docker rm ollama 2>/dev/null

    # Run the Docker container
    docker run -d \
      -e HSA_OVERRIDE_GFX_VERSION=$GFX_VERSION \
      -e OLLAMA_DEBUG=true \
      --gpus=all \
      --device /dev/kfd \
      --device /dev/dri \
      -v ollama:/root/.ollama \
      -p 11434:11434 \
      --name ollama \
      ollama/ollama:rocm

    # Save the log with the version in background
    docker logs -f ollama |& tee "/var/log/ollama/${GFX_VERSION}.log" &
    LOG_PID=$!

    # Run ollama test in background
    run_ollama &
    OLLAMA_PID=$!

    # Wait for the Ollama run to complete or timeout after 5 minutes
    timeout 300 tail --pid=$OLLAMA_PID -f /dev/null

    # Check if Ollama run was successful
    if [ -f "/tmp/ollama_success" ]; then
        echo "Ollama run successful with GFX version $GFX_VERSION" >> /tmp/auto.txt
        kill $LOG_PID
        break
    else
        echo "Ollama run failed or timed out with GFX version $GFX_VERSION. Trying next version..." >> /tmp/auto.txt
        kill $LOG_PID
        kill $OLLAMA_PID 2>/dev/null
    fi
done

# Report the final result, then clean up the marker file
# (checking after removing it would always report failure)
if [ -f "/tmp/ollama_success" ]; then
    echo "Successfully found a working GFX version: $GFX_VERSION"
else
    echo "Failed to find a working GFX version."
fi
rm -f /tmp/ollama_success

But all of them failed...

What else could I try?


@TheRedCyclops commented on GitHub (Sep 10, 2024):

Have you tried with a system install? That worked for me, although it's not ideal.
PS: this is also happening on Arch Linux without SELinux or AppArmor, using a GFX version that has been verified to work in other AI applications and with a system install of ollama; relevant discord thread.
PS2: the --gpus all option only seems to be relevant for NVIDIA GPUs.


@svaningelgem commented on GitHub (Sep 10, 2024):

@Glich440: no, not yet. I'll give it a try, but that's not really what I want to do... I'd like to run it from within a container.
I'll update this comment once I've tried via a system install.


@dhiltgen commented on GitHub (Sep 10, 2024):

@svaningelgem you could try setting AMD_LOG_LEVEL to 2 or 3 and see if some more useful details emerge from ROCm. https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/debugging.html#hip-environment-variable-summary

I'm also curious if this has regressed in newer versions of Ollama, or if all ~recent versions fail in the same way.


@TheRedCyclops commented on GitHub (Sep 10, 2024):

I have also tested with the 0.3.7-rocm and 0.3.1-rocm tags; it fails in the same way.


@dhiltgen commented on GitHub (Sep 10, 2024):

@Glich440 if you can identify which version was working for your setup, that will help us isolate what changed and is causing the regression.


@TheRedCyclops commented on GitHub (Sep 10, 2024):

On the native system it seems to work with any version; I have never actually gotten the GPU to work with the docker container.


@TheRedCyclops commented on GitHub (Sep 10, 2024):

I have now tested the 0.2.1-rocm tag and I get a slightly different error message, from:
Error: llama runner process has terminated: error:Could not initialize Tensile host: No devices found
to:
Error: llama runner process has terminated: signal: aborted (core dumped) error:Could not initialize Tensile host: No devices found
but it still seems to be the same error


@TheRedCyclops commented on GitHub (Sep 10, 2024):

Wait, huge development: when I use this command docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 -e HSA_OVERRIDE_GFX_VERSION="10.3.0" --name ollama-test2 ollama/ollama:rocm it actually runs on the GPU!
Now I just don't understand why that works but this fails:

name: ollama-debugging
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama-testing
    environment:
      - HSA_OVERRIDE_GFX_VERSION="10.3.0"
      - OLLAMA_DEBUG=true
      - AMD_LOG_LEVEL=2
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    volumes:
      - ./data:/root/.ollama
    expose:
      - 11434:11434
    restart: unless-stopped
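
One plausible culprit in the compose variant above (an assumption on my part, not verified in this thread): with the list form of environment:, the quotes become part of the value, so ROCm would see literally "10.3.0" including the quote characters, whereas the shell strips them in the docker run case. Likewise, expose: only documents the port; ports: is what actually publishes it. The equivalent unquoted form would be:

    environment:
      - HSA_OVERRIDE_GFX_VERSION=10.3.0
      - OLLAMA_DEBUG=true
      - AMD_LOG_LEVEL=2
    ports:
      - 11434:11434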

@svaningelgem commented on GitHub (Sep 10, 2024):

AMD_LOG_LEVEL

I do indeed get a little bit more info:

:3:rocdevice.cpp            :468 : 0079107519 us: [pid:37    tid:0x79adf78bf340] Initializing HSA stack.
:1:rocdevice.cpp            :478 : 0079107773 us: [pid:37    tid:0x79adf78bf340] hsa_init failed with 1008
:1:runtime.cpp              :78  : 0079107777 us: [pid:37    tid:0x79adf78bf340] Runtime initialization failed
:3:hip_device_runtime.cpp   :638 : 0079107787 us: [pid:37    tid:0x79adf78bf340]  hipGetDeviceCount ( 0x7ffd06c92c9c ) 
:3:hip_device_runtime.cpp   :640 : 0079107789 us: [pid:37    tid:0x79adf78bf340] hipGetDeviceCount: Returned hipErrorNoDevice : 

rocBLAS error: Could not initialize Tensile host: No devices found

To me it doesn't seem useful, but I hope it does to you? ;-)


@svaningelgem commented on GitHub (Sep 10, 2024):

This is what is loaded inside the pod:

[root@0a507b630d4c lib]# lsmod | grep amd
edac_mce_amd           28672  0 
kvm_amd               208896  0 
kvm                  1404928  1 kvm_amd
ccp                   143360  1 kvm_amd
gpio_amdpt             16384  0 
amdgpu              19636224  11 
amddrm_ttm_helper      12288  1 amdgpu
amdttm                110592  2 amdgpu,amddrm_ttm_helper
amddrm_buddy           20480  1 amdgpu
amdxcp                 12288  1 amdgpu
drm_exec               12288  1 amdgpu
drm_suballoc_helper    16384  1 amdgpu
amd_sched              61440  1 amdgpu
amdkcl                 32768  3 amd_sched,amdttm,amdgpu
drm_display_helper    237568  1 amdgpu
video                  73728  3 asus_wmi,amdgpu,asus_nb_wmi
i2c_algo_bit           16384  1 amdgpu

I was thinking that maybe the kernel module wasn't loaded, but that seems not to be the case.


@dhiltgen commented on GitHub (Sep 10, 2024):

@svaningelgem poking around online, it seems like the kfd driver might be involved. Anything interesting on the host in sudo dmesg | grep kfd ?


@svaningelgem commented on GitHub (Sep 10, 2024):

docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 -e HSA_OVERRIDE_GFX_VERSION="10.3.0" --name ollama-test2 ollama/ollama:rocm

Sorry mate... Not working on my side :( (the only thing I removed is the "-d", to not run it in the background)

I tried the exact same command, and ... nothing, still the same error.


@svaningelgem commented on GitHub (Sep 10, 2024):

sudo dmesg | grep kfd

[    3.980681] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    3.980697] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    3.980980] kfd kfd: amdgpu: added device 1002:744c

@dhiltgen commented on GitHub (Sep 10, 2024):

  /**
   * The HSA runtime failed to allocate the necessary resources. This error
   * may also occur when the HSA runtime needs to spawn threads or create
   * internal OS-specific events.
   */
  HSA_STATUS_ERROR_OUT_OF_RESOURCES = 0x1008,

It seems like it may be a permissions issue with the kfd device. Are you attempting to run deprivileged by any chance?


@dhiltgen commented on GitHub (Sep 10, 2024):

Try adding --privileged to the docker run and see if that resolves it?


@svaningelgem commented on GitHub (Sep 10, 2024):

Adding --privileged to the docker run command didn't solve anything. Still the same.
I'm now trying with sudo on top of the --privileged command (but it seems to be copying the container to the root user, so it can take a while ;)).

The command now is:
sudo docker run -e HSA_OVERRIDE_GFX_VERSION=11.0.0 -e AMD_LOG_LEVEL=3 -e OLLAMA_DEBUG=true --privileged --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --replace --name ollama ollama/ollama:rocm


@dhiltgen commented on GitHub (Sep 10, 2024):

I've been able to reproduce the same failure mode by running on a Linux host after removing my account from the video and render groups, so it seems like this is most likely a permission problem somewhere in the docker container runtime.

llm_load_print_meta: max token length = 18
:3:rocdevice.cpp            :468 : 1550426468 us: [pid:4574  tid:0x7fedb08b8340] Initializing HSA stack.
:1:rocdevice.cpp            :478 : 1550426510 us: [pid:4574  tid:0x7fedb08b8340] hsa_init failed with 1008
:1:runtime.cpp              :78  : 1550426513 us: [pid:4574  tid:0x7fedb08b8340] Runtime initialization failed
:3:hip_device_runtime.cpp   :638 : 1550426518 us: [pid:4574  tid:0x7fedb08b8340]  hipGetDeviceCount ( 0x7ffd160d91fc )
:3:hip_device_runtime.cpp   :640 : 1550426521 us: [pid:4574  tid:0x7fedb08b8340] hipGetDeviceCount: Returned hipErrorNoDevice :

rocBLAS error: Could not initialize Tensile host: No devices found
time=2024-09-10T16:21:43.039Z level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: error:Could not initialize Tensile host: No devices found"

We used to have a check at startup on ROCm systems to verify permissions, but it seems that code was accidentally removed at some point in our GPU discovery refactoring work. I'll add a check back in so we can fail fast with a more helpful error message when permissions are not set up correctly for the ollama serve command to access the Radeon device.


@TheRedCyclops commented on GitHub (Sep 10, 2024):

I ended up using composerize and got this; it seems to work:

name: <your project name>
services:
    ollama:
        devices:
            - /dev/kfd
            - /dev/dri
        volumes:
            - ./ollama:/root/.ollama
        ports:
            - 11434:11434
        environment:
            - HSA_OVERRIDE_GFX_VERSION=10.3.0
        container_name: ollama-test2
        image: ollama/ollama:rocm

My issue is solved, hope you can solve it too @svaningelgem


@svaningelgem commented on GitHub (Sep 10, 2024):

@dhiltgen :

(base) root@LinuxPC:/var/lib/containers# adduser steven video
info: The user `steven' is already a member of `video'.
(base) root@LinuxPC:/var/lib/containers# adduser steven render
info: The user `steven' is already a member of `render'.

But indeed it seems to be pointing to a permission issue


@svaningelgem commented on GitHub (Sep 10, 2024):

Might it be an issue that on my machine I am running ROCm 6.2 and in the pod it's ROCm 6.0? I doubt it, but you never know...


@rtaic-coder commented on GitHub (Sep 10, 2024):

I am wondering if this is because of a rootless container. I get this error while running rocminfo inside the container:

docker exec -it ollama rocminfo
ROCk module version 6.8.5 is loaded
Unable to open /dev/kfd read-write: Permission denied

I tried adding the render, video, and nogroup groups to the user running ollama inside the container.


@rtaic-coder commented on GitHub (Sep 10, 2024):

@svaningelgem I thought the same thing, so I built my own ollama image locally with ROCm 6.2, but it still gives me the same error.


@svaningelgem commented on GitHub (Sep 10, 2024):

docker exec -it ollama rocminfo
ROCk module version 6.8.5 is loaded
Unable to open /dev/kfd read-write: Permission denied

I get something more:

(base) steven@LinuxPC:~$ docker exec -it ollama rocminfo
ROCk module version 6.8.5 is loaded
Unable to open /dev/kfd read-write: Permission denied
root is not member of "video" group, the default DRM access group. Users must be a member of the "video" group or another DRM access group in order for ROCm applications to run successfully.

So to me it looks like the root user INSIDE the container needs to be a member of the video group?

Tried it with:

# adduser root video
info: Adding user `root' to group `video' ...

On the host, but that didn't change anything ;)

After: usermod -aG video root, at least the warning went away:

 docker exec -it ollama rocminfo
ROCk module version 6.8.5 is loaded
Unable to open /dev/kfd read-write: Permission denied
root is member of video group

Still isn't right yet...


@rtaic-coder commented on GitHub (Sep 10, 2024):

@svaningelgem Since I am using dev-ubuntu-24.04:6.2-complete as the base of my image, and Ubuntu has nogroup as the owner of the DRM devices, my error was about nogroup. So I added the nogroup, video and render groups to the root user inside the container; now I don't get the group error, only the permission denied error.

Bottom line: it doesn't work.


@svaningelgem commented on GitHub (Sep 10, 2024):

A bit more info:

$ podman info | grep -i apparmor
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +WASM:wasmedge +YAJL
    apparmorEnabled: false

==> AppArmor is NOT enabled on my system, so that isn't interfering with anything either.


@dhiltgen commented on GitHub (Sep 10, 2024):

@svaningelgem it sounds like you're using podman. In that case try:

podman run --rm -it --device=/dev/kfd --device=/dev/dri --ipc=host ...

If that doesn't clear it up, there are some other suggestions on https://github.com/ROCm/ROCm/issues/1549 that might help you find a configuration that gets the permissions wired up correctly so the container can access /dev/kfd

Once we find the solution, I'll update our docs to include that as well.
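
For rootless podman with the crun runtime, another knob worth trying (a suggestion to verify, not a confirmed fix from this thread) is keeping the caller's supplementary groups, so host-side video/render membership survives into the container:

# crun-only: forwards the invoking user's supplementary groups into the container
podman run --rm -it --device=/dev/kfd --device=/dev/dri --group-add keep-groups ollama/ollama:rocm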


@svaningelgem commented on GitHub (Sep 11, 2024):

This is from within the pod:

[root@LinuxPC /]# id && groups
uid=0(root) gid=0(root) groups=0(root)
root
[root@LinuxPC /]# ls -la /dev/kfd
crw-rw---- 1 65534 65534 235, 0 Sep 10 19:47 /dev/kfd
[root@LinuxPC /]# chown root:video /dev/kfd
chown: changing ownership of ‘/dev/kfd’: Operation not permitted

So it looks like the kfd device has a wrong user/group assigned. Checking now how I can make the "root" user part of it.

I also tried via subgid, but that didn't work out either. I'll try a little bit with base rocm images to see if things work out first and take it from there.
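
For what it's worth, 65534 is the kernel's overflow ID: it is what typically shows up when a host UID/GID has no mapping in the container's user namespace, as in rootless setups where /etc/subuid and /etc/subgid map only a limited range. The overflow values can be confirmed on the host:

# both usually print 65534
cat /proc/sys/kernel/overflowuid /proc/sys/kernel/overflowgid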


@svaningelgem commented on GitHub (Sep 11, 2024):

Ok, update: I tried with this command:
docker run -it --device=/dev/kfd --group-add daemon rocm/pytorch:latest rocminfo

This made it output stuff (and not the permission error anymore).
When I check the device in this pytorch image, I see this:

$ docker run -it --device=/dev/kfd --group-add daemon rocm/pytorch:latest ls -l /dev/kfd
crw-rw---- 1 nobody daemon 235, 0 Sep 11 07:18 /dev/kfd

When I do the same in the ollama image, I get:

$ docker run -it --device=/dev/kfd --group-add daemon --replace --name ollama --entrypoint /bin/bash ollama/ollama:rocm -c "ls -l /dev/kfd"
crw-rw---- 1 65534 bin 235, 0 Sep 11 07:18 /dev/kfd

So to me it looks like there is something wrong with the user/group of this device inside the ollama image.
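
A generic way to discover which group a given image resolves the device to (a small sketch; it assumes the image ships GNU coreutils so stat supports -c):

# prints the group name and numeric GID that this image sees on /dev/kfd
docker run --rm --device /dev/kfd --entrypoint stat ollama/ollama:rocm -c '%G (gid %g)' /dev/kfd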


@TheRedCyclops commented on GitHub (Sep 11, 2024):

Have you tried adding root to the bin group?


@svaningelgem commented on GitHub (Sep 11, 2024):

Well, today it showed as "bin"; yesterday it was "65534"... But bin is wrong anyhow: it should be video or render, I presume?
I also noticed that the API calls to the service are colored now, so it's likely a new rocm image.

Ok, rocminfo doesn't give the error anymore, but it's only showing the CPU, not the GPU...


@TheRedCyclops commented on GitHub (Sep 11, 2024):

Have you also forwarded /dev/dri?
And what are the permissions and owner of /dev/kfd on your base system?


@svaningelgem commented on GitHub (Sep 11, 2024):

Yeah, I did notice that one as well:

So the smallest command line that works becomes:

$ docker run -it --device=/dev/kfd --device=/dev/dri --group-add daemon rocm/pytorch:latest rocminfo | grep GPU
  Uuid:                    GPU-afcb395b37c2835e               
  Device Type:             GPU     
$ docker run -it --device=/dev/kfd --device=/dev/dri --group-add daemon rocm/pytorch:latest ls -ld /dev/kfd /dev/dri /dev/dri/*
ls: cannot access '/dev/dri/by-path': No such file or directory
drwxr-xr-x  2 root   root         80 Sep 11 09:18 /dev/dri
crw-rw----+ 1 nobody daemon 226,   1 Sep 11 09:06 /dev/dri/card1
crw-rw----+ 1 nobody bin    226, 128 Sep 11 09:06 /dev/dri/renderD128
crw-rw----  1 nobody daemon 235,   0 Sep 11 09:06 /dev/kfd

In ollama:

$ docker run -it --device=/dev/kfd --device=/dev/dri --group-add bin --replace --name ollama --entrypoint /bin/bash ollama/ollama:rocm -c "ls -ld /dev/kfd /dev/dri /dev/dri/*"
drwxr-xr-x  2 root  root         80 Sep 11 09:19 /dev/dri
crw-rw----+ 1 65534 bin    226,   1 Sep 11 09:06 /dev/dri/card1
crw-rw----+ 1 65534 daemon 226, 128 Sep 11 09:06 /dev/dri/renderD128
crw-rw----  1 65534 bin    235,   0 Sep 11 09:06 /dev/kfd

on my base system:

(base) steven@LinuxPC:~$ ls -ld /dev/kfd /dev/dri /dev/dri/*
drwxr-xr-x  3 root root        100 sep 11 11:06 /dev/dri
drwxr-xr-x  2 root root         80 sep 11 11:06 /dev/dri/by-path
crw-rw----+ 1 root video  226,   1 sep 11 11:06 /dev/dri/card1
crw-rw----+ 1 root render 226, 128 sep 11 11:06 /dev/dri/renderD128
crw-rw----  1 root video  235,   0 sep 11 11:06 /dev/kfd

And we have liftoff 🚀 !!

(base) steven@LinuxPC:~$ ollama run llama3.1
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

>>> Send a message (/? for help)

Used command line:

docker run -it --replace \
	-v ollama:/root/.ollama \
	--device /dev/kfd --device /dev/dri \
	--group-add bin \
	-e AMD_LOG_LEVEL=3 -e OLLAMA_DEBUG=true \
	-p 11434:11434 \
	--name ollama \
	ollama/ollama:rocm
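
For the non-rootless case, where host and container IDs line up 1:1, one way to avoid hard-coding a group name that clearly varies between setups is to pass the device's numeric GID instead (a sketch; this won't help when a user namespace remaps the GID, as in the rootless reports above):

docker run -it --replace \
	-v ollama:/root/.ollama \
	--device /dev/kfd --device /dev/dri \
	--group-add "$(stat -c '%g' /dev/kfd)" \
	-p 11434:11434 \
	--name ollama \
	ollama/ollama:rocm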

@svaningelgem commented on GitHub (Sep 11, 2024):

Thanks @dhiltgen, but could you maybe also log the group name of the device, so you know what to pass to --group-add? You would assume video or render, but in my case it was "bin" that was necessary.

So it'd be advantageous to have this knowledge in the logs already, so you don't have to shell into the pod to find it.


@rtaic-coder commented on GitHub (Sep 12, 2024):

In my case, the owner of the kfd device is 65534:

docker run -dit --device=/dev/kfd --device=/dev/dri --group-add bin --rm --name ollama ollama/ollama:rocm
docker exec -it ollama bash
[root@4e2fa1b63144 /]# ls -ld /dev/dri/* /dev/kfd
crw-rw---- 1 65534 65534 226,   0 Sep 12 01:01 /dev/dri/card0
crw-rw---- 1 65534 65534 226, 128 Sep 12 01:01 /dev/dri/renderD128
crw-rw---- 1 65534 65534 235,   0 Sep 12 01:01 /dev/kfd

So adding the bin group doesn't really work here.


@svaningelgem commented on GitHub (Sep 12, 2024):

You can use --group-add 65534, but that didn't do anything in my case. It really needs to be an available group.

So the final fix for this would be to have the device assigned to a real group (like the rocm/pytorch image I showed in my comment).


@svaningelgem commented on GitHub (Sep 12, 2024):

@rtaic-coder : as this ticket is closed, maybe reference it in another issue to raise awareness to your case?


@dhiltgen commented on GitHub (Sep 12, 2024):

I updated the troubleshooting section for AMD GPUs here - https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#amd-gpu-discovery

My understanding was that --group-add needed to match the group of the device on the host, not inside the container. Is that not the case? Is there some mapping taking place in these deprivileged scenarios where the GID on the host differs from the GID inside the container?


@svaningelgem commented on GitHub (Sep 20, 2024):

@dhiltgen, no, the --group-add had to match the group of the device inside the container. Which is kind of logical when you think about it: the container is its own system and has certain rights to the file. The rights on the host do not really matter in that case, because the pod is separated from it.

I think the underlying issue is: why is the group name not the same for the pod and the host? (As demonstrated with the pytorch pod above, it has a fixed "daemon" group in place, whereas ollama's seems to be kind of dynamic...)
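
The name mismatch falls out of /etc/group differing between images: the device node keeps its numeric GID across the bind, and each image resolves that GID to whatever name its own /etc/group happens to assign. A quick way to compare (a sketch, assuming getent is available in both images; 44 and 993 are just typical host GIDs for video and render, substitute the numbers from ls -n on your host):

docker run --rm --entrypoint getent ollama/ollama:rocm group 44 993
docker run --rm --entrypoint getent rocm/pytorch:latest group 44 993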


@dhiltgen commented on GitHub (Sep 24, 2024):

@svaningelgem can you try using the numeric group ID instead of the name? On your host ls -n /dev/kfd /dev/dri/renderD128 should show it.

This is somewhat similar to #5986


@rtaic-coder commented on GitHub (Sep 26, 2024):

I ended up switching from rootless docker to sudo docker, since I spent countless hours trying to make it work in the rootless scenario; no matter what I did, it gave the same error in rootless. Thanks for all the help.
