[GH-ISSUE #3425] Allow override of amdgpu version check #48621

Closed
opened 2026-04-28 08:56:58 -05:00 by GiteaMirror · 12 comments

Originally created by @sebastian-philipp on GitHub (Mar 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3425

Originally assigned to: @dhiltgen on GitHub.

What are you trying to do?

HSA_OVERRIDE_GFX_VERSION="11.0.2" /usr/local/bin/ollama serve 
...
time=2024-03-31T15:09:39.342+02:00 level=WARN source=amd_linux.go:53 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"

Though I have the distribution kernel package:

 ll /sys/module/amdgpu/rhelversion 
-r--r--r--. 1 root root 4,0K 31. Mär 15:13 /sys/module/amdgpu/rhelversion

And yes, I have seen the "if we see users crash and burn with the upstreamed kernel" comment in the code. But I'd at least like to try out the distribution driver. Right now that is a nightmare, as you need to go down the rabbit hole of patching and recompiling the source code following https://github.com/ollama/ollama/blob/main/docs/development.md#linux-rocm-amd .

How should we solve this?

I'd like to see an AMDGPU_VERSION_OVERRIDE environment variable that overrides the /sys/module/amdgpu/version check.
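
For illustration only, the requested behaviour could look roughly like the shell sketch below. AMDGPU_VERSION_OVERRIDE is the variable proposed in this issue and is not known to exist in ollama; the `amdgpu_version` helper and its optional path argument are hypothetical.

```shell
# Hypothetical sketch of the proposed lookup order. Not ollama's actual code.
# $1: path of the version file (defaults to the sysfs file named in the warning).
amdgpu_version() {
    version_file="${1:-/sys/module/amdgpu/version}"
    if [ -n "${AMDGPU_VERSION_OVERRIDE:-}" ]; then
        # The proposed override wins over whatever sysfs reports.
        printf '%s\n' "$AMDGPU_VERSION_OVERRIDE"
    elif [ -r "$version_file" ]; then
        cat "$version_file"
    else
        echo "warning: amdgpu version file missing: $version_file" >&2
        return 1
    fi
}
```

With the variable set, the helper prints its value without touching sysfs; with it unset and no version file present, it warns on stderr and returns a non-zero status.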

What is the impact of not solving this?

giving up on GPU support

Anything else?

No response

GiteaMirror added the amd label 2026-04-28 08:56:58 -05:00

@sebastian-philipp commented on GitHub (Mar 31, 2024):

one step further:

$ xxd /usr/local/bin/ollama | sed 's/6f2a 2f73 7973 2f6d 6f64 756c 652f 616d/6f2a 2f74 6d70 2f6d 6f64 756c 652f 616d/' | xxd -r > ollama_tmp
$ chmod +x ollama_tmp                                                                                                                       
$ mkdir -p /tmp/module/amdgpu
$ cp /sys/module/amdgpu/rhelversion /tmp/module/amdgpu/version
$ sudo ./ollama_tmp serve
time=2024-03-31T17:18:03.234+02:00 level=INFO source=amd_linux.go:50 msg="AMD Driver: 9.99"
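
The sed expression above swaps one hex-encoded string for another of equal length. As a rough check of what is being replaced, the two patterns can be decoded as in this sketch (the `hex_to_ascii` helper is hypothetical and is not part of xxd or ollama):

```shell
# Decode a string of hex byte pairs (spaces allowed) into ASCII text.
hex_to_ascii() {
    hex=$(printf '%s' "$1" | tr -d ' ')
    while [ -n "$hex" ]; do
        rest=${hex#??}           # everything after the first two hex digits
        pair=${hex%"$rest"}      # the first two hex digits
        # Convert the pair to octal and let printf %b emit the byte.
        printf '%b' "\\0$(printf '%o' "$((0x$pair))")"
        hex=$rest
    done
    printf '\n'
}

hex_to_ascii "6f2a 2f73 7973 2f6d 6f64 756c 652f 616d"   # prints: o*/sys/module/am
hex_to_ascii "6f2a 2f74 6d70 2f6d 6f64 756c 652f 616d"   # prints: o*/tmp/module/am
```

Both strings are 16 bytes, which is one row of default xxd output. The patch presumably works only because this part of the path happens to sit inside a single row, so it is best treated as a fragile experiment rather than a general technique.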

one step further:

time=2024-03-31T17:28:17.243+02:00 level=INFO source=amd_linux.go:50 msg="AMD Driver: 9.99"
time=2024-03-31T17:28:17.243+02:00 level=INFO source=amd_linux.go:88 msg="detected amdgpu versions [gfx1103]"
time=2024-03-31T17:28:17.243+02:00 level=INFO source=amd_linux.go:246 msg="[0] amdgpu totalMemory 1024M"
time=2024-03-31T17:28:17.243+02:00 level=INFO source=amd_linux.go:247 msg="[0] amdgpu freeMemory  1024M"
time=2024-03-31T17:28:17.243+02:00 level=INFO source=llm.go:119 msg="not enough vram available, falling back to CPU only"

Looks like I'm stuck with #2637 then?


@dhiltgen commented on GitHub (Apr 1, 2024):

We don't block on failure to detect the AMD driver version, but as you point out, we do warn, since our experience has been that the ROCm library often has problems operating properly on the older upstream driver. You didn't mention what type of GPU you have. Typically what I've seen is that iGPUs report 512M of VRAM, but 1G is quite small for a discrete GPU. If you do have an integrated GPU, you're correct that we don't currently support them and are tracking that feature enhancement with #2637.
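
To see what the kernel driver itself reports for VRAM, a sketch along the following lines may help. It assumes the amdgpu `mem_info_vram_total` attribute under `/sys/class/drm/card<N>/device/`, whose availability and card index vary by system; the `bytes_to_mib` and `report_vram` helpers are hypothetical.

```shell
# Convert a byte count to whole MiB.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Print the VRAM total for every DRM card that exposes the amdgpu attribute.
# $1: sysfs DRM root (assumed default: /sys/class/drm).
report_vram() {
    drm_root="${1:-/sys/class/drm}"
    # card[0-9] matches card0..card9 only, skipping connector entries.
    for f in "$drm_root"/card[0-9]/device/mem_info_vram_total; do
        [ -r "$f" ] || continue   # glob did not match, or not an amdgpu card
        echo "$f: $(bytes_to_mib "$(cat "$f")")M"
    done
}

report_vram
```

A value around 512M or 1024M would be consistent with the integrated-GPU case described above.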


@sebastian-philipp commented on GitHub (Apr 5, 2024):

> experience has been that often the ROCm library has problems operating properly on the older distribution driver.

I trust your experience and I don't think we should use the upstream drivers by default. I'd just like to have a way to try out the upstream drivers without binary-patching ollama. Mostly because it is much easier and safer to use the distribution drivers.

> You didn't mention what type of GPU you have

Oh yes, I later discovered that my GPU (Radeon 780M) is not supported anyway, which felt like an anticlimax.


@dhiltgen commented on GitHub (Apr 5, 2024):

> Mostly, as it is much easier and safer to use the distribution drivers.

Unfortunately, what I've seen is that ROCm fails in various ways: crashes, hangs, etc. So in the context of running LLMs, the most reliable pattern seems to be to run the latest drivers from AMD.

If I'm not mistaken, your GPU is a gfx1103, which is "close" to an existing supported GPU type, so you should give the overrides a try and see if it works: https://github.com/ollama/ollama/blob/main/docs/gpu.md#overrides . Perhaps HSA_OVERRIDE_GFX_VERSION="11.0.0" might work (or 11.0.1, or 11.0.2).

Unless this is an integrated GPU.... in which case, see #2637
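
The override attempts suggested above can be listed with a small sketch like this one. The `print_override_attempts` helper is hypothetical and only prints the command lines; it does not start ollama.

```shell
# Print one command line per candidate override; nothing is executed here.
# Arguments: candidate values for HSA_OVERRIDE_GFX_VERSION.
print_override_attempts() {
    for v in "$@"; do
        printf 'HSA_OVERRIDE_GFX_VERSION="%s" ollama serve\n' "$v"
    done
}

print_override_attempts 11.0.0 11.0.1 11.0.2
```

Each printed command can then be run in turn while watching the "detected amdgpu versions" and memory lines in the server log.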


@sebastian-philipp commented on GitHub (Apr 6, 2024):

> > Mostly, as it is much easier and safer to use the distribution drivers.
>
> Unfortunately, what I've seen is that ROCm fails in various ways: crashes, hangs, etc. So in the context of running LLMs, the most reliable pattern seems to be to run the latest drivers from AMD.

Fully agree with this recommendation.

> If I'm not mistaken, your GPU is a gfx1103, which is "close" to an existing supported GPU type, so you should give the overrides a try and see if it works: https://github.com/ollama/ollama/blob/main/docs/gpu.md#overrides . Perhaps HSA_OVERRIDE_GFX_VERSION="11.0.0" might work (or 11.0.1, or 11.0.2).

Yeah, I did that after binary-patching ollama and was able to get ROCm working, but it failed with:

amdgpu totalMemory 1024M
amdgpu freeMemory  1024M
not enough vram available, falling back to CPU only

> Unless this is an integrated GPU.... in which case, see #2637

This.


@dhiltgen commented on GitHub (Apr 8, 2024):

Dup of #2637


@vorburger commented on GitHub (May 19, 2024):

> We don't block on failure to detect the AMD driver version, but as you point out, we do warn

This initially confused 🥹 the hell out of me... I didn't understand, until searching and finding and reading this, that it can work despite this, and that the FOLLOWING message about the missing libraries was the real issue to get it working, and that this warning was sort of a "red herring".

I've posted about this on https://github.com/vorburger/vorburger.ch-Notes/blob/develop/ml/ollama1.md

So, for some reason, on Fedora 40, there is no /sys/module/amdgpu/version, but there is a /sys/module/amdgpu/rhelversion (and it contains "9.99", on Fedora's 6.8.9-300.fc40.x86_64 Kernel). Would it make sense to have Ollama check for that, or is that pointless?
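
As a user-side diagnostic, the two files mentioned above can be probed with a sketch like the following. This is not a description of what ollama does, and whether `rhelversion` is meaningful for this purpose is exactly the open question here; the `find_amdgpu_version_file` helper is hypothetical.

```shell
# Report which amdgpu version file, if any, exists under a module directory.
# $1: module directory (assumed default: /sys/module/amdgpu).
find_amdgpu_version_file() {
    module_dir="${1:-/sys/module/amdgpu}"
    # Prefer "version"; fall back to the "rhelversion" file seen on Fedora.
    for name in version rhelversion; do
        if [ -r "$module_dir/$name" ]; then
            echo "$module_dir/$name: $(cat "$module_dir/$name")"
            return 0
        fi
    done
    echo "no version file under $module_dir" >&2
    return 1
}

find_amdgpu_version_file || true
```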


@dhiltgen commented on GitHub (May 20, 2024):

@vorburger that doesn't look like an AMD downstream version number. I would expect a 6.x.y, assuming you're running a ROCm v6 based setup, so I think that's most likely the upstream driver bundled with the Linux kernel, not the AMD downstream driver. Our experience has been that ROCm can be a bit finicky with the upstream kernel driver, so we recommend running the downstream AMD driver, hence this warning.


@sebastian-philipp commented on GitHub (Oct 17, 2024):

With latest kernel 6.11.3-200.fc40.x86_64, the /sys/module/amdgpu/rhelversion file is now gone as well.

$ LANG=C ll /sys/module/amdgpu/     
total 0
-r--r--r--. 1 root root 4.0K Oct 17 17:32 coresize
drwxr-xr-x. 2 root root    0 Oct 17 17:32 drivers
drwxr-xr-x. 2 root root    0 Oct 17 17:32 holders
-r--r--r--. 1 root root 4.0K Oct 17 17:32 initsize
-r--r--r--. 1 root root 4.0K Oct 17 17:32 initstate
drwxr-xr-x. 2 root root    0 Oct 17 17:32 notes
drwxr-xr-x. 2 root root    0 Oct 17 17:32 parameters
-r--r--r--. 1 root root 4.0K Oct 17 17:32 refcnt
drwxr-xr-x. 2 root root    0 Oct 17 17:32 sections
-r--r--r--. 1 root root 4.0K Oct 17 17:32 taint
--w-------. 1 root root 4.0K Oct 17 17:32 uevent

@dhiltgen commented on GitHub (Oct 17, 2024):

As mentioned above, we never hard-fail on a missing amdgpu version file in sysfs, we simply warn, because the upstream kernel driver is often much older and sometimes buggier than the downstream amdgpu driver, so we're trying to encourage users to install the downstream driver if possible. Unfortunately the distro coverage is limited, so this isn't always an option.

@sebastian-philipp if you're having problems (doesn't detect the GPU, crashes, etc.) please open a new issue with server logs so we can investigate.


@alphaonex86 commented on GitHub (Nov 17, 2024):

Same problem here: ROCm is working perfectly but is not used due to the lack of /sys/module/amdgpu/version.


@dhiltgen commented on GitHub (Nov 18, 2024):

> Same problem here, rocm is perfectly working but not used due to lack of /sys/module/amdgpu/version

This check only warns that you may be using an old driver and might hit runtime bugs as a result, since ROCm tends to work better on the latest AMD driver. It does not prevent loading. If you aren't seeing the GPU being used, there's something else going on. Please open a new issue and include server logs so we can investigate.


Reference: github-starred/ollama#48621