[GH-ISSUE #15601] Vulkan/AMD performance: vendored llama.cpp (b7437, Dec 2025) missing Wave32 FA (#19625) and graphics queue (#20551) — ~56% t/s gap vs standalone llama.cpp #35718

Open
opened 2026-04-22 20:24:04 -05:00 by GiteaMirror · 10 comments

Originally created by @sagar-kale on GitHub (Apr 15, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15601

## Summary

Ollama's vendored llama.cpp is currently at b7437 (Dec 16, 2025). Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama:

| PR | Description | Merged into llama.cpp |
|---|---|---|
| [ggml-org/llama.cpp#19625](https://github.com/ggml-org/llama.cpp/pull/19625) | Vulkan: scalar flash attention refactor + Wave32 on AMD | Feb 24, 2026 |
| [ggml-org/llama.cpp#20551](https://github.com/ggml-org/llama.cpp/pull/20551) | Vulkan: use graphics queue on AMD | Mar 15, 2026 |

## Measured Impact

Benchmarked on the same hardware, same model, same flags (`-ngl 99 -fa 1 --no-mmap`):

| Setup | gemma4:26b Q4_K_XL tg128 | gemma4:e4b Q4_K_XL tg128 |
|---|---|---|
| Ollama v0.20.5 (llama.cpp b7437) | ~34 t/s | ~34 t/s |
| llama.cpp b8765 (has both PRs) | **52.3 t/s** | **56.2 t/s** |
| Windows LM Studio (same hardware) | ~56 t/s | ~56 t/s |

That's a ~56% throughput improvement from two Vulkan-specific commits that Ollama simply hasn't vendored yet. Standalone llama.cpp b8765 on Linux/Vulkan is now at parity with Windows LM Studio on the same machine. This is not a hardware/driver issue — the gap disappears entirely when running standalone llama.cpp.

### Token speed vs context depth (llama.cpp b8765, tg128)

For reference, full context-depth profile on this hardware:

| Context depth | gemma4:26b | gemma4:e4b |
|---|---|---|
| d0 (fresh) | 52.3 t/s | 56.2 t/s |
| d8k | 45.6 t/s | ~50 t/s |
| d32k | 40.1 t/s | 42.5 t/s |
| d64k | 35.1 t/s | 35.0 t/s |
| d128k | 17.0 t/s | 26.1 t/s |

With Ollama (b7437) you're stuck at ~34 t/s even at d0 — below what standalone llama.cpp delivers at d64k.
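
The depth sweep above can be reproduced with llama-bench's depth flag — a sketch, assuming a build recent enough to support `-d` (context depth):

```bash
# Sweep context depth with llama-bench; assumes the build supports -d.
for d in 0 8192 32768 65536 131072; do
  VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
    llama-bench -m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
      -ngl 99 -mmp 0 -fa 1 -p 0 -n 128 -d "$d"
done
```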

## Current Workaround

Due to this gap, I switched from Ollama to **[llama-swap](https://github.com/mostlygeek/llama-swap) + llama.cpp built from source**. llama-swap is a lightweight proxy that hot-swaps llama-server instances on a single port, making it a drop-in Ollama replacement (same port 11434, OpenAI-compatible API).

Setup:

```bash
# Build llama.cpp from source with Vulkan
git clone https://github.com/ggml-org/llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run llama-server directly
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  ./build/bin/llama-server \
  --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --port 11434 -ngl 99 -fa on --no-mmap
```

This works, but it's a significant workaround — model management, multi-model serving, and automatic updates all have to be handled manually. Ollama fixing this would make the workaround unnecessary.
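
For completeness, the llama-swap side looks roughly like the sketch below. This is a hedged reconstruction, not the actual config used: the `models`/`cmd` keys and the `${PORT}` macro follow llama-swap's README at the time of writing, so check the repo for the current schema.

```bash
# Sketch of a minimal llama-swap setup (reconstructed, not the author's
# config). Key names follow llama-swap's README; verify against the repo.
cat > llama-swap.yaml <<'EOF'
models:
  "gemma4:26b":
    cmd: >
      llama-server --port ${PORT}
      --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
      -ngl 99 -fa on --no-mmap
EOF

# Listen on Ollama's usual port so existing clients keep working
llama-swap --config llama-swap.yaml --listen :11434
```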

## System

| | |
|---|---|
| **Hardware** | Minisforum MS-S1 Max (AMD Ryzen AI MAX+ 395 / Radeon 8060S, Strix Halo) |
| **GPU arch** | gfx1151, 128 GB unified memory (iGPU shares system RAM) |
| **OS** | Ubuntu 24.04.4 LTS |
| **Kernel** | 6.19.11 |
| **Vulkan driver** | RADV (Mesa 25.2.8), `radeon_icd.json` |
| **Ollama version** | v0.20.5 (llama.cpp b7437, Dec 16 2025) |
| **Standalone llama.cpp** | b8765 (Apr 2026), built from source with Vulkan |

## Steps to Reproduce

```bash
# With Ollama (v0.20.5)
ollama run gemma4:26b
# observe ~34 t/s in generation

# With standalone llama.cpp b8765 (same model, same quant, same hardware)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  llama-bench -m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 99 -mmp 0 -fa 1 -p 0 -n 128
# observe 52+ t/s
```

## Request

Please update the vendored llama.cpp to a commit that includes both PRs (any commit ≥ b8500 / after Mar 15, 2026 should include both). The ROCm 7.2.1 update in v0.20.7 is appreciated, but the Vulkan path (which is what iGPU/APU users rely on — ROCm doesn't support Strix Halo yet) is still stuck on December code.

Users with AMD APUs (Strix Halo, Phoenix, Hawk Point) running Vulkan are leaving ~56% performance on the table compared to what's already available in upstream llama.cpp.


@sagar-kale commented on GitHub (Apr 15, 2026):

## Bump test: ec98e2002 → b8797 (Apr 15, 2026)

Tested what it takes to vendor-bump llama.cpp to include both target PRs. Using the `Makefile.sync` / `apply-patches` mechanism against the latest llama.cpp HEAD (`8dc530b86`, Apr 15 2026):

**18 of 36 patches applied cleanly. 18 failed.**

### Failed patches

| Patch | Failure type | Files |
|---|---|---|
| 0001-ggml-backend-malloc | Content conflict | `ggml-sycl.cpp` |
| 0003-clip-unicode | Content conflict | `tools/mtmd/clip.cpp` |
| 0004-solar-pro | Content conflict | `llama-model.h`, `llama-arch.h/cpp` |
| 0005-fix-deepseek-deseret-regex | Content conflict | `unicode.cpp` |
| 0009-remove-amx | Content conflict | `ggml/CMakeLists.txt` |
| 0015-ggml-Export-GPU-UUIDs | sha1 error | `ggml-backend.h` |
| 0018-ggml-Add-batch-size-hint | sha1 error | `ggml-backend.h` |
| 0020-ggml-No-alloc-mode | sha1 error | `ggml-backend.h` |
| 0021-decode-disable-output_all | Content conflict | `llama-context.cpp` |
| 0022-ggml-Enable-resetting-backend-devices | sha1 error | `ggml-backend.h` |
| 0024-GPU-discovery-enhancements | sha1 error | `ggml-backend.h` |
| 0025-NVML-fallback-unified-memory | sha1 error | `mem_nvml.cpp` (moved?) |
| 0026-report-LoadLibrary-failures | sha1 error | `ggml-backend-reg.cpp` |
| 0027-interleave-multi-rope | Content conflict | `rope_funcs.glsl`, `rope.cu` |
| 0028-Add-memory-detection-DXGI-PDH | sha1 error | `ggml/CMakeLists.txt` |
| 0032-ggml-enable-MLA-flash-attention | Content conflict | `ggml-metal-device.m`, `fattn*` CUDA |
| 0033-ggml-metal-solve_tri | Content conflict | `ggml-metal.metal`, `ggml-metal-device.m` |
| 0036-backport-kernels-gemma4 | Content conflict | `ggml-metal.metal`, `fattn.cu`, `fattn-mma-f16.cuh` |

### Failure pattern

**8 "sha1 lacking" errors** all cluster around `ggml/include/ggml-backend.h` — this file has been substantially rewritten upstream, so git can't construct a 3-way merge base for any of the Ollama patches that touch it. These would need manual re-implementation.

**10 content conflicts** are more tractable — the surrounding context shifted, but the files exist and the intent of each patch is clear.
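
For anyone re-running this triage, the apply step can be approximated with a plain `git am` loop — a sketch, assuming Ollama's patches live in `llama/patches/` (true at the time of writing; the real `Makefile.sync` targets do more):

```bash
# Approximate the vendor-bump patch test: apply each Ollama patch with a
# 3-way merge against the candidate llama.cpp commit, logging failures.
cd llama.cpp && git checkout 8dc530b86
for p in ../ollama/llama/patches/*.patch; do
  if git am -3 "$p" >/dev/null 2>&1; then
    echo "OK   $(basename "$p")"
  else
    echo "FAIL $(basename "$p")"
    git am --abort
  fi
done
```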


Posting this for maintainer awareness — the bump is doable but non-trivial given ggml-backend.h churn. Happy to test any candidate bump commit on gfx1151 hardware once it's ready.


@rick-github commented on GitHub (Apr 15, 2026):

While waiting for the [vendor sync](https://github.com/ollama/ollama/pull/14864), try ROCm.

| Setup | gemma4:26b-a4b-it-q4_K_M | gemma4:e4b-it-q4_K_M |
|---|---|---|
| ollama (ROCm) | 51.85 t/s | 53.18 t/s |

| | |
|---|---|
| **Hardware** | NucBox_EVO-X2 AMD RYZEN AI MAX+ 395 w/ Radeon 8060S |
| **GPU arch** | gfx1151 |
| **GTT/VRAM** | 15864M/98304M |
| **Available (ROCm)** | 111.5 GiB |
| **OS** | Linux Mint 22.3 |
| **Kernel** | 6.11.0-29-generic |
| **ROCm driver** | 7.2.1 |
| **Ollama version** | 0.20.5-rocm |

@chejh-amd commented on GitHub (Apr 16, 2026):

Great benchmarking work — the ~56% gap you measured lines up with what we'd expect from [ggml-org/llama.cpp#19625](https://github.com/ggml-org/llama.cpp/pull/19625) (Wave32 FA) and [ggml-org/llama.cpp#20551](https://github.com/ggml-org/llama.cpp/pull/20551) (graphics queue) being absent in the vendored build.

As rick-github noted, the ROCm path on Ollama 0.20.5 already gets you into that 51–53 t/s range on gfx1151 since it's not gated by the vendor sync. If you'd like to try it: Ollama ships a `-rocm` variant, and ROCm 7.2.1 supports Strix Halo (gfx1151).


@sagar-kale commented on GitHub (Apr 16, 2026):

Thanks @chejh-amd and @rick-github — I gave ROCm a proper go but kept hitting a wall. Here's what I tried:

**Attempt 1 — native `ollama-linux-amd64-rocm` tarball (v0.20.7)**

Downloaded the full ROCm-specific build (~944 MB, extracts ~2.5 GB of ROCm libs into `/usr/local/lib/ollama/rocm/`), started Ollama on a separate port so it wouldn't conflict with my existing setup. The GPU gets detected fine (`Radeon 8060S Graphics, compute=gfx1151`) but then hangs for exactly 30 seconds during `GGML_CUDA_INIT` and times out:

```
failure during GPU discovery ... error="failed to finish discovery before timeout"
inference compute id=cpu library=cpu
```

**Attempt 2 — Docker (`ollama/ollama:rocm`)**

Tried the Docker image thinking maybe it was a library mismatch on the host. Passed through `/dev/kfd` and `/dev/dri`, added the right group permissions. This one actually fails faster — it crashes immediately instead of timing out — but dmesg tells the same story:

```
amdgpu: [gfxhub] page fault ... Process ollama ...
GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932
  Faulty UTCL2 client ID: CPF (0x4)
  WALKER_ERROR: 0x1 / PERMISSION_FAULTS: 0x3 / MAPPING_ERROR: 0x1
```

So it's not a library issue — Docker shares the host kernel, so it hits the exact same thing.
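
For anyone retracing this, the passthrough invocation was along these lines — a reconstruction, not the exact command used; the device and group flags are the ones Ollama's AMD Docker docs recommend:

```bash
# Reconstructed sketch of the Docker attempt (not the exact command used).
docker run -d --name ollama-rocm \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:rocm
```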

My best guess is it's the kernel. I'm on 6.19.11, rick-github is on 6.11. I still have 6.17 kernels installed so I'll boot into that this weekend and try again — should at least tell us whether it's a 6.19 regression.

Before I do that, any ideas on what might be causing this on newer kernels? Any HSA/ROCm flags worth trying, or is this a known amdgpu issue with gfx1151 SVM on 6.19? Happy to test anything — I've got the hardware and a bit of time this weekend.


@rick-github commented on GitHub (Apr 16, 2026):

https://github.com/ollama/ollama/issues/15420#issuecomment-4208015418


@chejh-amd commented on GitHub (Apr 16, 2026):

The kernel version theory is very likely right. There have been multiple reports of GCVM_L2_PROTECTION_FAULT / page faults on gfx1151 with kernels 6.18.4+ and 6.19.x — the amdgpu driver in those kernels introduced changes that require a matching ROCm version to work correctly.

The short version: kernel 6.17.7 has been reported to work with ROCm on Strix Halo. Kernel 6.17.9 and later (including your 6.19.11) appear to need either ROCm nightlies or an upcoming ROCm release that aligns with the new kernel-side changes. So booting into your 6.17 kernel this weekend is the right experiment.

A few things worth checking when you do:

- Confirm your `linux-firmware` package version — some firmware builds (e.g. 20251125) are also known to break ROCm regardless of kernel.
- If 6.17 works, `uname -r` + `apt list --installed linux-firmware` would be useful data to post back here.


@sagar-kale commented on GitHub (Apr 16, 2026):

Following up on the weekend testing I promised — ended up going considerably deeper than planned.

**tl;dr:** The GCVM_L2 page fault is not a kernel-version regression. It's a fundamental KFD compute mapping issue present across all Ubuntu kernels I tested. ROCm init actually works with the right userspace, but GPU compute dispatch fails regardless.


**Kernel 6.17 test**

Booted into 6.17.0-20-generic as planned. Same `[gfxhub] page fault` at ROCm init — so it's not a 6.19 regression, it affects 6.17 too. That ruled out the kernel-version theory pretty quickly.

**linux-firmware check**

Ubuntu's package is `20240318.git3b128b60`. Came across a Framework community thread saying `linux-firmware 20251125` broke ROCm on gfx1151 and `20260309` restored it. Manually pulled the GC 11.5.1 compute blobs (`gc_11_5_1_me`, `mec`, `mes1`, `mes_2`, `pfp`) from upstream linux-firmware HEAD and rebuilt the initramfs. Reran the Docker ROCm test — still the exact same page fault. Firmware wasn't the issue.
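
The firmware swap was roughly the following — a sketch, assuming the standard linux-firmware tree layout and the blob names listed above:

```bash
# Pull GC 11.5.1 compute blobs from upstream linux-firmware HEAD and
# rebuild the initramfs (sketch; blob names as listed above).
git clone --depth 1 \
  https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
sudo cp linux-firmware/amdgpu/gc_11_5_1_{me,mec,mes1,mes_2,pfp}.bin \
  /lib/firmware/amdgpu/
sudo update-initramfs -u
```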

**TheRock 7.13 nightly (the interesting one)**

Tried a different angle: installed AMD's TheRock nightly ROCm build (7.13.0a20260416, gfx1151-specific wheels via `pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ "rocm[libraries,devel]"`).

Running `rocminfo` with these newer userspace libs gave the first clean GPU detection I'd seen:

```
Agent 2: gfx1151 — Radeon 8060S Graphics
  Compute Unit: 40
  Wavefront Size: 32
  Pool 1: ~63 GB GLOBAL COARSE GRAINED
  ISA: amdgcn-amd-amdhsa--gfx1151
```

No dmesg errors at all. So TheRock 7.13 fixed the HSA init page fault that was happening with ROCm 7.2.x.

Built llama.cpp against those libs with `GGML_HIP_NO_VMM=ON` and `GPU_TARGETS=gfx1151` (flags from a known-good Proxmox/gfx1151 report). Device detection worked:

```
ggml_cuda_init: found 1 ROCm devices
Device 0: Radeon 8060S Graphics, gfx1151, VMM: no, VRAM: 63717 MiB
```

But the moment any GPU kernel is dispatched — even a trivial 1024-element float add — it hangs and logs:

```
amdgpu: [gfxhub] page fault
GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932
  Faulty UTCL2 client ID: CPF (0x4)
  WALKER_ERROR: 0x1 / PERMISSION_FAULTS: 0x3 / MAPPING_ERROR: 0x1
```

CPF (Command Processor Fetch) can't access the virtual address the kernel placed the compute commands at. The address is in host user-space (~0x74c75c159000), consistent with a unified memory allocation that isn't getting pre-mapped into the GPU page tables before dispatch.
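
For reference, the build step was along these lines — a sketch, since the exact invocation wasn't posted; `GGML_HIP`, `GGML_HIP_NO_VMM`, and `GPU_TARGETS` are llama.cpp's documented HIP options, and the `hipconfig` locations depend on where TheRock installs its toolchain:

```bash
# Reconstructed HIP build (exact invocation not posted in the thread).
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -B build-hip \
    -DGGML_HIP=ON \
    -DGGML_HIP_NO_VMM=ON \
    -DGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip --config Release -j"$(nproc)"
```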

**HSA_XNACK=1**

Tried this hoping the page fault would become retryable. `rocminfo` reports `XNACK enabled: NO` regardless — the kernel/firmware isn't exposing XNACK capability, so there's no retry path.

**Ubuntu OEM kernel (6.17.0-1017-oem)**

Also tried this on the theory it might carry different KFD patches. Same compute page fault, same result.


**Where things stand**

The init fault is fixed in TheRock 7.13 (ROCr VGPR count fix for gfx1151, landed Dec 2024). The compute fault is a separate, lower-level issue — the GPU page table walker can't resolve host virtual addresses during compute dispatch. This is a KFD/amdgpu kernel driver issue. The one configuration reported to work (llama.cpp discussion [#20856](https://github.com/ggml-org/llama.cpp/discussions/20856)) uses a Proxmox 6.19.2-1-pve kernel, which presumably carries a patch Ubuntu's kernels don't.

For now I'm sitting on llama-swap + Vulkan at ~52 t/s while this works itself out upstream. Happy to test anything specific if it'd be useful data for AMD.


@Znuff commented on GitHub (Apr 17, 2026):

@sagar-kale as per the previous comment, and AMD Documentation, you can now get it working on:

```
ii  linux-image-6.17.0-20-generic            6.17.0-20.20~24.04.1
ii  linux-image-generic-hwe-24.04-edge       6.17.0-20.20~24.04.1
```

The only catch, as detailed in the previous ticket mentioned, is that you have to use the 31.10+ `amdgpu` kernel driver, i.e.:

```
# cat amdgpu.list
deb [arch=amd64,i386 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/31.10/ubuntu noble main
```

This finally fixes the `GCVM_L2_PROTECTION_FAULT_STATUS` issue that was also mentioned in #13589.

For some reason, the AMD documentation only mentions installing the 30.30.x driver, which produces the `GCVM_L2_PROTECTION_FAULT_STATUS` issue on Ubuntu 24.04 (even with the latest HWE kernel).

*I assume this won't be an issue in 26.04 LTS.*


@sagar-kale commented on GitHub (Apr 17, 2026):

@Znuff @rick-github @chejh-amd — that tip about amdgpu 31.10 was the missing piece. Thank you, genuinely — I'd been going in circles for a while and that comment saved me from going further down dead ends.

Here's a full writeup of everything I tested after your replies, in case it's useful for others landing on this issue.


## System

| | |
|---|---|
| **Machine** | Minisforum MS-S1 Max |
| **CPU/APU** | AMD Ryzen AI MAX+ 395 (Strix Halo) |
| **GPU** | Integrated Radeon 8060S (gfx1151), 40 CUs |
| **RAM** | 128 GB unified memory |
| **GPU memory available** | ~116 GiB (via GTT pool, `amdgpu.gttsize=117760`) |
| **OS** | Ubuntu 24.04 LTS |
| **Default kernel** | 6.17.0-20-generic (HWE edge) |
| **amdgpu-dkms** | 31.10 (`1:6.18.4.31100000`) |
| **Ollama** | 0.20.7 native |

## What fixed it — amdgpu-dkms 31.10

Exactly as @Znuff described. Swapped the AMD repo from `30.30.1` → `31.10`, built the DKMS module for `6.17.0-20-generic`, rebooted. The `GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932` CPF fault was completely gone.

```bash
# /etc/apt/sources.list.d/amdgpu.list
deb [arch=amd64,i386 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/31.10/ubuntu noble main

sudo apt-get install amdgpu-dkms amdgpu-dkms-firmware
sudo dkms build amdgpu/6.18.4-2286447.24.04 -k 6.17.0-20-generic
sudo dkms install amdgpu/6.18.4-2286447.24.04 -k 6.17.0-20-generic
```
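
To confirm the module actually built and loaded against the running kernel (standard DKMS checks, not from the original steps):

```bash
# Verify the DKMS module is installed for the running kernel
dkms status amdgpu
# Confirm the loaded driver is the DKMS build, not the in-tree module
modinfo amdgpu | grep -E '^(filename|version)'
```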

Native Ollama service config (`/etc/systemd/system/ollama.service.d/override.conf` — note the `[Service]` header, which systemd drop-ins require):

```
[Service]
Environment="OLLAMA_VULKAN=0"
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
```

## Speed results — all models, each tested in isolation

Each model was tested with a fresh Ollama restart (single model in VRAM at a time). 128 token generation, flash attention enabled.

| Model | Size | t/s |
|---|---|---|
| gemma4:e4b Q4_K | 9.6 GB | **55.2 t/s** |
| gemma4:26b Q4_K | 17 GB | **54.4 t/s** |
| qwen3.5:9b Q4_K | 6.6 GB | **32.7 t/s** |
| gpt-oss:120b Q4_K | 65 GB | **26.6 t/s** |
| qwen2.5:32b Q4_K | 19 GB | **11.0 t/s** |
| gemma3:27b Q4_K | 17 GB | **11.8 t/s** |
| qwen3.5:122b Q4_K | 81 GB | **9.3 t/s** |

The 122b and 120b models spill some layers to CPU (Ollama uses partial GPU offload at that size), which explains the lower t/s. Everything else is fully on GPU.
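
To see how much of a model spilled to CPU, `ollama ps` reports the split (output below is illustrative, not from this thread):

```bash
ollama ps
# NAME            SIZE    PROCESSOR          UNTIL        (illustrative)
# qwen3.5:122b    81 GB   23%/77% CPU/GPU    4 minutes from now
```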

**GPU activity during inference** (monitored via sysfs during gemma4:26b): peaked at **95% busy**. VRAM sysfs shows ~0.3 GiB because on a unified-memory APU the model lives in the GTT pool, not the frame buffer — this is expected and doesn't indicate CPU fallback.
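
The sysfs counters referenced are the standard amdgpu ones; a polling loop looks like this (card index may differ per system):

```bash
# Poll amdgpu utilization and memory pools via sysfs (card index may vary)
D=/sys/class/drm/card0/device
while sleep 1; do
  printf 'busy %3s%%  vram %6d MiB  gtt %6d MiB\n' \
    "$(cat "$D/gpu_busy_percent")" \
    "$(( $(cat "$D/mem_info_vram_used") / 1048576 ))" \
    "$(( $(cat "$D/mem_info_gtt_used") / 1048576 ))"
done
```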

**One note on stacked vs isolated:** if you run multiple models sequentially without restarting Ollama (with `KEEP_ALIVE=-1`), previous models stay in VRAM. The 122b/120b models will fail allocation if a 17–26 GB model is still loaded alongside them. Isolated runs are the clean baseline.


## What didn't work (for the record)

- **30.30 amdgpu DKMS + any kernel**: `GCVM_L2_PROTECTION_FAULT_STATUS` CPF fault on every compute dispatch
- **TheRock 7.13 nightly + 30.30**: GPU init works (no HSA page fault), compute still faults — the VGPR fix in TheRock doesn't help without the driver fix
- **Ubuntu OEM kernel 6.17.0-1017-oem**: same compute fault with the 30.30 driver
- **linux-firmware GC 11.5.1 blobs from upstream HEAD**: no effect on the compute fault
- **HSA_XNACK=1**: `rocminfo` still reports `XNACK enabled: NO` — firmware/kernel not exposing XNACK, so no retry path
- **Ollama ROCm Docker (0.20.5, 0.21.0)**: works fine once the host has amdgpu 31.10, but ~15% slower than the native install (46 t/s vs 54 t/s on gemma4:26b) — I suspect the bundled ROCm libs differ from what the native install links against

## Questions for anyone who knows

**1. Will these speeds improve when Ollama bumps its llama.cpp vendor?**

Currently Ollama vendors llama.cpp at b7437 (Dec 2025). Two PRs landed after that which gave a big Vulkan boost on gfx1151:

- [ggml-org/llama.cpp#19625](https://github.com/ggml-org/llama.cpp/pull/19625) — Wave32 flash attention (Feb 2026)
- [ggml-org/llama.cpp#20551](https://github.com/ggml-org/llama.cpp/pull/20551) — graphics queue on AMD Vulkan (Mar 2026)

Are there equivalent ROCm/HIP improvements in newer llama.cpp that haven't made it into Ollama yet? Or is the ROCm path already pulling from a more current snapshot?

**2. Is 54 t/s on gemma4:26b about what you'd expect for this hardware on ROCm, or should it be higher?**

@rick-github's numbers on the NucBox EVO-X2 (same gfx1151) were 51.85 t/s on Ollama 0.20.5-rocm, so we're in the same range. But running the same model through llama.cpp b8765 with `GGML_HIP_NO_VMM=ON` only got 48.6 t/s — native Ollama ROCm beat our manual build. Curious whether there's an obvious flag or build option that'd push it further, or whether 54–55 t/s is the ceiling for this chip on a 26B Q4 model.


@PureBlissAK commented on GitHub (Apr 18, 2026):

## 🤖 Automated Triage & Analysis Report

**Issue**: #15601
**Analyzed**: 2026-04-18T18:19:47.508770

### Analysis

- **Type**: unknown
- **Severity**: medium
- **Components**: unknown

### Implementation Plan

- **Effort**: medium
- **Steps**:

*This issue has been triaged and marked for implementation.*

Reference: github-starred/ollama#35718