[GH-ISSUE #15601] Vulkan/AMD performance: vendored llama.cpp (b7437, Dec 2025) missing Wave32 FA (#19625) and graphics queue (#20551) — ~56% t/s gap vs standalone llama.cpp #35718

Open
opened 2026-04-22 20:24:04 -05:00 by GiteaMirror · 10 comments

Originally created by @sagar-kale on GitHub (Apr 15, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15601

## Summary

Ollama's vendored llama.cpp is currently at b7437 (Dec 16, 2025). Two significant Vulkan/AMD performance PRs landed in llama.cpp after that and have not yet been picked up by Ollama:

| PR | Description | Merged into llama.cpp |
|---|---|---|
| [ggml-org/llama.cpp#19625](https://github.com/ggml-org/llama.cpp/pull/19625) | Vulkan: scalar flash attention refactor + Wave32 on AMD | Feb 24, 2026 |
| [ggml-org/llama.cpp#20551](https://github.com/ggml-org/llama.cpp/pull/20551) | Vulkan: use graphics queue on AMD | Mar 15, 2026 |

## Measured Impact

Benchmarked on the same hardware, same model, same flags (`-ngl 99 -fa 1 --no-mmap`):

| Setup | gemma4:26b Q4_K_XL tg128 | gemma4:e4b Q4_K_XL tg128 |
|---|---|---|
| Ollama v0.20.5 (llama.cpp b7437) | ~34 t/s | ~34 t/s |
| llama.cpp b8765 (has both PRs) | **52.3 t/s** | **56.2 t/s** |
| Windows LM Studio (same hardware) | ~56 t/s | ~56 t/s |

That's a ~56% throughput improvement from two Vulkan-specific commits that Ollama simply hasn't vendored yet. Standalone llama.cpp b8765 on Linux/Vulkan is now at parity with Windows LM Studio on the same machine. This is not a hardware/driver issue — the gap disappears entirely when running standalone llama.cpp.

### Token speed vs context depth (llama.cpp b8765, tg128)

For reference, full context-depth profile on this hardware:

| Context depth | gemma4:26b | gemma4:e4b |
|---|---|---|
| d0 (fresh) | 52.3 t/s | 56.2 t/s |
| d8k | 45.6 t/s | ~50 t/s |
| d32k | 40.1 t/s | 42.5 t/s |
| d64k | 35.1 t/s | 35.0 t/s |
| d128k | 17.0 t/s | 26.1 t/s |

With Ollama (b7437) you're stuck at ~34 t/s even at d0 — below what standalone llama.cpp delivers at d64k.
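
The depth sweep above can be reproduced with llama-bench's depth flag — a sketch, assuming a build recent enough to support `-d` (context depth):

```bash
# Sweep context depth with llama-bench; assumes the build supports -d.
for d in 0 8192 32768 65536 131072; do
  VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
    llama-bench -m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
      -ngl 99 -mmp 0 -fa 1 -p 0 -n 128 -d "$d"
done
```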

## Current Workaround

Due to this gap, I switched from Ollama to **[llama-swap](https://github.com/mostlygeek/llama-swap) + llama.cpp built from source**. llama-swap is a lightweight proxy that hot-swaps llama-server instances on a single port, making it a drop-in Ollama replacement (same port 11434, OpenAI-compatible API).

Setup:

```bash
# Build llama.cpp from source with Vulkan
git clone https://github.com/ggml-org/llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# Run llama-server directly
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  ./build/bin/llama-server \
  --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --port 11434 -ngl 99 -fa on --no-mmap
```

This works, but it's a significant workaround — model management, multi-model serving, and automatic updates all have to be handled manually. Ollama fixing this would make the workaround unnecessary.
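
For completeness, the llama-swap side looks roughly like the sketch below. This is a hedged reconstruction, not the actual config used: the `models`/`cmd` keys and the `${PORT}` macro follow llama-swap's README at the time of writing, so check the repo for the current schema.

```bash
# Sketch of a minimal llama-swap setup (reconstructed, not the author's
# config). Key names follow llama-swap's README; verify against the repo.
cat > llama-swap.yaml <<'EOF'
models:
  "gemma4:26b":
    cmd: >
      llama-server --port ${PORT}
      --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
      -ngl 99 -fa on --no-mmap
EOF

# Listen on Ollama's usual port so existing clients keep working
llama-swap --config llama-swap.yaml --listen :11434
```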

## System

| | |
|---|---|
| **Hardware** | Minisforum MS-S1 Max (AMD Ryzen AI MAX+ 395 / Radeon 8060S, Strix Halo) |
| **GPU arch** | gfx1151, 128 GB unified memory (iGPU shares system RAM) |
| **OS** | Ubuntu 24.04.4 LTS |
| **Kernel** | 6.19.11 |
| **Vulkan driver** | RADV (Mesa 25.2.8), `radeon_icd.json` |
| **Ollama version** | v0.20.5 (llama.cpp b7437, Dec 16 2025) |
| **Standalone llama.cpp** | b8765 (Apr 2026), built from source with Vulkan |

## Steps to Reproduce

```bash
# With Ollama (v0.20.5)
ollama run gemma4:26b
# observe ~34 t/s in generation

# With standalone llama.cpp b8765 (same model, same quant, same hardware)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
  llama-bench -m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 99 -mmp 0 -fa 1 -p 0 -n 128
# observe 52+ t/s
```

## Request

Please update the vendored llama.cpp to a commit that includes both PRs (any commit ≥ b8500 / after Mar 15, 2026 should include both). The ROCm 7.2.1 update in v0.20.7 is appreciated, but the Vulkan path (which is what iGPU/APU users rely on — ROCm doesn't support Strix Halo yet) is still stuck on December code.

Users with AMD APUs (Strix Halo, Phoenix, Hawk Point) running Vulkan are leaving ~56% performance on the table compared to what's already available in upstream llama.cpp.


@sagar-kale commented on GitHub (Apr 15, 2026):

## Bump test: ec98e2002 → b8797 (Apr 15, 2026)

Tested what it takes to vendor-bump llama.cpp to include both target PRs. Using the `Makefile.sync` / `apply-patches` mechanism against the latest llama.cpp HEAD (`8dc530b86`, Apr 15 2026):

**18 of 36 patches applied cleanly. 18 failed.**

### Failed patches

| Patch | Failure type | Files |
|---|---|---|
| 0001-ggml-backend-malloc | Content conflict | `ggml-sycl.cpp` |
| 0003-clip-unicode | Content conflict | `tools/mtmd/clip.cpp` |
| 0004-solar-pro | Content conflict | `llama-model.h`, `llama-arch.h/cpp` |
| 0005-fix-deepseek-deseret-regex | Content conflict | `unicode.cpp` |
| 0009-remove-amx | Content conflict | `ggml/CMakeLists.txt` |
| 0015-ggml-Export-GPU-UUIDs | sha1 error | `ggml-backend.h` |
| 0018-ggml-Add-batch-size-hint | sha1 error | `ggml-backend.h` |
| 0020-ggml-No-alloc-mode | sha1 error | `ggml-backend.h` |
| 0021-decode-disable-output_all | Content conflict | `llama-context.cpp` |
| 0022-ggml-Enable-resetting-backend-devices | sha1 error | `ggml-backend.h` |
| 0024-GPU-discovery-enhancements | sha1 error | `ggml-backend.h` |
| 0025-NVML-fallback-unified-memory | sha1 error | `mem_nvml.cpp` (moved?) |
| 0026-report-LoadLibrary-failures | sha1 error | `ggml-backend-reg.cpp` |
| 0027-interleave-multi-rope | Content conflict | `rope_funcs.glsl`, `rope.cu` |
| 0028-Add-memory-detection-DXGI-PDH | sha1 error | `ggml/CMakeLists.txt` |
| 0032-ggml-enable-MLA-flash-attention | Content conflict | `ggml-metal-device.m`, `fattn*` CUDA |
| 0033-ggml-metal-solve_tri | Content conflict | `ggml-metal.metal`, `ggml-metal-device.m` |
| 0036-backport-kernels-gemma4 | Content conflict | `ggml-metal.metal`, `fattn.cu`, `fattn-mma-f16.cuh` |

### Failure pattern

**8 "sha1 lacking" errors** all cluster around `ggml/include/ggml-backend.h` — this file has been substantially rewritten upstream, so git can't construct a 3-way merge base for any of the Ollama patches that touch it. These would need manual re-implementation.

**10 content conflicts** are more tractable — the surrounding context shifted, but the files exist and the intent of each patch is clear.
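
For anyone re-running this triage, the apply step can be approximated with a plain `git am` loop — a sketch, assuming Ollama's patches live in `llama/patches/` (true at the time of writing; the real `Makefile.sync` targets do more):

```bash
# Approximate the vendor-bump patch test: apply each Ollama patch with a
# 3-way merge against the candidate llama.cpp commit, logging failures.
cd llama.cpp && git checkout 8dc530b86
for p in ../ollama/llama/patches/*.patch; do
  if git am -3 "$p" >/dev/null 2>&1; then
    echo "OK   $(basename "$p")"
  else
    echo "FAIL $(basename "$p")"
    git am --abort
  fi
done
```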


Posting this for maintainer awareness — the bump is doable but non-trivial given ggml-backend.h churn. Happy to test any candidate bump commit on gfx1151 hardware once it's ready.


@rick-github commented on GitHub (Apr 15, 2026):

While waiting for the [vendor sync](https://github.com/ollama/ollama/pull/14864), try ROCm.

| Setup | gemma4:26b-a4b-it-q4_K_M | gemma4:e4b-it-q4_K_M |
|---|---|---|
| ollama (ROCm) | 51.85 t/s | 53.18 t/s |

| | |
|---|---|
| **Hardware** | NucBox_EVO-X2 AMD RYZEN AI MAX+ 395 w/ Radeon 8060S |
| **GPU arch** | gfx1151 |
| **GTT/VRAM** | 15864M/98304M |
| **Available (ROCm)** | 111.5 GiB |
| **OS** | Linux Mint 22.3 |
| **Kernel** | 6.11.0-29-generic |
| **ROCm driver** | 7.2.1 |
| **Ollama version** | 0.20.5-rocm |

@chejh-amd commented on GitHub (Apr 16, 2026):

Great benchmarking work — the ~56% gap you measured lines up with what we'd expect from [ggml-org/llama.cpp#19625](https://github.com/ggml-org/llama.cpp/pull/19625) (Wave32 FA) and [ggml-org/llama.cpp#20551](https://github.com/ggml-org/llama.cpp/pull/20551) (graphics queue) being absent in the vendored build.

As rick-github noted, the ROCm path on Ollama 0.20.5 already gets you into that 51–53 t/s range on gfx1151 since it's not gated by the vendor sync. If you'd like to try it: Ollama ships a `-rocm` variant, and ROCm 7.2.1 supports Strix Halo (gfx1151).


@sagar-kale commented on GitHub (Apr 16, 2026):

Thanks @chejh-amd and @rick-github — I gave ROCm a proper go but kept hitting a wall. Here's what I tried:

**Attempt 1 — native `ollama-linux-amd64-rocm` tarball (v0.20.7)**

Downloaded the full ROCm-specific build (~944 MB, extracts ~2.5 GB of ROCm libs into `/usr/local/lib/ollama/rocm/`), started Ollama on a separate port so it wouldn't conflict with my existing setup. The GPU gets detected fine (`Radeon 8060S Graphics, compute=gfx1151`) but then hangs for exactly 30 seconds during `GGML_CUDA_INIT` and times out:

```
failure during GPU discovery ... error="failed to finish discovery before timeout"
inference compute id=cpu library=cpu
```

**Attempt 2 — Docker (`ollama/ollama:rocm`)**

Tried the Docker image thinking maybe it was a library mismatch on the host. Passed through `/dev/kfd` and `/dev/dri`, added the right group permissions. This one actually fails faster — it crashes immediately instead of timing out — but dmesg tells the same story:

```
amdgpu: [gfxhub] page fault ... Process ollama ...
GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932
  Faulty UTCL2 client ID: CPF (0x4)
  WALKER_ERROR: 0x1 / PERMISSION_FAULTS: 0x3 / MAPPING_ERROR: 0x1
```

So it's not a library issue — Docker shares the host kernel, so it hits the exact same thing.
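
For anyone retracing this, the passthrough invocation was along these lines — a reconstruction, not the exact command used; the device and group flags are the ones Ollama's AMD Docker docs recommend:

```bash
# Reconstructed sketch of the Docker attempt (not the exact command used).
docker run -d --name ollama-rocm \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:rocm
```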

My best guess is it's the kernel. I'm on 6.19.11, rick-github is on 6.11. I still have 6.17 kernels installed so I'll boot into that this weekend and try again — should at least tell us whether it's a 6.19 regression.

Before I do that, any ideas on what might be causing this on newer kernels? Any HSA/ROCm flags worth trying, or is this a known amdgpu issue with gfx1151 SVM on 6.19? Happy to test anything — I've got the hardware and a bit of time this weekend.


@rick-github commented on GitHub (Apr 16, 2026):

https://github.com/ollama/ollama/issues/15420#issuecomment-4208015418


@chejh-amd commented on GitHub (Apr 16, 2026):

The kernel version theory is very likely right. There have been multiple reports of GCVM_L2_PROTECTION_FAULT / page faults on gfx1151 with kernels 6.18.4+ and 6.19.x — the amdgpu driver in those kernels introduced changes that require a matching ROCm version to work correctly.

The short version: kernel 6.17.7 has been reported to work with ROCm on Strix Halo. Kernel 6.17.9 and later (including your 6.19.11) appear to need either ROCm nightlies or an upcoming ROCm release that aligns with the new kernel-side changes. So booting into your 6.17 kernel this weekend is the right experiment.

A few things worth checking when you do:

- Confirm your `linux-firmware` package version — some firmware builds (e.g. 20251125) are also known to break ROCm regardless of kernel.
- If 6.17 works, `uname -r` + `apt list --installed linux-firmware` would be useful data to post back here.


@sagar-kale commented on GitHub (Apr 16, 2026):

Following up on the weekend testing I promised — ended up going considerably deeper than planned.

**tl;dr:** The GCVM_L2 page fault is not a kernel-version regression. It's a fundamental KFD compute mapping issue present across all Ubuntu kernels I tested. ROCm init actually works with the right userspace, but GPU compute dispatch fails regardless.


**Kernel 6.17 test**

Booted into 6.17.0-20-generic as planned. Same `[gfxhub] page fault` at ROCm init — so it's not a 6.19 regression, it affects 6.17 too. That ruled out the kernel-version theory pretty quickly.

**linux-firmware check**

Ubuntu's package is `20240318.git3b128b60`. Came across a Framework community thread saying `linux-firmware 20251125` broke ROCm on gfx1151 and `20260309` restored it. Manually pulled the GC 11.5.1 compute blobs (`gc_11_5_1_me`, `mec`, `mes1`, `mes_2`, `pfp`) from upstream linux-firmware HEAD and rebuilt the initramfs. Reran the Docker ROCm test — still the exact same page fault. Firmware wasn't the issue.
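
The firmware swap was roughly the following — a sketch, assuming the standard linux-firmware tree layout and the blob names listed above:

```bash
# Pull GC 11.5.1 compute blobs from upstream linux-firmware HEAD and
# rebuild the initramfs (sketch; blob names as listed above).
git clone --depth 1 \
  https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
sudo cp linux-firmware/amdgpu/gc_11_5_1_{me,mec,mes1,mes_2,pfp}.bin \
  /lib/firmware/amdgpu/
sudo update-initramfs -u
```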

**TheRock 7.13 nightly (the interesting one)**

Tried a different angle: installed AMD's TheRock nightly ROCm build (7.13.0a20260416, gfx1151-specific wheels via `pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ "rocm[libraries,devel]"`).

Running `rocminfo` with these newer userspace libs gave the first clean GPU detection I'd seen:

```
Agent 2: gfx1151 — Radeon 8060S Graphics
  Compute Unit: 40
  Wavefront Size: 32
  Pool 1: ~63 GB GLOBAL COARSE GRAINED
  ISA: amdgcn-amd-amdhsa--gfx1151
```

No dmesg errors at all. So TheRock 7.13 fixed the HSA init page fault that was happening with ROCm 7.2.x.

Built llama.cpp against those libs with `GGML_HIP_NO_VMM=ON` and `GPU_TARGETS=gfx1151` (flags from a known-good Proxmox/gfx1151 report). Device detection worked:

```
ggml_cuda_init: found 1 ROCm devices
Device 0: Radeon 8060S Graphics, gfx1151, VMM: no, VRAM: 63717 MiB
```

But the moment any GPU kernel is dispatched — even a trivial 1024-element float add — it hangs and logs:

```
amdgpu: [gfxhub] page fault
GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932
  Faulty UTCL2 client ID: CPF (0x4)
  WALKER_ERROR: 0x1 / PERMISSION_FAULTS: 0x3 / MAPPING_ERROR: 0x1
```

CPF (Command Processor Fetch) can't access the virtual address the kernel placed the compute commands at. The address is in host user-space (~0x74c75c159000), consistent with a unified memory allocation that isn't getting pre-mapped into the GPU page tables before dispatch.
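
For reference, the build step was along these lines — a sketch, since the exact invocation wasn't posted; `GGML_HIP`, `GGML_HIP_NO_VMM`, and `GPU_TARGETS` are llama.cpp's documented HIP options, and the `hipconfig` locations depend on where TheRock installs its toolchain:

```bash
# Reconstructed HIP build (exact invocation not posted in the thread).
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -B build-hip \
    -DGGML_HIP=ON \
    -DGGML_HIP_NO_VMM=ON \
    -DGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build-hip --config Release -j"$(nproc)"
```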

**HSA_XNACK=1**

Tried this hoping the page fault would become retryable. `rocminfo` reports `XNACK enabled: NO` regardless — the kernel/firmware isn't exposing XNACK capability, so there's no retry path.

**Ubuntu OEM kernel (6.17.0-1017-oem)**

Also tried this on the theory it might carry different KFD patches. Same compute page fault, same result.


**Where things stand**

The init fault is fixed in TheRock 7.13 (ROCr VGPR count fix for gfx1151, landed Dec 2024). The compute fault is a separate, lower-level issue — the GPU page table walker can't resolve host virtual addresses during compute dispatch. This is a KFD/amdgpu kernel driver issue. The one configuration reported to work (llama.cpp discussion [#20856](https://github.com/ggml-org/llama.cpp/discussions/20856)) uses a Proxmox 6.19.2-1-pve kernel, which presumably carries a patch Ubuntu's kernels don't.

For now I'm sitting on llama-swap + Vulkan at ~52 t/s while this works itself out upstream. Happy to test anything specific if it'd be useful data for AMD.


@Znuff commented on GitHub (Apr 17, 2026):

@sagar-kale as per the previous comment, and AMD Documentation, you can now get it working on:

```
ii  linux-image-6.17.0-20-generic            6.17.0-20.20~24.04.1
ii  linux-image-generic-hwe-24.04-edge       6.17.0-20.20~24.04.1
```

The only catch, as detailed in the previous ticket mentioned, is that you have to use the 31.10+ `amdgpu` kernel driver, i.e.:

```
# cat amdgpu.list
deb [arch=amd64,i386 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/31.10/ubuntu noble main
```

This finally fixes the `GCVM_L2_PROTECTION_FAULT_STATUS` issue that was also mentioned in #13589.

For some reason, the AMD documentation only mentions installing the 30.30.x driver, which produces the `GCVM_L2_PROTECTION_FAULT_STATUS` issue on Ubuntu 24.04 (even with the latest HWE kernel).

*I assume this won't be an issue in 26.04 LTS.*


@sagar-kale commented on GitHub (Apr 17, 2026):

@Znuff @rick-github @chejh-amd — that tip about amdgpu 31.10 was the missing piece. Thank you, genuinely — I'd been going in circles for a while and that comment saved me from going further down dead ends.

Here's a full writeup of everything I tested after your replies, in case it's useful for others landing on this issue.


## System

| | |
|---|---|
| **Machine** | Minisforum MS-S1 Max |
| **CPU/APU** | AMD Ryzen AI MAX+ 395 (Strix Halo) |
| **GPU** | Integrated Radeon 8060S (gfx1151), 40 CUs |
| **RAM** | 128 GB unified memory |
| **GPU memory available** | ~116 GiB (via GTT pool, `amdgpu.gttsize=117760`) |
| **OS** | Ubuntu 24.04 LTS |
| **Default kernel** | 6.17.0-20-generic (HWE edge) |
| **amdgpu-dkms** | 31.10 (`1:6.18.4.31100000`) |
| **Ollama** | 0.20.7 native |

## What fixed it — amdgpu-dkms 31.10

Exactly as @Znuff described. Swapped the AMD repo from `30.30.1` → `31.10`, built the DKMS module for `6.17.0-20-generic`, rebooted. The `GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932` CPF fault was completely gone.

```bash
# /etc/apt/sources.list.d/amdgpu.list
deb [arch=amd64,i386 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/31.10/ubuntu noble main

sudo apt-get install amdgpu-dkms amdgpu-dkms-firmware
sudo dkms build amdgpu/6.18.4-2286447.24.04 -k 6.17.0-20-generic
sudo dkms install amdgpu/6.18.4-2286447.24.04 -k 6.17.0-20-generic
```
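
To confirm the module actually built and loaded against the running kernel (standard DKMS checks, not from the original steps):

```bash
# Verify the DKMS module is installed for the running kernel
dkms status amdgpu
# Confirm the loaded driver is the DKMS build, not the in-tree module
modinfo amdgpu | grep -E '^(filename|version)'
```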

Native Ollama service config (`/etc/systemd/system/ollama.service.d/override.conf` — note the `[Service]` header, which systemd drop-ins require):

```
[Service]
Environment="OLLAMA_VULKAN=0"
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
```

## Speed results — all models, each tested in isolation

Each model was tested with a fresh Ollama restart (single model in VRAM at a time). 128 token generation, flash attention enabled.

| Model | Size | t/s |
|---|---|---|
| gemma4:e4b Q4_K | 9.6 GB | **55.2 t/s** |
| gemma4:26b Q4_K | 17 GB | **54.4 t/s** |
| qwen3.5:9b Q4_K | 6.6 GB | **32.7 t/s** |
| gpt-oss:120b Q4_K | 65 GB | **26.6 t/s** |
| qwen2.5:32b Q4_K | 19 GB | **11.0 t/s** |
| gemma3:27b Q4_K | 17 GB | **11.8 t/s** |
| qwen3.5:122b Q4_K | 81 GB | **9.3 t/s** |

The 122b and 120b models spill some layers to CPU (Ollama uses partial GPU offload at that size), which explains the lower t/s. Everything else is fully on GPU.
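
To see how much of a model spilled to CPU, `ollama ps` reports the split (output below is illustrative, not from this thread):

```bash
ollama ps
# NAME            SIZE    PROCESSOR          UNTIL        (illustrative)
# qwen3.5:122b    81 GB   23%/77% CPU/GPU    4 minutes from now
```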

**GPU activity during inference** (monitored via sysfs during gemma4:26b): peaked at **95% busy**. VRAM sysfs shows ~0.3 GiB because on a unified-memory APU the model lives in the GTT pool, not the frame buffer — this is expected and doesn't indicate CPU fallback.
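
The sysfs counters referenced are the standard amdgpu ones; a polling loop looks like this (card index may differ per system):

```bash
# Poll amdgpu utilization and memory pools via sysfs (card index may vary)
D=/sys/class/drm/card0/device
while sleep 1; do
  printf 'busy %3s%%  vram %6d MiB  gtt %6d MiB\n' \
    "$(cat "$D/gpu_busy_percent")" \
    "$(( $(cat "$D/mem_info_vram_used") / 1048576 ))" \
    "$(( $(cat "$D/mem_info_gtt_used") / 1048576 ))"
done
```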

**One note on stacked vs isolated:** if you run multiple models sequentially without restarting Ollama (with `KEEP_ALIVE=-1`), previous models stay in VRAM. The 122b/120b models will fail allocation if a 17–26 GB model is still loaded alongside them. Isolated runs are the clean baseline.


## What didn't work (for the record)

- **30.30 amdgpu DKMS + any kernel**: `GCVM_L2_PROTECTION_FAULT_STATUS` CPF fault on every compute dispatch
- **TheRock 7.13 nightly + 30.30**: GPU init works (no HSA page fault), compute still faults — the VGPR fix in TheRock doesn't help without the driver fix
- **Ubuntu OEM kernel 6.17.0-1017-oem**: same compute fault with the 30.30 driver
- **linux-firmware GC 11.5.1 blobs from upstream HEAD**: no effect on the compute fault
- **HSA_XNACK=1**: `rocminfo` still reports `XNACK enabled: NO` — firmware/kernel not exposing XNACK, so no retry path
- **Ollama ROCm Docker (0.20.5, 0.21.0)**: works fine once the host has amdgpu 31.10, but ~15% slower than the native install (46 t/s vs 54 t/s on gemma4:26b) — I suspect the bundled ROCm libs differ from what the native install links against

## Questions for anyone who knows

**1. Will these speeds improve when Ollama bumps its llama.cpp vendor?**

Currently Ollama vendors llama.cpp at b7437 (Dec 2025). Two PRs landed after that which gave a big Vulkan boost on gfx1151:

- [ggml-org/llama.cpp#19625](https://github.com/ggml-org/llama.cpp/pull/19625) — Wave32 flash attention (Feb 2026)
- [ggml-org/llama.cpp#20551](https://github.com/ggml-org/llama.cpp/pull/20551) — graphics queue on AMD Vulkan (Mar 2026)

Are there equivalent ROCm/HIP improvements in newer llama.cpp that haven't made it into Ollama yet? Or is the ROCm path already pulling from a more current snapshot?

**2. Is 54 t/s on gemma4:26b about what you'd expect for this hardware on ROCm, or should it be higher?**

@rick-github's numbers on the NucBox EVO-X2 (same gfx1151) were 51.85 t/s on Ollama 0.20.5-rocm, so we're in the same range. But running the same model through llama.cpp b8765 with `GGML_HIP_NO_VMM=ON` only got 48.6 t/s — native Ollama ROCm beat our manual build. Curious whether there's an obvious flag or build option that'd push it further, or whether 54–55 t/s is the ceiling for this chip on a 26B Q4 model.


@PureBlissAK commented on GitHub (Apr 18, 2026):

## 🤖 Automated Triage & Analysis Report

**Issue**: #15601
**Analyzed**: 2026-04-18T18:19:47.508770

### Analysis

- **Type**: unknown
- **Severity**: medium
- **Components**: unknown

### Implementation Plan

- **Effort**: medium
- **Steps**:

*This issue has been triaged and marked for implementation.*

Reference: github-starred/ollama#35718