[GH-ISSUE #12432] Qwen3 vs Qwen3-2507 Regression caused by flash attention. AMD ROCM #70316

Closed
opened 2026-05-04 21:05:39 -05:00 by GiteaMirror · 22 comments

Originally created by @lennarkivistik on GitHub (Sep 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12432

What is the issue?

Not sure this is something the Ollama team can fix, but I thought I would at least mention it.

I was doing some performance benchmarks, being an avid user of local Ollama, and was comparing Qwen3 against the newer Qwen3 2507. While playing around I noticed a big regression on the newer Qwen3:*-2507 builds: they are much slower on AMD ROCm compared to the original Qwen3:4B.

It seems to stem from OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE.

What I tested

  • Just for testing's sake I took the small 4B variant.

  • Ran multiple context sizes (2k → 25k) with original Qwen3:4B and the new Qwen3:4B-2507 builds (q4_K_M and q8_0).

  • Compared performance across different Flash Attention / KV cache settings.

  • Original Qwen3:4B: runs great with FlashAttn=ON and KV=q8_0.

    • Sustains 110 tok/s at small/medium context (on the graphs below the figures go up to 400 tok/s because they measure throughput: the contexts were so small that the comparison is time-to-finish vs. total run time).
    • VRAM usage stable (~4–6 GB).
  • 2507 builds (q4_K_M and q8_0):

[Two benchmark graphs attached]

Relevant log output

Ollama environment variables:

# Good configs that work with the original Qwen3:4B (and the rest of my fleet of models, at least)
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0

# Required configs for Qwen3:4B-2507 builds (FlashAttn unsupported, q8 cache not usable)
OLLAMA_FLASH_ATTENTION=0
OLLAMA_KV_CACHE_TYPE=q4_0

OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_MAX_QUEUE=128
OLLAMA_KEEP_ALIVE=10m
OLLAMA_MODELS=/var/lib/ollama
OLLAMA_TMPDIR=/var/tmp/ollama
OLLAMA_NUM_THREADS=16
OLLAMA_NUM_PARALLEL=1
OLLAMA_NUM_BATCH=512
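
These are set on the ollama systemd unit (see the journalctl/systemctl commands later in this thread); a drop-in created with sudo systemctl edit ollama is one way to apply them — a minimal sketch, not necessarily the exact setup used here:

    [Service]
    # Env vars picked up by the Ollama server process
    Environment="OLLAMA_FLASH_ATTENTION=1"
    Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
    # afterwards: sudo systemctl daemon-reload && sudo systemctl restart ollama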

OS

Linux (EndeavourOS, Arch-based)

GPU

AMD Radeon RX 7900 XTX 24GB

CPU

AMD Ryzen 9 7950X3D with 64GB RAM

Ollama version

0.12.2

GiteaMirror added the bug label 2026-05-04 21:05:40 -05:00

@jessegross commented on GitHub (Sep 29, 2025):

You might want to try just flash attention but no KV cache quantization. In many cases the latter is the issue but disabling flash attention disables both.
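
In env terms that test would look roughly like this (a sketch; leaving OLLAMA_KV_CACHE_TYPE unset should fall back to the default f16 cache):

    # Flash attention on, KV cache quantization left at its default
    OLLAMA_FLASH_ATTENTION=1
    # OLLAMA_KV_CACHE_TYPE intentionally not set (defaults to f16)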


@lennarkivistik commented on GitHub (Sep 30, 2025):

Then I'll run another set of benchmarks without defining OLLAMA_KV_CACHE_TYPE. That explains why the graphs are quite similar when OLLAMA_FLASH_ATTENTION is set to 0: OLLAMA_KV_CACHE_TYPE didn't seem to do anything at that point.

On the RX 7900, most of the performance gains come from keeping FA on, if/when it is stable and the context does not eat up the 24GB of VRAM.

The most surprising thing is that the regression could not be seen on the initial question. When I just asked it to tell a short story, I got the results below, and for some reason gemma3 also gets very bad results there.

It was when I started adding more input data that performance tanked on the qwen3:*-2507 models.

| Model | Eval rate (tok/s) | Prompt eval rate (tok/s) | Total time |
| -- | -- | -- | -- |
| llama3.2:3b | 147.79 | 409.89 | 3.53 s |
| qwen3:4b-instruct-2507-q4_K_M | 111.94 | 216.81 | 3.54 s |
| qwen3:4b-thinking-2507-q4_K_M | 111.53 | 237.02 | 7.34 s |
| qwen3:4b | 109.84 | 154.36 | 8.51 s |
| gpt-oss:20b | 106.18 | 229.57 | 7.15 s |
| qwen3:4b-instruct-2507-q8_0 | 98.62 | 235.93 | 3.96 s |
| qwen3:4b-thinking-2507-q8_0 | 98.58 | 259.10 | 7.07 s |
| qwen3:8b | 76.83 | 134.83 | 7.72 s |
| qwen3:30b | 76.69 | 61.59 | 8.62 s |
| deepseek-r1:8b | 76.49 | 66.43 | 7.18 s |
| qwen3-coder:30b | 76.39 | 47.17 | 3.92 s |
| qwen3:14b | 51.18 | 103.42 | 14.24 s |
| deepseek-r1:32b | 23.73 | 55.55 | 2m12 s |
| gemma3:12b | 17.47 | 117.97 | 38.79 s |

@sunskyx commented on GitHub (Oct 2, 2025):

I also use the 7900XTX, and after rolling back to version ollama:0.12.1, the performance returned to normal.


@lennarkivistik commented on GitHub (Oct 3, 2025):

I ran a quick benchmark using both of your settings, and as expected, v0.12.1 was indeed faster. The only differences in my setup were switching the version and commenting out OLLAMA_KV_CACHE_TYPE for the v0.12.3 run.

The q4_K_M build was very unstable on my system; it froze multiple times even at smaller context sizes. (By “freeze,” I mean the GPU maxed out at 100% and never finished the run.) Normally, freezes are rare with other models as long as ctx_num is set higher than the combined input and system prompt. I even raised ctx_num to give extra breathing room, but the hangs persisted (on these models, unlike the rest).

So far, both qwen3:4b-instruct-2507-q8_0 and qwen3:4b-instruct-2507-q4_K_M seem unstable for my setup regardless of settings. I’ll do a more thorough round of testing over the weekend, since I’m curious if q4_k_m might behave better under v0.12.1.

[Benchmark graph attached]

@jessegross commented on GitHub (Oct 3, 2025):

Can you try out one of the 0.12.4 RCs? It improves flash attention performance on GPUs similar to yours.


@lennarkivistik commented on GitHub (Oct 4, 2025):

Hi Jesse, thanks for the ping. I tried 0.12.4 RC4. I installed both tarballs (generic + ROCm) just like I do on stable, no other changes to my unit or env.

0.12.1 - 0.12.3 (works)

  • Service comes up and selects ROCm correctly:

    version 0.12.1
    ...
    inference compute id=GPU-2b91a683f3f9e991 library=rocm compute=gfx1100 total="24.0 GiB" available="22.2 GiB"
    
  • Logs show my GPU is supported:

    rocm supported GPUs [ ... gfx1100 ... ]
    amdgpu is supported gpu_type=gfx1100
    

0.12.4-rc4 (regresses to CPU)

  • ggml sees the ROCm device:

    ggml_cuda_init: found 1 ROCm devices:
      Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100)
    
  • But the runner returns no usable combos and falls back to CPU/low VRAM:

    runner enumerated devices OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/rocm]" devices=[]
    filtering out unsupported or overlapping GPU library combinations → supported=map[]
    inference compute id=cpu library=cpu ...
    entering low vram mode "total vram"="0 B"
    

What my upgrade script does (between versions)

  • Versioned fetch: VERSION=<tag> ./upgrade-ollama.sh downloads both release archives:

    • ollama-linux-amd64.tgz (generic)
    • ollama-linux-amd64-rocm.tgz (ROCm payload)
  • Clean + install flow:

    • Stops the ollama service.
    • Detects current install prefix; installs the binary to /usr/bin/ollama (or /usr/local/bin if that was in use).
    • Removes legacy binaries in the other prefix (prevents PATH confusion).
    • Extracts the ROCm tar straight into /usr → libraries end up under /usr/lib/ollama/rocm (libggml-hip.so, libamdhip64.so, librocblas.so, etc.). It does not touch /opt/rocm.
    • Does not set or modify LD_LIBRARY_PATH (it only prints a warning suggesting adding /usr/lib/ollama/rocm).
  • On 0.12.4-rc4 the runner probes those and ROCm, but still filters down to devices=[].

v0.12.4-rc4.log
v0.12.1.log

Upgrade script I use if it also helps

VERSION=0.12.4-rc4 ./upgrade-ollama.sh
upgrade-ollama.sh

Just ping if you want me to test another candidate build; I'm happy to help using my hardware.


@lennarkivistik commented on GitHub (Oct 5, 2025):

So I ran some more tests today and tried to figure out why qwen3:4b-instruct-2507-q8_0 and qwen3:4b-instruct-2507-q4_K_M were misbehaving, as the problem was still there whichever version I ran.

Running ollama run qwen3:4b-instruct-2507-q8_0 always worked and responded quickly, but when I ran the same model via the API (/api/generate), it froze up and I got those bad results (and only with those two models, never with others).

One change I made in my benchmarking code was to hard-set the batch size to 512; before, I had it scale up a bit with larger contexts, but the issue persisted even after that.

      const response = await axios.post(`${this.baseUrl}/api/generate`, {
        model,
        prompt,
        stream: false,
        options: {
          num_ctx: options.numCtx || 4096,
          num_batch: 512,   // hard-set batch size
          num_thread: 16,
          ...options
        }
      });
After running ollama rm qwen3:4b-instruct-2507-q8_0 and reinstalling both models, the issue seems to have resolved itself for now; the newer tests look good on 0.12.3. It still does not explain why the issue appeared when running /api/generate but not ollama run with the exact same model. Anyway, I'm eager to test the new 0.12.4 if it has improvements to Flash Attention, as it's a blessing for local models :)
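
For reproducing the API path outside the benchmark harness, a minimal curl equivalent of the request above would be something like this (port 11434 is the Ollama default; the prompt is just a placeholder):

    curl -s http://localhost:11434/api/generate -d '{
      "model": "qwen3:4b-instruct-2507-q8_0",
      "prompt": "Summarize the following text: ...",
      "stream": false,
      "options": { "num_ctx": 4096, "num_batch": 512, "num_thread": 16 }
    }'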

Benchmark Method

Cut out a chunk of a sample book corresponding to the estimated number of tokens needed, with a system prompt asking the model to summarize what it got; the summary was discarded. Each run simply increased the size of the chunk taken from the book.

[Benchmark results screenshot attached]

@jessegross commented on GitHub (Oct 6, 2025):

Does 0.12.4-rc see your GPU if you don't set HIP_VISIBLE_DEVICES? There is new GPU discovery code in this release, so it's possible there is an issue there.


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik your log seems to have long lines chopped off by the default pager. Can you use journalctl -u ollama --no-pager --pager-end to ensure we can see the full log lines? If you could also set OLLAMA_DEBUG=2 to turn on trace logs, that will help spot what's going wrong. As Jesse mentioned, not setting HIP_VISIBLE_DEVICES may work around whatever the bug is until we get it fixed. My suspicion is that device 0 is an iGPU in your setup, but without trace logs I can't tell for sure. I only need the logs from startup to where it reports "inference compute".


@lennarkivistik commented on GitHub (Oct 6, 2025):

Yes, I noticed that; I'll send an update. Also, uncommenting HIP_VISIBLE_DEVICES did not make it utilize the GPU.

0.12.4-rc6.log


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik those updated logs include the full lines, but stop before the final "inference compute" was reported. That said, I see there is an iGPU, so somehow we're getting the indexing wrong most likely.


@lennarkivistik commented on GitHub (Oct 6, 2025):

Well, that was embarrassing, not including the end. Here are some fresh new ones:
Here I set HIP_VISIBLE_DEVICES to 1
0.12.4-rc6-hip1.txt
Here I have HIP_VISIBLE_DEVICES unset
0.12.4-rc6-no-hip.txt

The default was having it set to 0.


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik based on your logs, it seems like you might have an old copy of libggml-hip.so in ./lib/ollama/ from a prior build or version of Ollama. Can you make sure to remove everything under ./lib/ollama before extracting the tgz files and see if that fixes things?


@lennarkivistik commented on GitHub (Oct 6, 2025):

I cleared /lib/ollama/* and also updated my upgrade script to do that on future upgrades/downgrades, but it made no difference, so that was likely not the issue.

I probed the changelog with Chattie, and it thinks the filter is the culprit:

    The new test cases for filterOverlapByLibrary explicitly decide which version “wins” when the same device IDs appear
    under cuda_v12 vs cuda_v13. That’s fine for CUDA↔CUDA, but if ROCm presents the same GPU via a different ID format
    (PCI bus ID vs logical index vs UUID), the map keys may collide or fail to match in a way that marks all entries
    “needs deletion”. The result would be an empty supported map, exactly like my logs show.

Just ping if you want me to confirm or try something; I'll happily help out with ROCm tests (my kind of hardware) in the future as well!


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik can you share what is present in ls -l /usr/lib/ollama/ on your system?

The following log lines imply there's a libggml-hip.so present there - it shouldn't be discovering AMD GPUs without that library. This runner is a "cuda" runner, but it's reporting AMD devices, which explains why things are getting mixed up.

okt 06 20:44:45 ollama[2989514]: time=2025-10-06T20:44:45.409+02:00 level=DEBUG source=runner.go:401 msg="spawing runner with" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v12]" extra_envs=[]
...
okt 06 20:44:45 ollama[2989532]: time=2025-10-06T20:44:45.428+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
okt 06 20:44:45 ollama[2989532]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
okt 06 20:44:45 ollama[2989532]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
okt 06 20:44:45 ollama[2989532]: ggml_cuda_init: found 2 ROCm devices:
okt 06 20:44:45 ollama[2989532]:   Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, ID: GPU-2b91a683f3f9e991
okt 06 20:44:45 ollama[2989532]:   Device 1: AMD Radeon Graphics, gfx1036 (0x1036), VMM: no, Wave Size: 32, ID: 1
okt 06 20:44:45 ollama[2989532]: time=2025-10-06T20:44:45.932+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v12

@lennarkivistik commented on GitHub (Oct 6, 2025):

This is what's present under /usr/lib/ollama:
tree /usr/lib/ollama/


@dhiltgen commented on GitHub (Oct 6, 2025):

The contents in ./lib/ollama/rocm look plausible but I'm not seeing the ./lib/ollama/cuda_v12 or ./lib/ollama/cuda_v13 directories which should have been there. My current theory is there's another ./lib/ollama/libggml-hip.so which shouldn't be there as well. I don't see them in the tgz's we published, so my theory is it's a stale file.

% curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64-rocm.tgz | tar tzf -  | grep libggml-hip.so
lib/ollama/rocm/libggml-hip.so
% curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64.tgz | tar tzf -  | grep libggml-hip.so
%

I'm curious if rm -rf /usr/lib/ollama followed by extracting the 2 tar files above starts working, or still has problems with your setup.
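
In other words, roughly the following (a sketch assuming the rc6 release URLs above and the /usr install prefix described earlier in this thread):

    sudo systemctl stop ollama
    sudo rm -rf /usr/lib/ollama
    curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64.tgz | sudo tar -C /usr -xzf -
    curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64-rocm.tgz | sudo tar -C /usr -xzf -
    sudo systemctl start ollama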


@dhiltgen commented on GitHub (Oct 6, 2025):

It also appears some log lines aren't showing up, and I'm not sure why. Perhaps to eliminate variables, can you simply run

sudo systemctl stop ollama
OLLAMA_DEBUG=2 /usr/bin/ollama serve 2>&1 | tee /tmp/serve.log

and then ^C as soon as you see "inference compute" and share that log?


@dhiltgen commented on GitHub (Oct 6, 2025):

I might see what's going on - are you by any chance manually setting OLLAMA_LIBRARY_PATH=/usr/lib/ollama/rocm? It looks like that's getting set twice and might be what's causing the problem.
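
A quick way to check where that might be coming from (assuming the systemd unit and your shell environment are the usual suspects):

    systemctl cat ollama | grep -i OLLAMA_LIBRARY_PATH
    systemctl show ollama -p Environment
    env | grep OLLAMA_LIBRARY_PATH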


@lennarkivistik commented on GitHub (Oct 6, 2025):

Doing it manually worked with 0.12.4-rc6:
0.12.4-rc6-working.txt

So essentially my upgrade script (updated now) was only copying over the ROCm subtree, leaving /usr/lib/ollama too bare. That worked for 0.12.3 and below but not for 0.12.4-rc and above: the new discovery runner then failed to load any GPU backend and fell back to CPU.

Thanks @dhiltgen!


@lennarkivistik commented on GitHub (Oct 6, 2025):

Wow!
After running my first benchmarks with rc6, this has got to be the best performance boost for ROCm, close to an order of magnitude!

Very impressive!!

[Benchmark graph attached]

For qwen3:4b-instruct-2507-q4_K_M, the prompt eval rate has gone from 227.02 tokens/s to 1905.77 tokens/s on a short "tell a story" prompt.

v0.12.3 topped out around 33k context at ~19 tok/s.
v0.12.4-rc6 extends stable generation past 128k context with dramatically better scaling, so there must be a major optimization in memory handling or the FlashAttention implementation between builds.

| Ollama Version | Model | FlashAttn | Context (tokens) | Batch Size | Duration (s) |
| -- | -- | -- | -- | -- | -- |
| 0.12.4-rc6 | qwen3:4b-instruct-2507-q8_0 | ON | 80 000 | 2 048 | 77 |
| 0.12.4-rc6 | qwen3:4b-instruct-2507-q8_0 | ON | 129 440 | 2 048 | 122 |

@lennarkivistik commented on GitHub (Oct 11, 2025):

Thanks again. As I stated before, @dhiltgen and @jessegross, I'm very happy to help out with testing new RCs that touch anything related to my hardware (Linux / AMD / ROCm), if there is ever a need.
Ollama is awesome! Blog Post About This


Reference: github-starred/ollama#70316