[GH-ISSUE #12432] Qwen3 vs Qwen3-2507 Regression caused by flash attention. AMD ROCM #70316

Closed
opened 2026-05-04 21:05:39 -05:00 by GiteaMirror · 22 comments

Originally created by @lennarkivistik on GitHub (Sep 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12432

What is the issue?

Not sure this is something the Ollama team can fix, but I thought I would at least mention it.

I was doing some performance benchmarks, being an avid user of local Ollama, and was comparing Qwen3 against the newer Qwen3 2507. While playing around I noticed a big regression on the newer Qwen3:*-2507 builds: they are much slower on AMD ROCm compared to the original Qwen3:4B.

It seems to stem from OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE.

What I tested

  • Just for testing's sake I took the small 4B variant.

  • Ran multiple context sizes (2k → 25k) with original Qwen3:4B and the new Qwen3:4B-2507 builds (q4_K_M and q8_0).

  • Compared performance across different Flash Attention / KV cache settings.

  • Original Qwen3:4B: runs great with FlashAttn=ON and KV=q8_0.

    • Sustains 110 tok/s at small/medium context (on the graphs below the figures go up to 400 tok/s because they measure throughput: the contexts were so small that the comparison is time-to-finish vs. total run time).
    • VRAM usage stable (~4–6 GB).
  • 2507 builds (q4_K_M and q8_0):

[Two benchmark graphs attached]

Relevant log output

Ollama environment variables:

# Good configs that work with the original Qwen3:4B (and the rest of my fleet of models, at least)
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0

# Required configs for Qwen3:4B-2507 builds (FlashAttn unsupported, q8 cache not usable)
OLLAMA_FLASH_ATTENTION=0
OLLAMA_KV_CACHE_TYPE=q4_0

OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_MAX_QUEUE=128
OLLAMA_KEEP_ALIVE=10m
OLLAMA_MODELS=/var/lib/ollama
OLLAMA_TMPDIR=/var/tmp/ollama
OLLAMA_NUM_THREADS=16
OLLAMA_NUM_PARALLEL=1
OLLAMA_NUM_BATCH=512
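
These are set on the ollama systemd unit (see the journalctl/systemctl commands later in this thread); a drop-in created with sudo systemctl edit ollama is one way to apply them — a minimal sketch, not necessarily the exact setup used here:

    [Service]
    # Env vars picked up by the Ollama server process
    Environment="OLLAMA_FLASH_ATTENTION=1"
    Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
    # afterwards: sudo systemctl daemon-reload && sudo systemctl restart ollama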

OS

Linux (EndeavourOS, Arch-based)

GPU

AMD Radeon RX 7900 XTX 24GB

CPU

AMD Ryzen 9 7950X3D with 64GB RAM

Ollama version

0.12.2

GiteaMirror added the bug label 2026-05-04 21:05:40 -05:00

@jessegross commented on GitHub (Sep 29, 2025):

You might want to try just flash attention but no KV cache quantization. In many cases the latter is the issue but disabling flash attention disables both.
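
In env terms that test would look roughly like this (a sketch; leaving OLLAMA_KV_CACHE_TYPE unset should fall back to the default f16 cache):

    # Flash attention on, KV cache quantization left at its default
    OLLAMA_FLASH_ATTENTION=1
    # OLLAMA_KV_CACHE_TYPE intentionally not set (defaults to f16)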


@lennarkivistik commented on GitHub (Sep 30, 2025):

Then I'll run another set of benchmarks without defining OLLAMA_KV_CACHE_TYPE. That explains why the graphs are quite similar when OLLAMA_FLASH_ATTENTION is set to 0: OLLAMA_KV_CACHE_TYPE didn't seem to do anything at that point.

On the RX 7900, most of the performance gains come from keeping FA on, if/when it is stable and the context does not eat up the 24GB of VRAM.

The most surprising thing is that the regression could not be seen on the initial question. When I just asked it to tell a short story, I got the results below, and for some reason gemma3 also gets very bad results there.

It was when I started adding more input data that performance tanked on the qwen3:*-2507 models.

| Model | Eval rate (tok/s) | Prompt eval rate (tok/s) | Total time |
| -- | -- | -- | -- |
| llama3.2:3b | 147.79 | 409.89 | 3.53 s |
| qwen3:4b-instruct-2507-q4_K_M | 111.94 | 216.81 | 3.54 s |
| qwen3:4b-thinking-2507-q4_K_M | 111.53 | 237.02 | 7.34 s |
| qwen3:4b | 109.84 | 154.36 | 8.51 s |
| gpt-oss:20b | 106.18 | 229.57 | 7.15 s |
| qwen3:4b-instruct-2507-q8_0 | 98.62 | 235.93 | 3.96 s |
| qwen3:4b-thinking-2507-q8_0 | 98.58 | 259.10 | 7.07 s |
| qwen3:8b | 76.83 | 134.83 | 7.72 s |
| qwen3:30b | 76.69 | 61.59 | 8.62 s |
| deepseek-r1:8b | 76.49 | 66.43 | 7.18 s |
| qwen3-coder:30b | 76.39 | 47.17 | 3.92 s |
| qwen3:14b | 51.18 | 103.42 | 14.24 s |
| deepseek-r1:32b | 23.73 | 55.55 | 2m12 s |
| gemma3:12b | 17.47 | 117.97 | 38.79 s |

@sunskyx commented on GitHub (Oct 2, 2025):

I also use the 7900XTX, and after rolling back to version ollama:0.12.1, the performance returned to normal.


@lennarkivistik commented on GitHub (Oct 3, 2025):

I ran a quick benchmark using both of your settings, and as expected, v0.12.1 was indeed faster. The only differences in my setup were switching the version and commenting out OLLAMA_KV_CACHE_TYPE for the v0.12.3 run.

The q4_K_M build was very unstable on my system; it froze multiple times even at smaller context sizes. (By “freeze,” I mean the GPU maxed out at 100% and never finished the run.) Normally, freezes are rare with other models as long as ctx_num is set higher than the combined input and system prompt. I even raised ctx_num to give extra breathing room, but the hangs persisted (on these models, unlike the rest).

So far, both qwen3:4b-instruct-2507-q8_0 and qwen3:4b-instruct-2507-q4_K_M seem unstable for my setup regardless of settings. I’ll do a more thorough round of testing over the weekend, since I’m curious if q4_k_m might behave better under v0.12.1.

[Benchmark graph attached]

@jessegross commented on GitHub (Oct 3, 2025):

Can you try out one of the 0.12.4 RCs? It improves flash attention performance on GPUs similar to yours.


@lennarkivistik commented on GitHub (Oct 4, 2025):

Hi Jesse, thanks for the ping. I tried 0.12.4 RC4. I installed both tarballs (generic + ROCm) just like I do on stable, no other changes to my unit or env.

0.12.1 - 0.12.3 (works)

  • Service comes up and selects ROCm correctly:

    version 0.12.1
    ...
    inference compute id=GPU-2b91a683f3f9e991 library=rocm compute=gfx1100 total="24.0 GiB" available="22.2 GiB"
    
  • Logs show my GPU is supported:

    rocm supported GPUs [ ... gfx1100 ... ]
    amdgpu is supported gpu_type=gfx1100
    

0.12.4-rc4 (regresses to CPU)

  • ggml sees the ROCm device:

    ggml_cuda_init: found 1 ROCm devices:
      Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100)
    
  • But the runner returns no usable combos and falls back to CPU/low VRAM:

    runner enumerated devices OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/rocm]" devices=[]
    filtering out unsupported or overlapping GPU library combinations → supported=map[]
    inference compute id=cpu library=cpu ...
    entering low vram mode "total vram"="0 B"
    

What my upgrade script does (between versions)

  • Versioned fetch: VERSION=<tag> ./upgrade-ollama.sh downloads both release archives:

    • ollama-linux-amd64.tgz (generic)
    • ollama-linux-amd64-rocm.tgz (ROCm payload)
  • Clean + install flow:

    • Stops the ollama service.
    • Detects current install prefix; installs the binary to /usr/bin/ollama (or /usr/local/bin if that was in use).
    • Removes legacy binaries in the other prefix (prevents PATH confusion).
    • Extracts the ROCm tar straight into /usr → libraries end up under /usr/lib/ollama/rocm (libggml-hip.so, libamdhip64.so, librocblas.so, etc.). It does not touch /opt/rocm.
    • Does not set or modify LD_LIBRARY_PATH (it only prints a warning suggesting adding /usr/lib/ollama/rocm).
  • On 0.12.4-rc4 the runner probes those and ROCm, but still filters down to devices=[].

v0.12.4-rc4.log
v0.12.1.log

Upgrade script I use if it also helps

VERSION=0.12.4-rc4 ./upgrade-ollama.sh
upgrade-ollama.sh

Just ping if you want me to test another candidate build; I'm happy to help using my hardware.


@lennarkivistik commented on GitHub (Oct 5, 2025):

So I ran some more tests today and tried to figure out why qwen3:4b-instruct-2507-q8_0 and qwen3:4b-instruct-2507-q4_K_M were misbehaving, as the problem was still there whichever version I ran.

Running ollama run qwen3:4b-instruct-2507-q8_0 always worked and responded quickly, but when I ran the same model via the API (/api/generate), it froze up and I got those bad results (and only with those two models, never with others).

One change I made in my benchmarking code was to hard-set the batch size to 512; before, I had it scale up a bit with larger contexts, but the issue persisted even after that.

      const response = await axios.post(`${this.baseUrl}/api/generate`, {
        model,
        prompt,
        stream: false,
        options: {
          num_ctx: options.numCtx || 4096,
          num_batch: 512,   // hard-set batch size
          num_thread: 16,
          ...options
        }
      });
After running ollama rm qwen3:4b-instruct-2507-q8_0 and reinstalling both models, the issue seems to have resolved itself for now; the newer tests look good on 0.12.3. It still does not explain why the issue appeared when running /api/generate but not ollama run with the exact same model. Anyway, I'm eager to test the new 0.12.4 if it has improvements to Flash Attention, as it's a blessing for local models :)
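
For reproducing the API path outside the benchmark harness, a minimal curl equivalent of the request above would be something like this (port 11434 is the Ollama default; the prompt is just a placeholder):

    curl -s http://localhost:11434/api/generate -d '{
      "model": "qwen3:4b-instruct-2507-q8_0",
      "prompt": "Summarize the following text: ...",
      "stream": false,
      "options": { "num_ctx": 4096, "num_batch": 512, "num_thread": 16 }
    }'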

Benchmark Method

Cut out a chunk of a sample book corresponding to the estimated number of tokens needed, with a system prompt asking the model to summarize what it got; the summary was discarded. Each run simply increased the size of the chunk taken from the book.

[Benchmark results screenshot attached]

@jessegross commented on GitHub (Oct 6, 2025):

Does 0.12.4-rc see your GPU if you don't set HIP_VISIBLE_DEVICES? There is new GPU discovery code in this release, so it's possible there is an issue there.


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik your log seems to have long lines chopped off by the default pager. Can you use journalctl -u ollama --no-pager --pager-end to ensure we can see the full log lines? If you could also set OLLAMA_DEBUG=2 to turn on trace logs, that will help spot what's going wrong. As Jesse mentioned, not setting HIP_VISIBLE_DEVICES may work around whatever the bug is until we get it fixed. My suspicion is that device 0 is an iGPU in your setup, but without trace logs I can't tell for sure. I only need the logs from startup to where it reports "inference compute".


@lennarkivistik commented on GitHub (Oct 6, 2025):

Yes, I noticed that; I'll send an update. Also, uncommenting HIP_VISIBLE_DEVICES did not make it utilize the GPU.

0.12.4-rc6.log


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik those updated logs include the full lines, but stop before the final "inference compute" was reported. That said, I see there is an iGPU, so somehow we're getting the indexing wrong most likely.


@lennarkivistik commented on GitHub (Oct 6, 2025):

Well, that was embarrassing, not including the end. Here are some fresh new ones:
Here I set HIP_VISIBLE_DEVICES to 1
0.12.4-rc6-hip1.txt
Here I have HIP_VISIBLE_DEVICES unset
0.12.4-rc6-no-hip.txt

The default was having it set to 0.


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik based on your logs, it seems like you might have an old copy of libggml-hip.so in ./lib/ollama/ from a prior build or version of Ollama. Can you make sure to remove everything under ./lib/ollama before extracting the tgz files and see if that fixes things?


@lennarkivistik commented on GitHub (Oct 6, 2025):

I cleared /lib/ollama/* and also updated my upgrade script to do that on future upgrades/downgrades, but it made no difference, so that was likely not the issue.

I probed the changelog with Chattie, and it thinks the filter is the culprit:

    The new test cases for filterOverlapByLibrary explicitly decide which version “wins” when the same device IDs appear
    under cuda_v12 vs cuda_v13. That’s fine for CUDA↔CUDA, but if ROCm presents the same GPU via a different ID format
    (PCI bus ID vs logical index vs UUID), the map keys may collide or fail to match in a way that marks all entries
    “needs deletion”. The result would be an empty supported map, exactly like my logs show.

Just ping if you want me to confirm or try something; I'll happily help out with ROCm tests (my kind of hardware) in the future as well!


@dhiltgen commented on GitHub (Oct 6, 2025):

@lennarkivistik can you share what is present in ls -l /usr/lib/ollama/ on your system?

The following log lines imply there's a libggml-hip.so present there - it shouldn't be discovering AMD GPUs without that library. This runner is a "cuda" runner, but it's reporting AMD devices, which explains why things are getting mixed up.

okt 06 20:44:45 ollama[2989514]: time=2025-10-06T20:44:45.409+02:00 level=DEBUG source=runner.go:401 msg="spawing runner with" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v12]" extra_envs=[]
...
okt 06 20:44:45 ollama[2989532]: time=2025-10-06T20:44:45.428+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
okt 06 20:44:45 ollama[2989532]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
okt 06 20:44:45 ollama[2989532]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
okt 06 20:44:45 ollama[2989532]: ggml_cuda_init: found 2 ROCm devices:
okt 06 20:44:45 ollama[2989532]:   Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, ID: GPU-2b91a683f3f9e991
okt 06 20:44:45 ollama[2989532]:   Device 1: AMD Radeon Graphics, gfx1036 (0x1036), VMM: no, Wave Size: 32, ID: 1
okt 06 20:44:45 ollama[2989532]: time=2025-10-06T20:44:45.932+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v12

@lennarkivistik commented on GitHub (Oct 6, 2025):

This is what's present under /usr/lib/ollama:
tree /usr/lib/ollama/


@dhiltgen commented on GitHub (Oct 6, 2025):

The contents in ./lib/ollama/rocm look plausible but I'm not seeing the ./lib/ollama/cuda_v12 or ./lib/ollama/cuda_v13 directories which should have been there. My current theory is there's another ./lib/ollama/libggml-hip.so which shouldn't be there as well. I don't see them in the tgz's we published, so my theory is it's a stale file.

% curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64-rocm.tgz | tar tzf -  | grep libggml-hip.so
lib/ollama/rocm/libggml-hip.so
% curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64.tgz | tar tzf -  | grep libggml-hip.so
%

I'm curious if rm -rf /usr/lib/ollama followed by extracting the 2 tar files above starts working, or still has problems with your setup.
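
In other words, roughly the following (a sketch assuming the rc6 release URLs above and the /usr install prefix described earlier in this thread):

    sudo systemctl stop ollama
    sudo rm -rf /usr/lib/ollama
    curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64.tgz | sudo tar -C /usr -xzf -
    curl -fsSL https://github.com/ollama/ollama/releases/download/v0.12.4-rc6/ollama-linux-amd64-rocm.tgz | sudo tar -C /usr -xzf -
    sudo systemctl start ollama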


@dhiltgen commented on GitHub (Oct 6, 2025):

It also appears some log lines aren't showing up, and I'm not sure why. Perhaps to eliminate variables, can you simply run

sudo systemctl stop ollama
OLLAMA_DEBUG=2 /usr/bin/ollama serve 2>&1 | tee /tmp/serve.log

and then ^C as soon as you see "inference compute" and share that log?


@dhiltgen commented on GitHub (Oct 6, 2025):

I might see what's going on - are you by any chance manually setting OLLAMA_LIBRARY_PATH=/usr/lib/ollama/rocm? It looks like that's getting set twice and might be what's causing the problem.
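
A quick way to check where that might be coming from (assuming the systemd unit and your shell environment are the usual suspects):

    systemctl cat ollama | grep -i OLLAMA_LIBRARY_PATH
    systemctl show ollama -p Environment
    env | grep OLLAMA_LIBRARY_PATH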


@lennarkivistik commented on GitHub (Oct 6, 2025):

Doing it manually worked with 0.12.4-rc6:
0.12.4-rc6-working.txt

So essentially my upgrade script (updated now) was only copying over the ROCm subtree, leaving /usr/lib/ollama too bare. That worked for 0.12.3 and below but not for 0.12.4-rc and above: the new discovery runner then failed to load any GPU backend and fell back to CPU.

Thanks @dhiltgen!


@lennarkivistik commented on GitHub (Oct 6, 2025):

Wow!
After running my first benchmarks with rc6, this has got to be the best performance boost for ROCm, close to an order of magnitude!

Very impressive!!

[Benchmark graph attached]

For qwen3:4b-instruct-2507-q4_K_M, the prompt eval rate has gone from 227.02 tokens/s to 1905.77 tokens/s on a short "tell a story" prompt.

v0.12.3 topped out around 33k context at ~19 tok/s.
v0.12.4-rc6 extends stable generation past 128k context with dramatically better scaling, so there must be a major optimization in memory handling or the FlashAttention implementation between builds.

| Ollama Version | Model | FlashAttn | Context (tokens) | Batch Size | Duration (s) |
| -- | -- | -- | -- | -- | -- |
| 0.12.4-rc6 | qwen3:4b-instruct-2507-q8_0 | ON | 80 000 | 2 048 | 77 |
| 0.12.4-rc6 | qwen3:4b-instruct-2507-q8_0 | ON | 129 440 | 2 048 | 122 |

@lennarkivistik commented on GitHub (Oct 11, 2025):

Thanks again. As I stated before, @dhiltgen and @jessegross, I'm very happy to help out with testing new RCs that touch anything related to my hardware (Linux / AMD / ROCm), if there is ever a need.
Ollama is awesome! Blog Post About This


Reference: github-starred/ollama#70316