[GH-ISSUE #15277] HIP build missing -ffast-math (CUDA has -use_fast_math) #56285

Open
opened 2026-04-29 10:34:33 -05:00 by GiteaMirror · 1 comment

Originally created by @jodendaal on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15277

Description

The CUDA backend build uses `-use_fast_math` ([ggml-cuda/CMakeLists.txt line 123](https://github.com/ollama/ollama/blob/main/ml/backend/ggml/ggml/src/ggml-cuda/CMakeLists.txt#L123)), but the HIP/ROCm backend does not pass the equivalent `-ffast-math` flag. This parity gap leaves performance on the table for all AMD GPU users.

Impact

Benchmarked on an RX 7900 XTX (gfx1100, RDNA3) with ROCm 6.4, using `qwen2.5-coder:14b` (Q4_K_M):

| Benchmark | Without `-ffast-math` | With `-ffast-math` | Improvement |
|---|---|---|---|
| code-gen | 60.81 t/s | 64.20 t/s | +5.6% |
| short-explanation | 61.01 t/s | 63.44 t/s | +4.0% |
| list-generation | 61.59 t/s | 63.68 t/s | +3.4% |

This is a free ~4% generation speed improvement for all ROCm users with a one-line change.

Fix

Add to `ml/backend/ggml/ggml/src/ggml-hip/CMakeLists.txt`:

```cmake
set(CMAKE_HIP_FLAGS "${CMAKE_HIP_FLAGS} -ffast-math")
```
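If maintainers want the flag to be overridable rather than unconditional, the one-liner could be guarded behind a cache option. This is a hypothetical refinement, not part of the linked PR; the option name `GGML_HIP_FAST_MATH` is illustrative and does not exist in the codebase:

```cmake
# Hypothetical opt-out guard; option name is illustrative, not from the PR.
option(GGML_HIP_FAST_MATH "Enable -ffast-math for the HIP backend" ON)
if(GGML_HIP_FAST_MATH)
    set(CMAKE_HIP_FLAGS "${CMAKE_HIP_FLAGS} -ffast-math")
endif()
```

This would let users who need bitwise-reproducible output disable the optimization with `-DGGML_HIP_FAST_MATH=OFF` at configure time.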

PR: #15276


@Jasdfgh commented on GitHub (Apr 9, 2026):

good catch. the CUDA/HIP parity gap on fast math is a clean example of free performance left on the table. ~4% generation speed for a one-liner is hard to argue against.

-ffast-math on HIP enables a similar class of optimizations to what -use_fast_math does on CUDA — FMA contraction, relaxed floating-point associativity, approximate reciprocals. the accuracy impact is usually small compared to quantization noise in typical inference workloads, though it can affect bitwise reproducibility.

has the linked PR #15276 gotten maintainer attention yet?


Reference: github-starred/ollama#56285