[GH-ISSUE #12928] Flash attention not working when using vulkan #8575

Closed
opened 2026-04-12 21:18:31 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @irfanbacker on GitHub (Nov 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12928

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I am trying to run Ollama with flash attention enabled using Vulkan.

Found this check in code:

// FlashAttentionSupported reports whether every GPU in the list supports flash attention
func FlashAttentionSupported(l []DeviceInfo) bool {
	for _, gpu := range l {
		supportsFA := gpu.Library == "cpu" ||
			gpu.Name == "Metal" || gpu.Library == "Metal" ||
			(gpu.Library == "CUDA" && gpu.DriverMajor >= 7 && !(gpu.ComputeMajor == 7 && gpu.ComputeMinor == 2)) ||
			gpu.Library == "ROCm"

		if !supportsFA {
			return false
		}
	}
	return true
}

This check probably needs to handle the Vulkan case as well?

Relevant log output

time=2025-11-03T17:34:18.254Z level=INFO source=server.go:400 msg="starting runner" cmd="/tmp/go-build1189579383/b001/exe/ollama runner --ollama-engine --port 43265"
time=2025-11-03T17:34:18.366Z level=WARN source=server.go:203 msg="flash attention enabled but not supported by gpu"
time=2025-11-03T17:34:18.366Z level=WARN source=server.go:226 msg="quantized kv cache requested but flash attention disabled" type=q8_0
time=2025-11-03T17:34:18.367Z level=INFO source=server.go:400 msg="starting runner" cmd="/tmp/go-build1189579383/b001/exe/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-18039ff0770a2583cb94a06caed0f70b33e9e8921149ad8ad5b5a7af33c84c63 --port 37037"
time=2025-11-03T17:34:18.367Z level=INFO source=server.go:653 msg="loading model" "model layers"=49 requested=-1
time=2025-11-03T17:34:18.367Z level=INFO source=server.go:658 msg="system memory" total="7.4 GiB" free="3.6 GiB" free_swap="3.3 GiB"
time=2025-11-03T17:34:18.367Z level=INFO source=server.go:665 msg="gpu memory" id=00000000-c600-0000-0000-000000000000 library=Vulkan available="25.9 GiB" free="26.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-11-03T17:34:18.376Z level=INFO source=runner.go:1349 msg="starting ollama engine"
time=2025-11-03T17:34:18.376Z level=INFO source=runner.go:1384 msg="Server listening on 127.0.0.1:37037"
time=2025-11-03T17:34:18.379Z level=INFO source=runner.go:1222 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:131072 KvCacheType: NumThreads:12 GPULayers:49[ID:00000000-c600-0000-0000-000000000000 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-03T17:34:18.402Z level=INFO source=ggml.go:136 msg="" architecture=qwen3moe file_type=Q4_K_M name=cerebras/Qwen3-Coder-REAP-25B-A3B description="This model was obtained by uniformly pruning 20% of experts in Qwen3-Coder-30B-A3B-Instruct using the REAP method.\n" num_tensors=579 num_key_values=44
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /root/ollama/build/lib/ollama/libggml-vulkan.so
load_backend: loaded CPU backend from /root/ollama/build/lib/ollama/libggml-cpu-icelake.so
time=2025-11-03T17:34:18.430Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-11-03T17:34:18.539Z level=WARN source=server.go:958 msg="model request too large for system" requested="9.0 GiB" available="6.9 GiB" total="7.4 GiB" free="3.6 GiB" swap="3.3 GiB"
time=2025-11-03T17:34:18.539Z level=INFO source=runner.go:1222 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:false KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-11-03T17:34:18.539Z level=INFO source=device.go:212 msg="model weights" device=Vulkan0 size="14.0 GiB"
time=2025-11-03T17:34:18.539Z level=INFO source=device.go:217 msg="model weights" device=CPU size="166.9 MiB"
time=2025-11-03T17:34:18.539Z level=INFO source=device.go:223 msg="kv cache" device=Vulkan0 size="12.0 GiB"
time=2025-11-03T17:34:18.539Z level=INFO source=device.go:234 msg="compute graph" device=Vulkan0 size="8.3 GiB"
time=2025-11-03T17:34:18.539Z level=INFO source=device.go:239 msg="compute graph" device=CPU size="4.0 MiB"
time=2025-11-03T17:34:18.539Z level=INFO source=device.go:244 msg="total memory" size="34.5 GiB"
time=2025-11-03T17:34:18.539Z level=INFO source=sched.go:446 msg="Load failed" model=/root/.ollama/models/blobs/sha256-18039ff0770a2583cb94a06caed0f70b33e9e8921149ad8ad5b5a7af33c84c63 error="model requires more system memory (9.0 GiB) than is available (6.9 GiB)"
[GIN] 2025/11/03 - 17:34:18 | 500 |  375.965112ms |   192.168.0.115 | POST     "/api/chat"

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

main branch

GiteaMirror added the bug label 2026-04-12 21:18:31 -05:00

@irfanbacker commented on GitHub (Nov 3, 2025):

It seems this was handled when the Vulkan PR (https://github.com/ollama/ollama/pull/11835) was merged. I have tracked the change in the condition down to this PR: https://github.com/ollama/ollama/pull/12540

Were the Vulkan changes missed during the merge?


Reference: github-starred/ollama#8575