[GH-ISSUE #4979] 0xc0000409 error with llava-phi3 #65185

Closed
opened 2026-05-03 19:57:07 -05:00 by GiteaMirror · 4 comments

Originally created by @razvanab on GitHub (Jun 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4979

What is the issue?

I get the 0xc0000409 error with llava-phi3 when OLLAMA_FLASH_ATTENTION is enabled.

RAM = 32GB

GPU = GTX 1060
VRAM = 6GB
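
For context, 0xc0000409 is Windows' STATUS_STACK_BUFFER_OVERRUN fail-fast code. Below is a minimal repro sketch, assuming `ollama` is on PATH and `llava-phi3` has already been pulled; the `OLLAMA_FLASH_ATTENTION` env var and the `/api/generate` endpoint are the documented ones, everything else is illustrative:

```python
import json
import os
import subprocess
import time
import urllib.request

# Start the server with flash attention forced on, as in the report.
env = dict(os.environ, OLLAMA_FLASH_ATTENTION="1")
server = subprocess.Popen(["ollama", "serve"], env=env)
time.sleep(5)  # crude wait for the server to come up

# One non-streaming generate request against llava-phi3.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llava-phi3",
        "prompt": "Describe this image.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
finally:
    # Per the report, on the affected setup the model runner instead
    # exits with 0xc0000409 when handling the request.
    server.terminate()
```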

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.1.42

GiteaMirror added the bug label 2026-05-03 19:57:07 -05:00

@AncientMystic commented on GitHub (Jun 11, 2024):

Pascal GPUs do not officially support flash attention because they lack tensor cores; they need the FP32 (non tensor core) vector kernel, which was only recently added to llama.cpp. I am not sure whether Ollama is on a recent enough version of llama.cpp to include it. That kernel should provide almost a 2x boost in tokens per second, and I notice no difference on my Tesla P4.

Not sure if that is the problem here at all, but it could be a part of it.
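
For reference, tensor cores first appeared with Volta (compute capability 7.0), so Pascal parts like the GTX 1060 and Tesla P4 report 6.x. A quick way to check your card, assuming a CUDA build of PyTorch is installed:

```python
import torch  # assumes a CUDA-enabled PyTorch build

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}")
    if major < 7:
        # Pre-Volta: no tensor cores, so flash attention would need the
        # FP32 (non tensor core) vector kernel path described above.
        print("Pre-Volta GPU: FP32 vector flash-attention kernel required.")
else:
    print("No CUDA device visible.")
```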


@razvanab commented on GitHub (Jun 11, 2024):

I see. Thank you.


@dhiltgen commented on GitHub (Jun 18, 2024):

It sounds like we can close this issue. Eventually our goal is to get flash attention turned on automatically based on whether the GPU can support it, but until then the experimental env var can be used to enable it manually.
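
Purely as an illustration of the gating described here (this is not Ollama's actual code, and the capability cutoff is an assumption):

```python
import os

# Illustrative sketch only -- not Ollama's implementation. Today the user
# opts in via the experimental env var; the stated goal is to also decide
# from the GPU's capabilities. The (7, 0) cutoff (tensor-core GPUs, i.e.
# Volta and newer) is an assumption.
def flash_attention_enabled(compute_capability: tuple[int, int]) -> bool:
    opted_in = os.getenv("OLLAMA_FLASH_ATTENTION", "").lower() in ("1", "true")
    gpu_supported = compute_capability >= (7, 0)
    return opted_in and gpu_supported

print(flash_attention_enabled((6, 1)))  # GTX 1060 (Pascal) -> False
print(flash_attention_enabled((8, 9)))  # Ada GPU -> True if the env var is set
```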


@AncientMystic commented on GitHub (Jun 18, 2024):

> It sounds like we can close this issue. Eventually our goal is to get flash attention turned on automatically based on whether the GPU can support it, but until then the experimental env var can be used to enable it manually.

It does seem to be solved. It would be really nice to see the FP32 vector kernel flash attention supported in Ollama (it doesn't seem to be yet, but from my understanding it landed in llama.cpp within the last month), since many of us are still on Pascal, which also has crippled FP16 performance. That, plus the recently added KV cache quantization, should really benefit Pascal cards.
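
For anyone who wants to test this combination directly against llama.cpp, a sketch of the invocation described above (flag names as of mid-2024 llama.cpp builds, so verify against `llama-cli --help`; the model path is a placeholder):

```python
import subprocess

# Hypothetical llama.cpp run combining flash attention with a quantized
# KV cache -- the pairing this comment argues would help Pascal cards.
subprocess.run([
    "llama-cli",
    "-m", "llava-phi3.gguf",  # placeholder model path
    "-fa",                    # --flash-attn: enable flash attention
    "-ctk", "q8_0",           # --cache-type-k: quantize the K cache
    "-ctv", "q8_0",           # --cache-type-v: quantize the V cache
    "-p", "Hello",
])
```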
