[GH-ISSUE #5390] deepseek-coder-v2-lite flash attention not enabled #3371

Closed
opened 2026-04-12 14:00:02 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @reddev-aroy on GitHub (Jun 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5390

What is the issue?

As soon as the context length limit is reached for deepseek-coder-v2-lite, the model just repeats previous answers and keeps looping, even when asked for something else. Enabling flash attention in LM Studio resolves this, but the issue still exists in the latest ollama 0.1.48.
I suspect it may be an issue with the model itself, but flash attention seems to resolve it in LM Studio. Need help resolving this in Ollama, since running this model CPU-only (num_gpu 0) is faster for me in ollama than in LM Studio, at least.

Ollama version - 0.1.48
Model used - deepseek-coder-v2-lite-instruct-Q5_K_M

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.1.48

GiteaMirror added the bug label 2026-04-12 14:00:02 -05:00
Author
Owner

@rick-github commented on GitHub (Jun 30, 2024):

You can enable flash attention in ollama by setting OLLAMA_FLASH_ATTENTION=1 in the environment.

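For example, when launching the server manually from a shell (a minimal sketch; the systemd service on Linux and the macOS menu-bar app each have their own mechanism for setting environment variables):

    OLLAMA_FLASH_ATTENTION=1 ollama serve
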
Author
Owner

@reddev-aroy commented on GitHub (Jun 30, 2024):

> You can enable flash attention in ollama by setting OLLAMA_FLASH_ATTENTION=1 in the environment.

Tried the below and restarted ollama; it doesn't seem to work. Maybe ollama is turning flash_attn off automatically for the deepseek-coder-v2 architecture? Not sure.

launchctl setenv OLLAMA_FLASH_ATTENTION 1

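One way to confirm the variable is at least visible at the launchd level (a sketch; this only checks launchd's environment, and assumes the Ollama app is fully quit and relaunched afterwards so it picks up the new value):

    launchctl getenv OLLAMA_FLASH_ATTENTION
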
Author
Owner

@rick-github commented on GitHub (Jun 30, 2024):

It's an issue with llama.cpp: it turns off flash attention (https://github.com/ggerganov/llama.cpp/blob/1c5eba6f8e628fb0a98afb27d8aaeb3b0e136451/src/llama.cpp#L17412) when the K and V head sizes differ.

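For reference, the guard at the linked revision looks roughly like this (paraphrased from that llama.cpp commit; exact names are as of that revision and may have changed since). DeepSeek-V2's attention uses different per-head K and V sizes, so the condition trips and flash attention is silently forced off:

    // in llama_new_context_with_model(): force flash attention off when
    // the per-head K and V embedding sizes differ
    if (params.flash_attn && model->hparams.n_embd_head_k != model->hparams.n_embd_head_v) {
        LLAMA_LOG_WARN("%s: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off\n", __func__);
        params.flash_attn = false;
    }
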
