[GH-ISSUE #5975] Deepseek2 with large context crashes with "Deepseek2 does not support K-shift" #3735

Closed
opened 2026-04-12 14:32:43 -05:00 by GiteaMirror · 21 comments

Originally created by @balckwilliam on GitHub (Jul 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5975

Originally assigned to: @jessegross on GitHub.

### What is the issue?

```
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/src/llama.cpp:15147: false && "Deepseek2 does not support K-shift"
```

### OS

Linux, Windows

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.3.0

GiteaMirror added the bug label 2026-04-12 14:32:43 -05:00

@IcedCoffeee commented on GitHub (Aug 2, 2024):

I'm running into this too with deepseek-coder-v2. From what I can tell, it's from deepseek v2 [not supporting](https://github.com/ggerganov/llama.cpp/blob/398ede5efeb07b9adf9fbda7ea63f630d476a792/src/llama.cpp#L15099) KV/prompt caching properly.
I think https://github.com/ollama/ollama/pull/5760 or https://github.com/ollama/ollama/pull/4632 would fix this if one of them gets merged.


@rick-github commented on GitHub (Aug 18, 2024):

Workaround discussed in https://github.com/ggerganov/llama.cpp/issues/8862.


@httpjamesm commented on GitHub (Aug 18, 2024):

Can confirm this works. I created a simple Ollama Modelfile with the following contents matching the newfound limitations:

```
FROM deepseek-coder-v2
PARAMETER num_ctx 24576
PARAMETER num_predict 8192
```

Running aider.chat with this new model configuration works flawlessly.
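For reference, the same limits can also be applied per request through Ollama's REST API `options` field instead of baking them into a new model; a minimal sketch, assuming the default local endpoint (the prompt is just illustrative):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2",
  "prompt": "Write a quicksort in Python.",
  "options": { "num_ctx": 24576, "num_predict": 8192 }
}'
```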


@talhaanwarch commented on GitHub (Aug 23, 2024):

@httpjamesm where can I set these params?


@rick-github commented on GitHub (Aug 23, 2024):

Create a modelfile:

```
cat > Modelfile <<EOF
FROM deepseek-coder-v2
PARAMETER num_ctx 24576
PARAMETER num_predict 8192
EOF
```

Create the model:

```
ollama create deepseek-coder-v2-fixed -f Modelfile
```
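After that, the model can be run like any other; a quick smoke test (the prompt is just an example):

```
ollama run deepseek-coder-v2-fixed "Summarize what a Modelfile is in one sentence."
```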

@U0M0Z commented on GitHub (Aug 26, 2024):

Thanks for pointing to a solution @rick-github

I have tried to use the exact code provided in your last comment to generate a fixed model, but had mixed results. While overall the new deepseek-coder-v2-fixed is working, it is also extremely slow and it generates only a token every 2s or so. Is this expected/normal?


@rick-github commented on GitHub (Aug 26, 2024):

Can you post server logs?


@smarthiesbs commented on GitHub (Sep 29, 2024):

> Thanks for pointing to a solution @rick-github
>
> I have tried to use the exact code provided in your last comment to generate a fixed model, but had mixed results. While overall the new deepseek-coder-v2-fixed is working, it is also extremely slow and it generates only a token every 2s or so. Is this expected/normal?

I also made the change and found the same thing: it works, the model is now stable, but it is very slow. Is there a solution for this?


@rick-github commented on GitHub (Sep 29, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@eddyfadeev commented on GitHub (Nov 17, 2024):

Confirming this works flawlessly with @rick-github's solution. Runs very fast on an RTX 4090.


@FireAngelx commented on GitHub (Dec 17, 2024):

> num_predict

It just limits the total token size and the response token size, which doesn't help for long text. DeepSeek V2.5 supports a 128k-token context.


@rick-github commented on GitHub (Dec 17, 2024):

Deepseek2/2.5 doesn't support K-shift. All the Modelfile parameters do is reduce the chance of a shift happening; you can use whatever numbers you want, as long as the total of input tokens + output tokens never exceeds the context window.
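To make the arithmetic concrete, a small sketch of the budget implied by the example Modelfile values used earlier in this thread (the specific numbers are just that example, not hard limits):

```
# No K-shift means: input_tokens + output_tokens must stay <= num_ctx.
# With num_ctx=24576 and num_predict=8192, the input budget is:
echo $((24576 - 8192))   # 16384 tokens available for the prompt
```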


@xihuai18 commented on GitHub (Jan 31, 2025):

Why not fix this in the official Ollama model library?


@rick-github commented on GitHub (Jan 31, 2025):

https://github.com/ggerganov/llama.cpp/issues/7343


@ttys0001 commented on GitHub (Feb 4, 2025):

> Thanks for pointing to a solution @rick-github
>
> I have tried to use the exact code provided in your last comment to generate a fixed model, but had mixed results. While overall the new deepseek-coder-v2-fixed is working, it is also extremely slow and it generates only a token every 2s or so. Is this expected/normal?

The issue has indeed been resolved, but I am also using the 671b deepseek r1, which originally ran at 14 tokens/s and has now dropped to 2 tokens/s. Why is this happening?


@rick-github commented on GitHub (Feb 4, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@oOoOoOoll commented on GitHub (Feb 14, 2025):

> The issue has indeed been resolved, but I am also using the 671b deepseek r1, which originally ran at 14 tokens/s and has now dropped to 2 tokens/s. Why is this happening?

I have the same problem; has anyone found a solution?


@rick-github commented on GitHub (Feb 14, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@ice6 commented on GitHub (Feb 27, 2025):

@ttys0001 @oOoOoOoll have your problems been solved?


@somera commented on GitHub (Aug 22, 2025):

> Create a modelfile:
>
> ```
> cat > Modelfile <<EOF
> FROM deepseek-coder-v2
> PARAMETER num_ctx 24576
> PARAMETER num_predict 8192
> EOF
> ```

@rick-github is this needed in Ollama v0.11.6 too?


@rick-github commented on GitHub (Aug 22, 2025):

Fixed in 0.6.4 with #9433.
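So on 0.6.4 or later the Modelfile workaround above should no longer be needed; a quick way to check which version you're running:

```
ollama --version
```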
