[GH-ISSUE #5975] Deepseek2 with large context crashes with "Deepseek2 does not support K-shift" #3735

Closed
opened 2026-04-12 14:32:43 -05:00 by GiteaMirror · 21 comments

Originally created by @balckwilliam on GitHub (Jul 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5975

Originally assigned to: @jessegross on GitHub.

### What is the issue?

```
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/src/llama.cpp:15147: false && "Deepseek2 does not support K-shift"
```

### OS

Linux, Windows

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.3.0

GiteaMirror added the bug label 2026-04-12 14:32:43 -05:00

@IcedCoffeee commented on GitHub (Aug 2, 2024):

I'm running into this too with deepseek-coder-v2. From what I can tell, it's from deepseek v2 [not supporting](https://github.com/ggerganov/llama.cpp/blob/398ede5efeb07b9adf9fbda7ea63f630d476a792/src/llama.cpp#L15099) KV/prompt caching properly.
I think https://github.com/ollama/ollama/pull/5760 or https://github.com/ollama/ollama/pull/4632 would fix this if one of them gets merged.


@rick-github commented on GitHub (Aug 18, 2024):

Workaround discussed in https://github.com/ggerganov/llama.cpp/issues/8862.


@httpjamesm commented on GitHub (Aug 18, 2024):

Can confirm this works. I created a simple Ollama Modelfile with the following contents matching the newfound limitations:

```
FROM deepseek-coder-v2
PARAMETER num_ctx 24576
PARAMETER num_predict 8192
```

Running aider.chat with this new model configuration works flawlessly.
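For reference, the same limits can also be applied per request through Ollama's REST API `options` field instead of baking them into a new model; a minimal sketch, assuming the default local endpoint (the prompt is just illustrative):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2",
  "prompt": "Write a quicksort in Python.",
  "options": { "num_ctx": 24576, "num_predict": 8192 }
}'
```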


@talhaanwarch commented on GitHub (Aug 23, 2024):

@httpjamesm where can I set these params?


@rick-github commented on GitHub (Aug 23, 2024):

Create a modelfile:

```
cat > Modelfile <<EOF
FROM deepseek-coder-v2
PARAMETER num_ctx 24576
PARAMETER num_predict 8192
EOF
```

Create the model:

```
ollama create deepseek-coder-v2-fixed -f Modelfile
```
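After that, the model can be run like any other; a quick smoke test (the prompt is just an example):

```
ollama run deepseek-coder-v2-fixed "Summarize what a Modelfile is in one sentence."
```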

@U0M0Z commented on GitHub (Aug 26, 2024):

Thanks for pointing to a solution @rick-github

I have tried to use the exact code provided in your last comment to generate a fixed model, but had mixed results. While overall the new deepseek-coder-v2-fixed is working, it is also extremely slow and it generates only a token every 2s or so. Is this expected/normal?


@rick-github commented on GitHub (Aug 26, 2024):

Can you post server logs?


@smarthiesbs commented on GitHub (Sep 29, 2024):

> Thanks for pointing to a solution @rick-github
>
> I have tried to use the exact code provided in your last comment to generate a fixed model, but had mixed results. While overall the new deepseek-coder-v2-fixed is working, it is also extremely slow and it generates only a token every 2s or so. Is this expected/normal?

I also made the change and found the same thing: it works, the model is now stable, but it is very slow. Is there a solution for this?


@rick-github commented on GitHub (Sep 29, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@eddyfadeev commented on GitHub (Nov 17, 2024):

Confirming this works flawlessly with @rick-github's solution. Runs very fast on an RTX 4090.


@FireAngelx commented on GitHub (Dec 17, 2024):

> num_predict

It just limits the total token size and the response token size, which doesn't help for long text. DeepSeek V2.5 supports a 128k-token context.


@rick-github commented on GitHub (Dec 17, 2024):

Deepseek2/2.5 doesn't support K-shift. All the Modelfile parameters do is reduce the chance of a shift happening; you can use whatever numbers you want, as long as the total of input tokens + output tokens never exceeds the context window.
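To make the arithmetic concrete, a small sketch of the budget implied by the example Modelfile values used earlier in this thread (the specific numbers are just that example, not hard limits):

```
# No K-shift means: input_tokens + output_tokens must stay <= num_ctx.
# With num_ctx=24576 and num_predict=8192, the input budget is:
echo $((24576 - 8192))   # 16384 tokens available for the prompt
```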


@xihuai18 commented on GitHub (Jan 31, 2025):

Why not fix this in the official Ollama model library?


@rick-github commented on GitHub (Jan 31, 2025):

https://github.com/ggerganov/llama.cpp/issues/7343


@ttys0001 commented on GitHub (Feb 4, 2025):

> Thanks for pointing to a solution @rick-github
>
> I have tried to use the exact code provided in your last comment to generate a fixed model, but had mixed results. While overall the new deepseek-coder-v2-fixed is working, it is also extremely slow and it generates only a token every 2s or so. Is this expected/normal?

The issue has indeed been resolved, but I am also using the 671b deepseek r1, which originally ran at 14 tokens/s and has now dropped to 2 tokens/s. Why is this happening?


@rick-github commented on GitHub (Feb 4, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@oOoOoOoll commented on GitHub (Feb 14, 2025):

> The issue has indeed been resolved, but I am also using the 671b deepseek r1, which originally ran at 14 tokens/s and has now dropped to 2 tokens/s. Why is this happening?

I have the same problem; has anyone found a solution?


@rick-github commented on GitHub (Feb 14, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@ice6 commented on GitHub (Feb 27, 2025):

@ttys0001 @oOoOoOoll have your problems been solved?


@somera commented on GitHub (Aug 22, 2025):

> Create a modelfile:
>
> ```
> cat > Modelfile <<EOF
> FROM deepseek-coder-v2
> PARAMETER num_ctx 24576
> PARAMETER num_predict 8192
> EOF
> ```

@rick-github is this needed in Ollama v0.11.6 too?


@rick-github commented on GitHub (Aug 22, 2025):

Fixed in 0.6.4 with #9433.
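So on 0.6.4 or later the Modelfile workaround above should no longer be needed; a quick way to check which version you're running:

```
ollama --version
```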
