[GH-ISSUE #9010] Error:" the current context does not support k-shift " deepseek-r1:671b crashes in memory after answering several questions and then reloads to memory again #52368

Closed
opened 2026-04-28 23:06:12 -05:00 by GiteaMirror · 29 comments

Originally created by @sobermh on GitHub (Feb 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9010

What is the issue?

Hey Ollama community,

Description

After I run deepseek-r1:671b, the first two questions are answered normally, but when I ask the 3rd question the model crashes in memory and starts trying to reload into memory.

I'm running in a pure-CPU environment, and after loading deepseek-r1:671b there is some memory left (details attached).

What should I do?

Tried methods

After checking past issues, I tried creating a new Modelfile and running a new model, but after several Q&A rounds the same problem still occurs.

Modelfile

```
ver@ver-PowerEdge-R750:~$ cat Modelfile
FROM deepseek-r1:671b
PARAMETER num_ctx 4096
PARAMETER num_predict 512
```

Memory after deployment

![Image](https://github.com/user-attachments/assets/278d4868-bf11-4305-b84c-00830f46d1cb)

ollama service config

![Image](https://github.com/user-attachments/assets/8707a873-2340-4c8a-82da-ec2a8ad65004)

Error log

![Image](https://github.com/user-attachments/assets/9683632a-b586-42bf-aa23-2841fbb9915c)

Relevant log output

No response
OS

Linux

GPU

No response

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-28 23:06:12 -05:00

@rick-github commented on GitHub (Feb 11, 2025):

https://github.com/ollama/ollama/issues/5975


@sobermh commented on GitHub (Feb 11, 2025):

@rick-github I tried creating a Modelfile and running the newly created model, but that only increased the number of Q&A rounds before failure; once the count of turns grows, the problem still occurs.

My Modelfile:

```
ver@ver-PowerEdge-R750:~$ cat Modelfile
FROM deepseek-r1:671b
PARAMETER num_ctx 4096
PARAMETER num_predict 512
```


@rick-github commented on GitHub (Feb 11, 2025):

If your input tokens + output tokens > num_ctx, the model will fail due to k-shift. So if you want to use longer prompts (multiple rounds of Q&A), you need to increase num_ctx.

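For anyone hitting this: num_ctx can also be raised per request instead of baking it into a Modelfile, by passing it in the options object of Ollama's HTTP API (/api/chat and the num_ctx / num_predict options are part of the documented API). A minimal Go sketch; the 8192 value and the prompt are illustrative, not a recommendation from this thread:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Raise the context window for this request only via options.num_ctx.
	body := []byte(`{
		"model": "deepseek-r1:671b",
		"messages": [{"role": "user", "content": "hello"}],
		"stream": false,
		"options": {"num_ctx": 8192, "num_predict": 512}
	}`)
	resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```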

@sobermh commented on GitHub (Feb 11, 2025):

@rick-github How much do I need to increase num_ctx? Do I still have to control the user's input? If the input is too long, input tokens + output tokens > num_ctx will still cause the model to crash in memory.


@rick-github commented on GitHub (Feb 11, 2025):

Increase num_ctx so that it's big enough to hold the input tokens and the output tokens. You still have to control the user's input.

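"Control the user's input" here means the client has to bound the conversation history it resends each turn. A rough sketch of one way to do that, dropping the oldest turns until an estimated token count plus the output budget fits in num_ctx; the 4-characters-per-token estimate and all helper names are assumptions, not part of Ollama:

```go
package main

import "fmt"

type message struct {
	Role    string
	Content string
}

// estimateTokens is a crude stand-in for a real tokenizer:
// roughly 4 characters per token for English text.
func estimateTokens(msgs []message) int {
	n := 0
	for _, m := range msgs {
		n += len(m.Content)/4 + 1
	}
	return n
}

// trimHistory drops the oldest messages until the estimated history
// size plus the reserved output budget fits inside numCtx.
func trimHistory(msgs []message, numCtx, numPredict int) []message {
	for len(msgs) > 1 && estimateTokens(msgs)+numPredict > numCtx {
		msgs = msgs[1:] // drop the oldest turn
	}
	return msgs
}

func main() {
	history := []message{
		{"user", "first question ..."},
		{"assistant", "first answer ..."},
		{"user", "second question ..."},
	}
	trimmed := trimHistory(history, 4096, 512)
	fmt.Printf("kept %d of %d messages\n", len(trimmed), len(history))
}
```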

@sobermh commented on GitHub (Feb 11, 2025):

@rick-github When I use the default configuration and chat with deepseek-r1:671b, even simple questions cause the model to crash in memory after the 3rd or 4th conversation turn.

  1. Why does this problem occur under the default configuration?

  2. If the configuration is not changed, can I upgrade the hardware to solve this problem?


@fishreyuu commented on GitHub (Feb 11, 2025):

You can temporarily modify this line of code to keep the model service from crashing:

llama/runner/runner.go line 125: `discard := len(inputs) - s.cache.numCtx + 2048`

and set the following when building the model:

num_ctx 8192
num_predict 4096

But this is only a temporary workaround.

The long-term fix should be to look at the k-shift implementation and enable a sliding window (sliding_window) for the deepseek2 architecture instead of calling k-shift, because the deepseek2 architecture does not support k-shift.

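An annotated reading of that proposed patch, as a standalone sketch (truncateInputs and its arguments are placeholders, not ollama's actual runner code):

```go
package main

import "fmt"

// Sketch of the proposed workaround, not the upstream implementation: when
// the accumulated prompt no longer fits, drop enough of the oldest tokens to
// leave ~2048 free slots, so the k-shift path (which the deepseek2
// architecture does not support) is hit far less often.
func truncateInputs(inputs []int, cacheNumCtx int) []int {
	discard := len(inputs) - cacheNumCtx + 2048
	if discard <= 0 {
		return inputs // still fits with headroom
	}
	return inputs[discard:] // discard the oldest tokens
}

func main() {
	inputs := make([]int, 8192) // pretend the history already fills num_ctx
	fmt.Println("tokens kept:", len(truncateInputs(inputs, 8192)))
}
```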

@sobermh commented on GitHub (Feb 12, 2025):

@fishreyuu I wonder whether the hardware configuration is simply insufficient, because even with short conversations the model crashes in memory after the third or fourth Q&A. That shouldn't be normal.


@fishreyuu commented on GitHub (Feb 12, 2025):

@sobermh You can reduce them:

PARAMETER num_ctx 2048
PARAMETER num_predict 512

and set OLLAMA_NUM_PARALLEL=1 at runtime.


@sobermh commented on GitHub (Feb 12, 2025):

@fishreyuu Can I avoid this error by adding more memory?


@sobermh commented on GitHub (Feb 12, 2025):

> If your input tokens + output tokens > num_ctx, the model will fail due to k-shift. So if you want to use longer prompts (multiple Q&A), you need to increase num_ctx.

@fishreyuu But what this maintainer is saying is that num_ctx must not be too small.
I used this Modelfile:

```
FROM deepseek-r1:671b
PARAMETER num_ctx 4096
PARAMETER num_predict 512
```

with OLLAMA_NUM_PARALLEL=1.

Once the number of answers grows, the model still crashes in memory.


@rick-github commented on GitHub (Feb 12, 2025):

```
FROM deepseek-r1:671b
PARAMETER num_ctx 163840
PARAMETER num_predict 8192
```

@sobermh commented on GitHub (Feb 13, 2025):

@rick-github
My memory size is only 503 GB.

  1. If I upgrade my memory later, will this new model still have the same problem?
  2. Can you provide me with a configuration adapted to my current memory size?

```
root@ver-PowerEdge-R750:/home/ver# ollama run deepseek-r1:671b-fixed
Error: model requires more system memory (3726.0 GiB) than is available (498.3 GiB)
root@ver-PowerEdge-R750:/home/ver# cat Modelfile
FROM deepseek-r1:671b
PARAMETER num_ctx 163840
PARAMETER num_predict 8192
```

@YoungerLwb commented on GitHub (Feb 13, 2025):

@sobermh I also encountered this issue. Can I communicate with you via ?


@sobermh commented on GitHub (Feb 13, 2025):

@YoungerLwb yep
409788696@qq.com


@oOoOoOoll commented on GitHub (Feb 14, 2025):

I think if memory round * num_predict < num_ctx, the k-shift problem will be solved.


@sobermh commented on GitHub (Feb 14, 2025):

@oOoOoOoll What does "memory round" mean?


@oOoOoOoll commented on GitHub (Feb 14, 2025):

> @oOoOoOoll What does "memory round" mean?

The conversation history, i.e. the number of conversation-history rounds.


@sobermh commented on GitHub (Feb 14, 2025):

> > @oOoOoOoll What does "memory round" mean?
>
> The conversation history, i.e. the number of conversation-history rounds.

I think you might be right. I just want to know if there is a way to avoid this situation, i.e. make the inequality always hold.

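A back-of-envelope illustration of that inequality, using the num_ctx and num_predict values from this thread (the ~100 prompt tokens per question is an assumed figure; r1's long chain-of-thought output makes real turns much heavier, which is roughly consistent with the reported crashes after the 3rd or 4th question):

```go
package main

import "fmt"

func main() {
	const numCtx, numPredict, promptTokens = 4096, 512, 100 // promptTokens is assumed
	total := 0
	for round := 1; ; round++ {
		total += promptTokens + numPredict // the whole history is resent every round
		if total > numCtx {
			fmt.Printf("context overflows on round %d (%d > %d tokens)\n", round, total, numCtx)
			break
		}
	}
}
```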

@mariaccc commented on GitHub (Feb 24, 2025):

> FROM deepseek-r1:671b
> PARAMETER num_ctx 163840
> PARAMETER num_predict 8192

@rick-github how do I calculate a suitable num_ctx? Does it depend on the GPU?


@rick-github commented on GitHub (Feb 24, 2025):

Choose a size that allows the model to process the input tokens and generate output tokens to satisfy the requirements of the task you want to use an LLM for. If you don't have any particular requirements, set it as large as you can without causing the model to spill to system RAM.

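One way to turn that advice into arithmetic: KV-cache memory grows roughly linearly with num_ctx, so the largest safe value is about the free memory left after the weights, divided by the per-token cache cost. A sketch with entirely hypothetical numbers (the per-token cost depends on model architecture and quantization; one could estimate it by comparing the memory ollama reports at two different num_ctx values):

```go
package main

import "fmt"

func main() {
	// All figures are placeholders, not measurements of deepseek-r1:671b.
	const (
		availableGiB   = 498.3 // memory available to the server (from the log above)
		weightsGiB     = 400.0 // hypothetical resident size of the weights
		gibPerCtxToken = 0.001 // hypothetical KV-cache cost per context token
	)
	free := availableGiB - weightsGiB
	fmt.Printf("largest num_ctx that fits: ~%d tokens\n", int(free/gibPerCtxToken))
}
```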

@mariaccc commented on GitHub (Feb 25, 2025):

@rick-github thanks, num_ctx 8192 works for me.


@sobermh commented on GitHub (Feb 25, 2025):

@mariaccc Could you share your server configuration and Modelfile details?


@mariaccc commented on GitHub (Feb 25, 2025):

@sobermh 8×A100. When I increase num_ctx, the PROCESSOR column in `ollama ps` changes to CPU/GPU; with num_ctx 8192 it runs fully on GPU.


@sobermh commented on GitHub (Feb 25, 2025):

@mariaccc What is the size of your memory?


@mariaccc commented on GitHub (Feb 25, 2025):

@sobermh 500G


@sobermh commented on GitHub (Feb 25, 2025):

@mariaccc thanks!


@leekaimao commented on GitHub (Feb 25, 2025):

> @sobermh 8×A100. When I increase num_ctx, the PROCESSOR column in ollama ps changes to CPU/GPU; with num_ctx 8192 it runs fully on GPU.

I don't understand.


@leekaimao commented on GitHub (Feb 25, 2025):

I have the same setup as you, with 8 A100s and 500G. Could you please advise on the specific configurations, including the service setup and modelfile configurations? @mariaccc

Reference: github-starred/ollama#52368