[GH-ISSUE #8614] Problems with deepseek-r1:671b, ollama keeps crashing on long answers #67629

Closed
opened 2026-05-04 11:05:51 -05:00 by GiteaMirror · 11 comments
Owner

Originally created by @fabiounixpi on GitHub (Jan 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8614

What is the issue?

Hi all,
Hi all,
I'm using an R960 with 2TB of RAM, so RAM is not a problem here. I'm experiencing constant crashes of ollama 0.5.7 with deepseek-r1:671b, even after increasing the context window with the command /set parameter num_ctx 4096.
I also tried a second system, an R670 CSP with 1TB of RAM, but the problem occurs in the same way.
I'm not able to use a GPU due to the massive size of the model; anyway, plenty of cores do the job for my current purposes.

The OSes are Ubuntu 22.04.5 and 24.04.1.

OS

Linux

GPU

No response

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug label 2026-05-04 11:05:51 -05:00
Author
Owner

@rick-github commented on GitHub (Jan 27, 2025):

Server logs will give more info, but this is likely https://github.com/ollama/ollama/issues/5975


@fabiounixpi commented on GitHub (Jan 27, 2025):

ollama.txt


@rick-github commented on GitHub (Jan 27, 2025):

```
Jan 27 16:32:52 r670-1 ollama[1319558]: llama.cpp:11968: The current context does not support K-shift
```

#5975


@fabiounixpi commented on GitHub (Jan 27, 2025):

@rick-github, so there is no way to increase the context window and prevent the crash, am I right?


@rick-github commented on GitHub (Jan 27, 2025):

Increase the context but also limit the number of output tokens. If the number of input tokens + number of output tokens < `num_ctx`, there will be no problem.

For example: you expect that you will have up to 10,000 input tokens. You expect to have a maximum of 5,000 output tokens. Set `num_ctx` to 15,000 and `num_predict` to 5,000.
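The budget rule above can be sketched as a quick check (hypothetical numbers for illustration; this is the arithmetic described in the comment, not ollama's internal logic):

```python
# Token-budget rule: generation is safe as long as the input tokens plus
# the capped output tokens (num_predict) fit within the context window.
def fits_context(num_input: int, num_predict: int, num_ctx: int) -> bool:
    """Return True when a request's input plus capped output fits in num_ctx."""
    return num_input + num_predict <= num_ctx

# Expecting up to 10,000 input tokens, capping output at 5,000:
print(fits_context(10_000, 5_000, 15_000))  # True: 15,000 fits exactly
print(fits_context(10_000, 5_000, 8_192))   # False: would overflow the context
```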


@fabiounixpi commented on GitHub (Jan 27, 2025):

OK, I'm trying... just a question: I don't understand why, after the command `/set parameter num_ctx <value>`, the value on the runtime command line is four times the one I set. To give an example, from the ollama prompt I run `/set parameter num_ctx 8192`, but with the htop command I see `--ctx-size 32768` among the parameters passed at run time. Is that normal?


@rick-github commented on GitHub (Jan 27, 2025):

ollama can do multiple parallel completions. This is controlled by `OLLAMA_NUM_PARALLEL`; if unset, the default is either 4 or 1 depending on how much RAM your system has. Each completion slot has a context size as given by `num_ctx`. The total context allocated is given by `ctx-size`, so total context is `num_ctx` * `OLLAMA_NUM_PARALLEL` = 8192 * 4 = 32768.
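The multiplication described above can be spelled out as (a sketch of the arithmetic, not ollama source code; the default of 4 parallel slots is taken from the comment):

```python
# How the runner's --ctx-size relates to the per-request num_ctx:
# the total KV-cache context is split across parallel completion slots.
def total_ctx_size(num_ctx: int, num_parallel: int = 4) -> int:
    """Total context allocated across OLLAMA_NUM_PARALLEL completion slots."""
    return num_ctx * num_parallel

print(total_ctx_size(8192, 4))  # 32768, matching the --ctx-size seen in htop
print(total_ctx_size(8192, 1))  # 8192, when OLLAMA_NUM_PARALLEL=1
```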


@usernaamee commented on GitHub (Jan 29, 2025):

Did this solve your issue?


@mdovgialo commented on GitHub (Jan 30, 2025):

This bug makes DeepSeek R1 unusable, as it needs to generate quite a long answer in its chain of thought. The server also should not crash completely just because of a long answer... I think this is a pretty important issue, preventing people from using one of the most advanced open-source models.


@fabiounixpi commented on GitHub (Jan 30, 2025):

Write a Modelfile with this:

```
FROM deepseek-r1:671b
PARAMETER num_gpu 0
PARAMETER num_ctx 16384
PARAMETER num_predict 10240
```

and create a new instance with the `ollama create` command.

This way it is now usable for me. Keep in mind that with those settings, RAM usage goes up to nearly 700GB, so you definitely need a server with at least 1TB of RAM.

Also, to avoid reloading the model after every long pause, pass `--keepalive=<num>h` on the command line when launching ollama, where h stands for hours.

@mdovgialo commented on GitHub (Jan 30, 2025):

Thanks for the tip, I will experiment with these numbers. I have less RAM, so I'm using the quantized versions from unsloth. But still, the ollama server shouldn't try to use unsupported features only to crash on the most hyped model...


Reference: github-starred/ollama#67629