[GH-ISSUE #11143] ollama reasoning time is too long #7347

Closed
opened 2026-04-12 19:24:03 -05:00 by GiteaMirror · 1 comment

Originally created by @dajima on GitHub (Jun 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11143

What is the issue?

This is the Ollama log. Many requests have an inference time of more than 5 hours; by the time they finish, my application has already exited. What causes this, and can I limit the inference time through configuration?
[GIN] 2025/06/18 - 17:15:26 | 200 | 20.368408329s | 172.22.0.1 | POST "/api/generate"
[GIN] 2025/06/18 - 17:16:09 | 200 | 5h55m37s | 192.168.18.71 | POST "/api/chat"
[GIN] 2025/06/18 - 17:18:19 | 200 | 12.776950299s | 172.22.0.1 | POST "/api/generate"
[GIN] 2025/06/18 - 17:21:17 | 200 | 9.018881037s | 172.22.0.1 | POST "/api/generate"

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 19:24:03 -05:00

@rick-github commented on GitHub (Jun 20, 2025):

Long processing time usually means that the model lost coherence and is generating random tokens. This can be triggered when the model exceeds the size of the context window and the buffer is shifted, resulting in the loss of tokens from the head of the buffer. There are two mitigations you can try: increasing the size of the [context window](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size), or putting a limit on the number of tokens that can be generated with [`num_predict`](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values:~:text=stop%20%22AI%20assistant%3A%22-,num_predict,-Maximum%20number%20of).
