[GH-ISSUE #7645] The parameter 'keep_alive' is invalid when CPU is at 100% #66933

Closed
opened 2026-05-04 08:54:00 -05:00 by GiteaMirror · 11 comments

Originally created by @qiang1218 on GitHub (Nov 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7645

Originally assigned to: @jessegross on GitHub.

What is the issue?

I am using a 4090D GPU with the model mannix/llamax3-8b-alpaca, keep_alive set to 5m, and I send a request to /api/generate every 3 seconds.
When the CPU reaches 100%, the 'keep_alive=5m' parameter on subsequent requests no longer takes effect: after the countdown the model gets stuck in the 'Stopping' state, and requests keep timing out.
Snipaste_2024-11-13_16-34-27 (https://github.com/user-attachments/assets/f13dc3aa-5496-40a9-80bb-95bb2c4e98d4)
Snipaste_2024-11-13_16-31-58 (https://github.com/user-attachments/assets/011b9d85-ca2b-483e-99c2-9dace528b483)

This can only be resolved by restarting the ollama service.
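
For reference, a minimal sketch of the request pattern described above. The model name, keep_alive value, 20s client timeout, and 3-second interval come from this report; the prompt is an illustrative assumption and the URL is Ollama's default local endpoint.

```python
import time

import requests  # assumes the 'requests' package is installed

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

while True:
    try:
        # keep_alive=5m asks the server to keep the model loaded for 5 minutes
        # after this request finishes; num_predict is deliberately not set,
        # matching the original report.
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": "mannix/llamax3-8b-alpaca",
                "prompt": "Hello",  # illustrative prompt, not from the report
                "stream": False,
                "keep_alive": "5m",
            },
            timeout=20,  # the reporter later mentions a 20s client-side timeout
        )
        print(resp.json().get("response", ""))
    except requests.exceptions.Timeout:
        print("request timed out")
    time.sleep(3)  # one request every 3 seconds, as described
```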

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.4.1

GiteaMirror added the topbug labels 2026-05-04 08:54:00 -05:00

@qiang1218 commented on GitHub (Nov 13, 2024):

Request log
Snipaste_2024-11-13_16-40-45 (https://github.com/user-attachments/assets/9286f436-c71f-4c18-a79d-d0ddd43a4269)


@dhiltgen commented on GitHub (Nov 13, 2024):

If you have a reproduction scenario, can you run the server with OLLAMA_DEBUG=1 and share the logs around the time when it first goes into the Stopping... state?


@qiang1218 commented on GitHub (Nov 14, 2024):

"ollama ps" Output screenshot
Snipaste_2024-11-14_09-45-10
"top" output screenshot
Snipaste_2024-11-14_09-45-31
debug logs file
ollama.log


@jessegross commented on GitHub (Nov 14, 2024):

It's possible that this may be fixed already - if you are able to build from main and report back, that would be very helpful. Otherwise, we may have an RC for the next release soon.


@qiang1218 commented on GitHub (Nov 15, 2024):

> It's possible that this may be fixed already - if you are able to build from main and report back, that would be very helpful. Otherwise, we may have an RC for the next release soon.

The new version v0.4.2-rc1 has the same issue.


@jessegross commented on GitHub (Nov 21, 2024):

More fixes have gone into 0.4.3 - please test. If the problem persists, please also provide the model that you are running when the problem occurs - see the comments in #7779 for additional context.


@qiang1218 commented on GitHub (Nov 25, 2024):

> More fixes have gone into 0.4.3 - please test. If the problem persists, please also provide the model that you are running when the problem occurs - see the comments in #7779 for additional context.

Version 0.4.4 still has the same issue.


@jessegross commented on GitHub (Nov 25, 2024):

Thanks for testing. At this point, it sounds like this is actually a little different from the other issues that we had in this area. If possible, can you answer the following:

  • Are you still getting output from any of the generate requests?
  • Does this happen with any other models?
  • Do you know if it happens with Ollama 0.3.x?

@qiang1218 commented on GitHub (Nov 26, 2024):

The first request times out (request timeout = 20s), and at the same time CPU usage exceeds 100%. The model's UNTIL countdown starts, and after about one minute new requests can receive output (occasional timeouts may still occur).

image (https://github.com/user-attachments/assets/e6bcbdd8-6494-46b3-982c-0aa62d0aa778)

After a while, the log shows that the timed-out requests eventually returned results (responses taking over 2 minutes). CPU usage still exceeds 100%, the model remains in the 'Stopping' state, and new requests can still obtain results.

image (https://github.com/user-attachments/assets/e17cdbac-b7f4-4195-ad78-b6d07650687a)

I have tried qwen2:7b and qwen2.5:7b, but the issue only occurs with mannix/llamax3-8b-alpaca. It may be related to how threads are managed when the model starts. I have only tested on 0.4.x.


@jessegross commented on GitHub (Nov 26, 2024):

My guess at this point is that it is related to the model - it's just continuing to generate text forever. Sometimes this can happen if the template or stop tokens are not correctly set up. One option to work around this is to use num_predict, which will set a limit on how much the model will generate.

Ollama won't unload a model as long as there is at least one request pending.
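
As an illustration of that workaround, a request can bound generation length through the num_predict option. This is a hedged sketch, not the reporter's exact request: the prompt and the 512-token cap are illustrative assumptions (the reporter later used 4096).

```python
import requests  # assumes the 'requests' package is installed

# Sketch of the suggested workaround: cap how many tokens the model may
# generate so a runaway generation cannot keep a request pending forever.
resp = requests.post(
    "http://localhost:11434/api/generate",  # default Ollama endpoint
    json={
        "model": "mannix/llamax3-8b-alpaca",
        "prompt": "Hello",  # illustrative prompt, not from the thread
        "stream": False,
        "keep_alive": "5m",
        "options": {"num_predict": 512},  # illustrative cap; the reporter later used 4096
    },
    timeout=60,
)
print(resp.json().get("response", ""))
```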


@qiang1218 commented on GitHub (Nov 27, 2024):

Thank you for the solution.
After setting the 'num_predict' parameter to a smaller value (such as 4096) and adjusting the request interval, the log shows the output of the timed-out request about 50 seconds after a timeout occurs, and CPU usage drops. The 'keep_alive' in new requests now takes effect immediately, the model returns to normal, and it can be unloaded.

Reference: github-starred/ollama#66933