[GH-ISSUE #10086] Long running ollama processes and high CPU usage #6611

Closed
opened 2026-04-12 18:16:35 -05:00 by GiteaMirror · 4 comments

Originally created by @somera on GitHub (Apr 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10086

### What is the issue?

Since December 2024 I have been running Ollama in an Ubuntu VM on Proxmox. I started with Ollama v0.5.3 and am now on v0.6.3.

Sometimes I see that VRAM is in use, but the GPU is only at ~10% utilization while CPU usage is above 70%.

When this happens I see one or more long-running ollama processes along with RAM/VRAM usage.

![Image](https://github.com/user-attachments/assets/8c6aab76-3fdb-413e-80e1-d88deb2eca71)

![Image](https://github.com/user-attachments/assets/f8eed46e-bdda-487f-8206-6646929bb1e0)

or

![Image](https://github.com/user-attachments/assets/eb4fb735-9792-40da-ac70-2e85f3aa4569)

![Image](https://github.com/user-attachments/assets/91734f59-29b4-4273-bc72-5b79f63dcb76)

or

![Image](https://github.com/user-attachments/assets/0f35fd15-0690-4863-b829-144f6f431448)

![Image](https://github.com/user-attachments/assets/e1af6250-16bd-44c7-8295-ec643cec5b77)

In these cases I restart the Ollama service.

Is this a known problem?

What can I do to get more information about what is happening here?

### OS

Linux

### GPU

Nvidia

### CPU

AMD

### Ollama version

0.6.3

GiteaMirror added the bug label 2026-04-12 18:16:35 -05:00

@rick-github commented on GitHub (Apr 2, 2025):

If a model can't fit entirely in VRAM, part of the model is loaded in system RAM. ollama uses the GPU for inference for the part of the model in VRAM and the CPU for the part of the model in system RAM. The GPU is much faster than the CPU, so it completes its inference quickly. Then the CPU starts doing its inference, and it takes longer. This leads to low GPU utilization and high CPU utilization.
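
To see how a loaded model is split, the server exposes a `/api/ps` endpoint that reports each running model's total size and how much of it is resident in VRAM. A minimal sketch in Python, assuming a default local install listening on `localhost:11434`:

```python
import requests

# Inspect how much of each loaded model is resident in VRAM.
# Assumes a default Ollama install listening on http://localhost:11434.
resp = requests.get("http://localhost:11434/api/ps", timeout=5)
resp.raise_for_status()

for model in resp.json().get("models", []):
    total = model["size"]         # total bytes occupied by the model
    in_vram = model["size_vram"]  # bytes resident in GPU memory
    pct = 100 * in_vram / total if total else 0
    print(f"{model['name']}: {pct:.0f}% of {total / 2**30:.1f} GiB in VRAM")
```

If `size_vram` is well below `size`, part of every inference runs on the CPU, which matches the low-GPU/high-CPU pattern above.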

In some inferences, the model can become confused and continues to generate tokens without hitting an end-of-sequence token. This can lead to the situation where the model is in "Stopping..." state for a long time - the model is scheduled for unloading but ollama won't interrupt a model that's in the middle of an inference. This can be mitigated by setting [`num_predict`](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values:~:text=stop%20%22AI%20assistant%3A%22-,num_predict,-Maximum%20number%20of) in the API call or Modelfile.
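
A minimal sketch of setting it per request, with `llama3.2` and the 256-token cap as stand-in example values:

```python
import requests

# Cap generation length so a runaway model cannot generate indefinitely.
# "llama3.2" and the 256-token cap are example values, not recommendations.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_predict": 256},  # stop after at most 256 tokens
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```

The Modelfile equivalent is `PARAMETER num_predict 256`.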


@somera commented on GitHub (Apr 2, 2025):

@rick-github thx for the response. I will test it.


@somera commented on GitHub (Apr 2, 2025):

> In some inferences, the model can become confused and continues to generate tokens without hitting an end-of-sequence token. This can lead to the situation where the model is in "Stopping..." state for a long time - the model is scheduled for unloading but ollama won't interrupt a model that's in the middle of an inference.

Or could you add an optional parameter to Ollama that says: if a model runs for longer than xx minutes/hours, it gets interrupted?
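
Until something like that exists, a rough client-side workaround is possible: stream the response and close the connection once a wall-clock deadline passes, which should cause the server to abort the in-flight generation. A sketch, with an arbitrary 120-second limit and `llama3.2` as an example model name:

```python
import json
import time
import requests

MAX_SECONDS = 120  # arbitrary example deadline

# Stream tokens and bail out once the deadline passes; leaving the
# "with" block closes the connection, which signals the server to
# abort the in-flight generation.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Tell me a story.", "stream": True},
    stream=True,
) as resp:
    resp.raise_for_status()
    start = time.monotonic()
    for line in resp.iter_lines():
        if time.monotonic() - start > MAX_SECONDS:
            break
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
```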


@somera commented on GitHub (Apr 28, 2025):

I'm closing this, since it's the same as #10433.
