[GH-ISSUE #8390] API is not giving up memory after response is complete #5386

Closed
opened 2026-04-12 16:36:47 -05:00 by GiteaMirror · 2 comments

Originally created by @sriyan-a on GitHub (Jan 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8390

What is the issue?

Environment Info

  1. 2024 Mac Mini M4 16GB
  2. Model: phi4:latest
  3. `OLLAMA_HOST=0.0.0.0` and Open WebUI is running as a Docker container - the container is on a different machine on the network, so it should have no bearing on what I'm seeing on my Mac.

Steps to reproduce

In the WebUI and after doing `ollama run phi4:latest`:

  1. Prompt 1: I provide a long essay (1000 words) and a prompt to rewrite it better.
  2. Prompt 2: Ask it to do it again, because the model gives only bullet points instead of an essay.
  3. Prompt 3: Instruct the model to add new insights and check the style of writing.

While executing the above steps, I watch RAM usage in Activity Monitor on my Mac.
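To isolate what Ollama itself is holding, rather than overall system memory, `ollama ps` lists the models currently loaded and when each is due to be unloaded; a minimal check, assuming a stock install with the server on its default port:

```sh
# Show models currently held in memory; the UNTIL column indicates
# when each is scheduled to be unloaded (by default, five minutes
# after the last request).
ollama ps
```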

Expected Behavior

RAM usage spikes while the model is thinking and responding, but bottoms out when no question is being asked or after a response is complete.

Actual Behavior

In the terminal alone, it works as expected.

(Can post screenshots if needed)

Test in Open WebUI (which calls the API):

  1. Just after posting the question:

![new-question-asked](https://github.com/user-attachments/assets/f2ba47f0-3c69-4d73-bf0a-9646d06f30a3)

  2. After response is complete:

![image](https://github.com/user-attachments/assets/8ec52688-0aee-4e1a-a3e8-438a939861dc)

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.5.4

GiteaMirror added the bug label 2026-04-12 16:36:47 -05:00

@rick-github commented on GitHub (Jan 12, 2025):

ollama loads the model when it receives a request. By default, the model is kept loaded for 5 minutes. You can change this behaviour with [`keep_alive`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately).
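For anyone calling the API directly, a minimal sketch of overriding that timeout per request with the `keep_alive` field described in the FAQ above, assuming a default install on localhost:11434 (model and prompt are just examples):

```sh
# Unload the model as soon as this response completes (keep_alive: 0).
# A duration string such as "10m" keeps it loaded that long instead,
# and a negative value keeps it loaded indefinitely.
curl http://localhost:11434/api/generate -d '{
  "model": "phi4:latest",
  "prompt": "Rewrite this essay.",
  "keep_alive": 0
}'
```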


@sriyan-a commented on GitHub (Jan 12, 2025):

I see. Open WebUI has settings where I can update that as well. Got it. Thanks!
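For reference, the same default can also be changed server-wide instead of per client via the `OLLAMA_KEEP_ALIVE` environment variable; a sketch assuming the macOS Ollama app and the `launchctl` pattern the FAQ uses for environment variables:

```sh
# Server-wide default for how long models stay loaded after a request.
# "0" unloads immediately after each response; a negative value keeps
# models loaded indefinitely; durations like "10m" also work.
launchctl setenv OLLAMA_KEEP_ALIVE "0"
# Restart the Ollama app for the change to take effect.
```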
