[GH-ISSUE #1339] MacOS opens kernel tasks doesn't unload model #47209

Closed
opened 2026-04-28 03:26:00 -05:00 by GiteaMirror · 11 comments

Originally created by @igorcosta on GitHub (Dec 1, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1339

One of the things that makes me cringe is that, when swapping between models, it never releases the memory when I'm done using one. It just piles up, and eventually I have to restart my Mac.

Would memory optimisation be a target for the next release?


@pdevine commented on GitHub (Dec 1, 2023):

@igorcosta the memory for each model is released after 5 minutes, and it should also be released when you load a different model. Can you be more specific about what you're seeing?


@tjthejuggler commented on GitHub (Dec 1, 2023):

Is there any way to not have to wait 5 minutes? I'm on Ubuntu, and every time I have to go in and manually `sudo kill` ollama in order to get my memory back.


@jakenvac commented on GitHub (Dec 15, 2023):

I'm on macOS and came to this issue wondering why there was still memory allocated after I ended the `ollama run` command. I wasn't aware of the 5-minute rule.

I can confirm this works as @pdevine describes and is perfectly acceptable behavior in my mind.


@realazizk commented on GitHub (Dec 16, 2023):

I use other models like Whisper alongside ollama, so I do this to unload the model when I need VRAM; it might be useful to some people.

```shell
kill -SIGUSR1 $(pgrep -f ollama)
```

Code: https://github.com/mohamed-aziz/ollama/commit/c9a1ee8f1c1e3e464d36d81089b7b7d827675846


@Natfan commented on GitHub (Jan 25, 2024):

I actually have the opposite issue: I would like my ollama models to be cached for longer. Is there any way I can programmatically get the model currently loaded (other than scanning the log), so I can poll `/api/embeddings` every few minutes to keep it up and running?


@mxyng commented on GitHub (Jan 25, 2024):

> so I can poll /api/embeddings every few minutes to keep it up and running?

Yes, any of `/api/embeddings`, `/api/chat`, or `/api/generate` will work for this.

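A minimal sketch (not from the thread) of what that polling could look like, assuming Ollama is serving on the default localhost:11434 and using the Python `requests` library. Per the Ollama API docs, a request with only the model name and no prompt loads the model without generating text, which resets the keep-alive timer:

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "llama2"          # placeholder: whichever model you want kept loaded
POLL_SECONDS = 240        # poll more often than the default 5-minute unload timer

while True:
    # Posting only the model name (no prompt) loads/keeps the model in memory
    # without producing any output.
    requests.post(OLLAMA_URL, json={"model": MODEL}, timeout=60)
    time.sleep(POLL_SECONDS)
```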

@tjthejuggler commented on GitHub (Jan 26, 2024):

@mxyng is there anything that will unload the ollama model immediately, so I don't have to do a `sudo kill` and can use it again right away?


@mxyng commented on GitHub (Jan 26, 2024):

With #2146, you can set `keep_alive: 0` to immediately unload the model once the request is complete.


@tjthejuggler commented on GitHub (Jan 29, 2024):

> With #2146, you can set `keep_alive: 0` to immediately unload the model once the request is complete

Thanks so much! How exactly can this be used? I tried it in Python code and as an argument in a terminal command, and both ways it just doesn't run at all: no error, just no LLM output.


@OliChase404 commented on GitHub (Feb 28, 2024):

This might help, from the faq.md file:

## How do I keep a model loaded in memory or make it unload immediately?

By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the `keep_alive` parameter with the `/api/generate` and `/api/chat` API endpoints to control how long the model is left in memory.

The `keep_alive` parameter can be set to:

* a duration string (such as "10m" or "24h")
* a number in seconds (such as 3600)
* any negative number, which will keep the model loaded in memory (e.g. -1 or "-1m")
* '0', which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory use:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": -1}'
```

To unload the model and free up memory use:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'
```

----------------------------------------------

So to immediately unload a model and free your VRAM, run:

```shell
curl http://localhost:11434/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'
```

Replace MODELNAME with the name of the model currently loaded.
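
For the earlier question about doing this from Python rather than curl, a minimal sketch (not from the thread) that sends the same unload request with the `requests` library; the endpoint is the default localhost:11434 and MODELNAME is the same placeholder as above:

```python
import requests

MODEL = "MODELNAME"  # placeholder: replace with the model that is currently loaded

resp = requests.post(
    "http://localhost:11434/api/generate",   # default Ollama endpoint
    json={"model": MODEL, "keep_alive": 0},  # keep_alive: 0 unloads right after the request
    timeout=60,
)
resp.raise_for_status()
```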

@AbdulmalekAI commented on GitHub (Aug 6, 2024):

> This might help, from the faq.md file:
>
> ## How do I keep a model loaded in memory or make it unload immediately?
>
> By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the `keep_alive` parameter with the `/api/generate` and `/api/chat` API endpoints to control how long the model is left in memory. The `keep_alive` parameter can be set to:
>
> * a duration string (such as "10m" or "24h")
> * a number in seconds (such as 3600)
> * any negative number, which will keep the model loaded in memory (e.g. -1 or "-1m")
> * '0', which will unload the model immediately after generating a response
>
> For example, to preload a model and leave it in memory use:
>
> ```shell
> curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": -1}'
> ```
>
> To unload the model and free up memory use:
>
> ```shell
> curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'
> ```
>
> ----------------------------------------------
>
> So to immediately unload a model and free your VRAM, run:
>
> ```shell
> curl http://localhost:11434/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'
> ```
>
> Replace MODELNAME with the name of the model currently loaded.

I used `keep_alive` in this call, but it's not working: `ollm = OllamaLLM(model="aya", keep_alive=0, temperature=0.1)`.
Is this parameter working for the `/api/generate` and `/api/chat` API endpoints only?


Reference: github-starred/ollama#47209