[GH-ISSUE #3009] feat: add "unload model" command/endpoint #63885

Closed
opened 2026-05-03 15:19:02 -05:00 by GiteaMirror · 7 comments

Originally created by @knoopx on GitHub (Mar 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3009

There's no way to unload a model from VRAM other than killing/restarting ollama, which requires local system access and privileges. Given that ollama is mostly used on resource-limited devices, a command/API endpoint would be fantastic.

GiteaMirror added the feature request label 2026-05-03 15:19:02 -05:00

@phong-phuong commented on GitHub (Mar 8, 2024):

From what I've seen, the model gets unloaded after a certain amount of inactivity, or when you switch models.


@commit4ever commented on GitHub (Mar 9, 2024):

https://github.com/ollama/ollama/blob/ecc133d843c8567b27ff3bdc9ff811ecad99281a/docs/faq.md?plain=1#L189

use keep_alive param

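For context, the FAQ describes `keep_alive` as a per-request setting: a duration string or a number of seconds controls how long the model stays resident after the call. A minimal sketch (the model name `llama2` is a placeholder for whatever you have pulled):

```
# Keep the model loaded for 5 minutes after this request (duration string).
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "keep_alive": "5m"
}'

# Equivalent, expressed as a plain number of seconds.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "keep_alive": 300
}'
```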

@knoopx commented on GitHub (Mar 9, 2024):

> https://github.com/ollama/ollama/blob/ecc133d843c8567b27ff3bdc9ff811ecad99281a/docs/faq.md?plain=1#L189
>
> use keep_alive param

The FAQ line being pointed at says:

  • any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")

That does the opposite, actually: a negative keep_alive keeps the model loaded instead of unloading it.


@alexandervnuchkov commented on GitHub (Mar 11, 2024):

It would be great if this default behavior could be set via an environment variable (for Docker), if that is possible.

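As a side note, a server-wide default did land later via the OLLAMA_KEEP_ALIVE environment variable (documented in the ollama FAQ), which suits the Docker case. A sketch, assuming the standard image and port from the ollama README:

```
# Set the default keep_alive for every model this server loads.
docker run -d \
  -e OLLAMA_KEEP_ALIVE=5m \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```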

@commit4ever commented on GitHub (Mar 11, 2024):

I use a fixed interval in seconds and that has worked well.


@pdevine commented on GitHub (Mar 12, 2024):

Hey @knoopx, you can actually do this by calling `curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'` (not with `-1`, which will always leave the model loaded). That will immediately unload the model without doing any generation.

You can confirm this with:

```
lsof -n | grep ollama | grep blobs | awk "{ print \$9 }"
```

This will show which blob is loaded into memory, and then when you hit the generate endpoint you can see that the blob was unloaded. I'm going to go ahead and close the issue, but feel free to keep commenting.

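Putting the pieces of that comment together, an end-to-end check might look like this (again, `llama2` is a placeholder):

```
# 1. Load the model by generating something.
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "hi"}'

# 2. Confirm the blob is mapped: this prints the open blob paths.
lsof -n | grep ollama | grep blobs | awk '{ print $9 }'

# 3. Unload immediately: keep_alive 0 and no prompt, so nothing is generated.
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'

# 4. Re-run the check; the blob should no longer be listed.
lsof -n | grep ollama | grep blobs | awk '{ print $9 }'
```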

@knoopx commented on GitHub (Mar 14, 2024):

> Hey @knoopx, you can actually do this by calling `curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'` (not with `-1`, which will always leave the model loaded). That will immediately unload the model without doing any generation.
>
> You can confirm this with:
>
> ```
> lsof -n | grep ollama | grep blobs | awk "{ print \$9 }"
> ```
>
> This will show which blob is loaded into memory, and then when you hit the generate endpoint you can see that the blob was unloaded. I'm going to go ahead and close the issue, but feel free to keep commenting.

Nice as a workaround, but it can still load a model unnecessarily: the generate call has to name a model, and naming one that isn't resident loads it just to unload it.

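For anyone reading this mirror later: newer ollama releases grew first-class commands for exactly this, so the generate-endpoint trick is no longer needed if your build has them (check `ollama --help`):

```
# Available in later ollama releases:
ollama ps            # list models currently loaded into memory
ollama stop llama2   # unload the named model without generating anything
```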