[GH-ISSUE #1782] Model kept unloading no matter what #47531

Closed
opened 2026-04-28 04:06:36 -05:00 by GiteaMirror · 11 comments

Originally created by @TheCowboyHermit on GitHub (Jan 4, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1782

Greetings, I have modified `ollama/server/routes.go` to set the following variable:

```go
var defaultSessionDuration = 1440 * time.Minute
```

However, when running Ollama, it kept unloading the **exact same** model over and over for every single invocation of the /api/generate endpoint. This is visible in the nvtop CLI, where I can watch host memory climb first before the GPU finally has the model loaded.

This makes Ollama very impractical for a production environment, since it takes a significant amount of time to load the model for each and every API invocation. It should be noted that this is **NOT** running from Docker; that is an intentional decision.

Is there an alternative recommendation to work around this?

Please and thank you.


@TheCowboyHermit commented on GitHub (Jan 8, 2024):

Ok, I assume this is intended behavior.


@DaveMayfield commented on GitHub (Jan 8, 2024):

I have the same experience, and the same concerns about performance. Is there a way to avoid the delay?


@pdevine commented on GitHub (Jan 8, 2024):

@TechScribe-Deaf The behaviour right now is to unload after 5 minutes. That was really a compromise because some people want it to unload immediately, and others want it to never unload.

I've been looking at making it so that you can specify how long you want it loaded for as part of the call to `/api/generate` or `/api/chat`. It takes a parameter called `keep_alive`, given as a duration (e.g. `20m`). If a different model is requested, the current model is unloaded immediately and the other model is loaded.
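
For concreteness, here is a minimal sketch of what such a request could look like, assuming the parameter lands as a plain JSON field on the existing endpoint. The model name, prompt, and the default port 11434 are illustrative assumptions, not from this thread.

```go
// Minimal sketch of the proposed keep_alive request, assuming it ships as a
// JSON field on /api/generate. Model name, prompt, and port are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":      "llama2",
		"prompt":     "Why is the sky blue?",
		"stream":     false,
		"keep_alive": "20m", // keep the model loaded for 20 minutes after this call
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```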


@TheCowboyHermit commented on GitHub (Jan 8, 2024):

But what if we have sufficient VRAM to support multiple models on the same device? Do we need to spin up separate instances of Ollama?

I think, realistically, there should be an option to fully disable **ANY** unloading of the model.


@pdevine commented on GitHub (Jan 8, 2024):

@TechScribe-Deaf yeah, that's part of the problem. There are potentially a lot of corner cases, particularly around multiple models and scheduling.


@RSLLES commented on GitHub (Jan 10, 2024):

I would appreciate having this feature as well. @pdevine, have you been able to implement it and submit a pull request? Thank you for your hard work.


@noahhaon commented on GitHub (Jan 11, 2024):

> It takes a parameter called `keep_alive`, given as a duration (e.g. `20m`). If a different model is requested, the current model is unloaded immediately and the other model is loaded.

@pdevine is this currently supported, and if so, is it documented? My use case is using this with Continue (a co-pilot replacement for VSCode), and having the model unload regularly significantly impacts performance, as you may imagine. Ideally I could pass a flag to `ollama serve` to keep the model loaded indefinitely (unless another model is called, of course), but a similar parameter on the request would work as well.


@pdevine commented on GitHub (Jan 11, 2024):

> @pdevine is this currently supported, and if so, is it documented? My use case is using this with Continue (a co-pilot replacement for VSCode), and having the model unload regularly significantly impacts performance, as you may imagine. Ideally I could pass a flag to `ollama serve` to keep the model loaded indefinitely (unless another model is called, of course), but a similar parameter on the request would work as well.

It's just something I was tinkering around with. If it's useful we can get it in, but there are a lot of corner cases here so I want to make sure the UI is correct.


@TheBitmonkey commented on GitHub (Jan 13, 2024):

@TechScribe-Deaf "there should be a full disablement of ANY unloading of the model as an option." well just to prove @pdevine correct (a fellow Devine here), I came here to find the opposite. I very much need the ability to unload a model on command. I have limited vram and need to load up other (Stable Diffusion stuff), as a workaround I can just wait out the 5 minutes if I am not there, but that is not a great long term solution.

An option to load a model indefinitely unless a "release model command" or "load new model command" is sent would be cool.

edit: Another "option" which may be possible would be to create a tiny fake model that essentially just drops the current model from memory replacing it with nothing. Does anyone know how to make a fake tiny model?
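
A hedged sketch of that "release model command" idea, expressed through the `keep_alive` parameter pdevine described above rather than a fake model: setting `keep_alive` to 0 should drop the model immediately. The zero-duration semantics is an assumption based on later Ollama documentation, not something confirmed in this thread.

```go
// Hypothetical "release model" call: a generation request with no prompt and
// keep_alive set to 0, so the model unloads right away. The zero-duration
// behaviour is assumed from later Ollama docs.
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

func unloadModel(host, model string) error {
	body, _ := json.Marshal(map[string]any{
		"model":      model,
		"keep_alive": 0, // assumed: zero duration unloads immediately
	})
	resp, err := http.Post(host+"/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	_ = unloadModel("http://localhost:11434", "llama2")
}
```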


@mholtzhausen commented on GitHub (Jan 16, 2024):

~~I'm using Ollama primarily as an API. It should be simple to create an endpoint that resets the timeout when called. The same mechanism could be reused in the CLI. Those of us who want a long-running session can cron a call to that endpoint.~~

Fairly simple: I can just call a completion endpoint with a short instruction for a short response, like "respond only with 'yes'".
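
A sketch of that keep-warm idea as a small Go loop instead of a cron job, assuming the 5-minute unload window described earlier in the thread. The model name and server address are placeholder assumptions.

```go
// Keep-warm loop: every 4 minutes (inside the assumed 5-minute unload
// window) send a trivial completion so the session timer resets.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

func main() {
	payload, _ := json.Marshal(map[string]any{
		"model":  "llama2",
		"prompt": "respond only with 'yes'",
		"stream": false,
	})
	for range time.Tick(4 * time.Minute) {
		// Build a fresh reader each iteration; a single reader would be
		// drained after the first request.
		resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(payload))
		if err != nil {
			log.Println("keep-warm ping failed:", err)
			continue
		}
		resp.Body.Close()
	}
}
```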


@pdevine commented on GitHub (Jan 28, 2024):

#2146 has merged, so I'm going to go ahead and close this. You'll be able to use it in `0.1.23`.
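
For anyone landing here later: per the API documentation published after `0.1.23` (not stated in this thread itself), `keep_alive` also accepts a negative value to keep the model loaded indefinitely, which covers the "never unload" request above. A compact sketch, with model name and address as placeholders:

```go
// Sketch of the merged feature with the "never unload" setting. The negative
// keep_alive semantics is taken from Ollama docs published after 0.1.23.
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

type generateRequest struct {
	Model     string `json:"model"`
	Prompt    string `json:"prompt"`
	Stream    bool   `json:"stream"`
	KeepAlive int    `json:"keep_alive"`
}

func main() {
	body, _ := json.Marshal(generateRequest{
		Model:     "llama2",
		Prompt:    "hello",
		KeepAlive: -1, // negative: keep loaded until another model replaces it
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err == nil {
		resp.Body.Close()
	}
}
```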
