[GH-ISSUE #1536] Ability to keep a model in memory for longer #62874

Closed
opened 2026-05-03 10:35:06 -05:00 by GiteaMirror · 21 comments

Originally created by @helloimcx on GitHub (Dec 15, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1536

Is there a way to keep the model in memory or GPU memory?

GiteaMirror added the feature request label 2026-05-03 10:35:06 -05:00

@nelsongomes commented on GitHub (Dec 15, 2023):

This is a very important feature, and models should be kept in memory by default. In fact, it makes sense to keep multiple instances of the same model if memory is available and the loaded models are already in use. This way Ollama can be cost-effective and performant @jmorganca.

This is needed to make Ollama a usable server. I just came out of a meeting, and this was the main reason not to choose it: it needs to be cost-effective and performant.


@rgaidot commented on GitHub (Dec 17, 2023):

Maybe you can send a curl request every second (via crontab or similar) with the model (it's not great, but it should work):

curl http://0.0.0.0:11434/api/chat -d '{
    "model": "mistral"
 }'
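
If you want to script that, here is a minimal keep-warm poller along the same lines (my sketch, not an official tool; the model name and interval are arbitrary):

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"
)

// Re-send a minimal /api/chat request on a timer so the server's idle
// timeout (hardcoded to 5 minutes at the time of writing) never fires.
func main() {
	body := []byte(`{"model": "mistral"}`)
	for range time.Tick(60 * time.Second) {
		resp, err := http.Post("http://localhost:11434/api/chat",
			"application/json", bytes.NewReader(body))
		if err != nil {
			log.Println("keep-warm request failed:", err)
			continue
		}
		resp.Body.Close()
	}
}
```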

@mLpenguin commented on GitHub (Dec 17, 2023):

There is a potential solution that changes the timeout from the hardcoded 5 minutes to an environment variable, but it is still waiting to be merged: #1257
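
For illustration, the change could look something like this on the server side (a sketch only; I haven't checked #1257's actual diff, and the variable name OLLAMA_SESSION_DURATION is hypothetical):

```go
package server

import (
	"os"
	"time"
)

// sessionDuration returns how long a model stays loaded after its last
// request. The env variable name is a placeholder, not necessarily what
// PR #1257 uses; the 5-minute fallback is today's hardcoded default.
func sessionDuration() time.Duration {
	if v := os.Getenv("OLLAMA_SESSION_DURATION"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 5 * time.Minute
}
```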


@nelsongomes commented on GitHub (Dec 17, 2023):

Thanks for replying.

The issue is that GPUs are expensive, and most models could have several replicas inside GPU memory. This would allow several threads to run in parallel, answering different requests. Otherwise a 64 GB GPU is almost wasted.

Ideally you should keep adding replicas until memory is full or an instance needs to be evicted to make room for a different model, and keep all instances loaded at all times, with no eviction unless needed.

No one wants a server that loads models on the fly when memory is sufficient, making requests slower.

Do you get my point?

Thanks,
Nelson Gomes


@helloimcx commented on GitHub (Dec 18, 2023):

> There is a potential solution that changes the timeout from the hardcoded 5 minutes to an environment variable, but it is still waiting to be merged: #1257

Thanks, I changed `defaultSessionDuration` in routes.go to a longer period, and it worked.
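
For anyone wanting to replicate that, the patch is essentially a one-line change of this shape (approximate; the exact declaration depends on the Ollama version you build):

```go
package server

import "time"

// Originally: var defaultSessionDuration = 5 * time.Minute
// Bumped so models stay resident much longer (the value is just an example).
var defaultSessionDuration = 24 * time.Hour
```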


@nelsongomes commented on GitHub (Dec 18, 2023):

Ideally, we could have a fixed config file passed to the server:

{
  "mistral": { "instances": 3, "target": "gpu" },
  "llama2": { "instances": 1, "target": "cpu" }
}

That would allow us to set up the intended usage we want for the server.
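
For what it's worth, here is a sketch of how such a file could map onto server-side types (field names simply mirror the JSON above; nothing like this exists in Ollama today):

```go
package server

import (
	"encoding/json"
	"os"
)

// ModelPlacement mirrors one entry of the proposed config file.
type ModelPlacement struct {
	Instances int    `json:"instances"`
	Target    string `json:"target"` // "gpu" or "cpu"
}

// LoadPlacement parses the proposed file: model name -> placement.
func LoadPlacement(path string) (map[string]ModelPlacement, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg map[string]ModelPlacement
	err = json.Unmarshal(data, &cfg)
	return cfg, err
}
```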


@helloimcx commented on GitHub (Dec 19, 2023):

> Ideally, we could have a fixed config file passed to the server: { "mistral": { "instances": 3, "target": "gpu" }, "llama2": { "instances": 1, "target": "cpu" } } that would allow us to set up the intended usage we want for the server.

Yeah, that would be great!


@nelsongomes commented on GitHub (Dec 26, 2023):

Was I able to convince you guys that the memory strategy needs some revisiting?


@nelsongomes commented on GitHub (Dec 26, 2023):

Another option for memory management would be to add commands to manage permanently loaded models and their count... this way we could tell the server what we want in memory.


@jooneyp commented on GitHub (Dec 26, 2023):

@nelsongomes +1, I'm another one who thinks the user should decide the memory-management strategy. We need an env variable, or even a dynamic option for each model. I have two systems, an i7 6700K & 4080 + 3070 LLM machine and an AMD 5600 & 3070 gaming machine, but my gaming machine loads models much faster.


@rgaidot commented on GitHub (Dec 29, 2023):

In the meantime, have you tested this 'solution' (use_mlock)? https://github.com/jmorganca/ollama/issues/1672
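
If you want to try it without editing any code, use_mlock can be passed per request through the options object (a sketch; whether the weights actually stay pinned also depends on the OS memlock limit, e.g. `ulimit -l` on Linux):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// Ask the runner to mlock the weights so the OS can't page them out.
// use_mlock is a llama.cpp-level option exposed through Ollama's "options".
func main() {
	body := []byte(`{"model": "mistral", "prompt": "hello",
		"options": {"use_mlock": true}}`)
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```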


@easp commented on GitHub (Jan 2, 2024):

@nelsongomes, keeping multiple instances of a single model in memory doesn't make any sense, at all. It's a waste of memory and memory bandwidth. Concurrency is better handled by batching requests and amortizing the cost of memory access across the whole batch.

If they are going to add features to support concurrency, they should focus on batching rather than a kludge.


@nelsongomes commented on GitHub (Jan 3, 2024):

That is a valid point. Does Ollama support concurrency? Running it on my Mac it doesn't, but that's expected, because my Mac does not have a usable GPU.


@nelsongomes commented on GitHub (Jan 4, 2024):

@easp I've done more reading about it, and if I understood correctly batching is the way to go, but if a request does not enter the batch in time it needs to wait for the next batch, which may take seconds to start. Could it be that with several models loaded into GPU memory we could have several batches running continuously, making full use of memory and concurrency?


@easp commented on GitHub (Jan 4, 2024):

For a single completion request, tokens are generated serially, with each generated token depending on the initial context + prompt and the preceding tokens generated for the response. Generating each token involves computing, end to end, over the entirety of the model weights. With batch generation, the process is the same, but the next tokens for a set of individual completion threads are calculated in a single traversal of the model weights.

Batches are continuous, so a new request only waits for the current set of "in-flight" tokens to complete before joining the batch for the next token(s). New completions can be started with each new batch, and in-progress completions may conclude with each new batch, even as others continue. So at, say, a batch generation rate of 25 tokens/s (40ms per step), a new request doesn't wait more than 80ms for its first token to be generated.

So, it only makes sense to have a single copy of the model loaded. It does make sense to have multiple models loaded, if resources permit, to allow quick switching between models.

As I said though, Ollama doesn't support this, at least not yet. It is supported by llama.cpp, which Ollama uses to "run" models, but I'd expect it to require some work in the Ollama server as well, and so far Ollama seems pretty focused on single-user scenarios.
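
To make that scheduling concrete, here is a toy simulation of the continuous-batching behavior described above (my sketch, not Ollama's or llama.cpp's actual scheduler): each step stands for one traversal of the weights, and new requests join at the next step boundary instead of waiting for the whole batch to drain.

```go
package main

import "fmt"

// request models one completion: when it arrives and how many tokens it needs.
type request struct {
	id        int
	arrival   int // decode step at which the request arrives
	remaining int // tokens still to generate
}

func main() {
	pending := []request{
		{id: 1, arrival: 0, remaining: 4},
		{id: 2, arrival: 1, remaining: 2}, // joins while 1 is mid-flight
		{id: 3, arrival: 2, remaining: 3},
	}
	var batch []request
	for step := 0; len(pending) > 0 || len(batch) > 0; step++ {
		// Newly arrived requests are admitted at the step boundary,
		// so they wait at most one step for their first token.
		var later []request
		for _, r := range pending {
			if r.arrival <= step {
				batch = append(batch, r)
			} else {
				later = append(later, r)
			}
		}
		pending = later

		// One traversal of the weights yields a token for every
		// active request; finished completions leave the batch.
		var active []request
		for _, r := range batch {
			fmt.Printf("step %d: token for request %d\n", step, r.id)
			r.remaining--
			if r.remaining > 0 {
				active = append(active, r)
			}
		}
		batch = active
	}
}
```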


@jmorganca commented on GitHub (Jan 4, 2024):

Thanks for creating this issue, folks – sorry this isn't easier yet. Will investigate how to better support keeping models loaded for longer.


@jmorganca commented on GitHub (Jan 25, 2024):

Keep an eye on this PR which will enable keeping a model in memory indefinitely (or for a custom amount of time): https://github.com/ollama/ollama/pull/2146
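
Based on the PR description, usage should look roughly like this once it ships: send keep_alive with a request (-1 for indefinite, or a duration string such as "30m"); an otherwise-empty generate request should just load the model. A sketch:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// Load mistral and ask the server to keep it resident indefinitely.
// keep_alive should also accept duration strings such as "30m" or "24h".
func main() {
	body := []byte(`{"model": "mistral", "keep_alive": -1}`)
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```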


@pdevine commented on GitHub (Jan 26, 2024):

This is merged now, so I'm going to go ahead and close the issue. You'll be able to use this in 0.1.23.


@nelsongomes commented on GitHub (Jan 28, 2024):

Thank you!


@sirkuttin commented on GitHub (Apr 22, 2024):

https://github.com/ollama/ollama/blob/62be2050dd83197864d771fe6891fc47486ee6a1/cmd/cmd.go#L1005


@loke-x commented on GitHub (Dec 29, 2024):

> https://github.com/ollama/ollama/blob/62be2050dd83197864d771fe6891fc47486ee6a1/cmd/cmd.go#L1005

It would be great if a lot more control were given via the command line, e.g. for stopping the server, increasing the model's keep-alive time in memory, and more.

Reference: github-starred/ollama#62874