[GH-ISSUE #13929] Need a suggestion : Is this a good idea to load model forever in VRAM? #34874

Closed
opened 2026-04-22 18:48:38 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @ajmal-yazdani on GitHub (Jan 27, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13929

Hi,

We have GPU hardware setup in on-prem env. and this will be in customer network like any other VM and mostly it will be powered on.

I have multiple instance of ollama pod running and wanted to keep them into VRAM and don't want to unload.

I see this gives benefits like save time to load the model into VRAM.

- name: OLLAMA_KEEP_ALIVE value: "-1" # Keep models in VRAM indefinitely (never unload) - name: OLLAMA_MAX_LOADED_MODELS value: "0" # No limit on number of loaded models (unlimited)

Is this correct settings ? shall we do this ? Is there other issue if I am keeping all time memory laded into VRAM?, etc.

Please suggest.

Originally created by @ajmal-yazdani on GitHub (Jan 27, 2026). Original GitHub issue: https://github.com/ollama/ollama/issues/13929 Hi, We have GPU hardware setup in on-prem env. and this will be in customer network like any other VM and mostly it will be powered on. I have multiple instance of ollama pod running and wanted to keep them into VRAM and don't want to unload. I see this gives benefits like save time to load the model into VRAM. ` - name: OLLAMA_KEEP_ALIVE value: "-1" # Keep models in VRAM indefinitely (never unload) - name: OLLAMA_MAX_LOADED_MODELS value: "0" # No limit on number of loaded models (unlimited)` Is this correct settings ? shall we do this ? Is there other issue if I am keeping all time memory laded into VRAM?, etc. Please suggest.
GiteaMirror added the question label 2026-04-22 18:48:38 -05:00
Author
Owner

@rick-github commented on GitHub (Jan 27, 2026):

These settings are correct. Note that models may still be unloaded if there is insufficient free VRAM when a new model is loaded. There are no issues with keeping models loaded indefinitely.

<!-- gh-comment-id:3803442239 --> @rick-github commented on GitHub (Jan 27, 2026): These settings are correct. Note that models may still be unloaded if there is insufficient free VRAM when a new model is loaded. There are no issues with keeping models loaded indefinitely.
Author
Owner

@ajmal-yazdani commented on GitHub (Jan 27, 2026):

Thanks @rick-github for us VRAM size is fixed model is fixed (example llama 8b). so llama is around 5GB and I am giving some hard room around 3 GB (example), then I am using only 90% of VRAM. So if if I have 90GB VRAM then I am running max 11 instance of ollama pod. (11*8, rest I am not utilizing)

Question, do you think even there is a case of insufficient free VRAM?

<!-- gh-comment-id:3803757221 --> @ajmal-yazdani commented on GitHub (Jan 27, 2026): Thanks @rick-github for us VRAM size is fixed model is fixed (example llama 8b). so llama is around 5GB and I am giving some hard room around 3 GB (example), then I am using only 90% of VRAM. So if if I have 90GB VRAM then I am running max 11 instance of ollama pod. (11*8, rest I am not utilizing) Question, do you think even there is a case of insufficient free VRAM?
Author
Owner

@rick-github commented on GitHub (Jan 27, 2026):

If you are only loading one model and controlling context size, that should be fine. But then why set OLLAMA_MAX_LOADED_MODELS?

<!-- gh-comment-id:3803870824 --> @rick-github commented on GitHub (Jan 27, 2026): If you are only loading one model and controlling context size, that should be fine. But then why set `OLLAMA_MAX_LOADED_MODELS`?
Author
Owner

@ajmal-yazdani commented on GitHub (Jan 27, 2026):

yeah, this is OLLAMA_MAX_LOADED_MODELS for our future implementation where I have one chat and one embed model and wanted to keep both on VRAM.

But let please give me more information on context controlling? and how model size and inferencing will impact this ?

<!-- gh-comment-id:3803955394 --> @ajmal-yazdani commented on GitHub (Jan 27, 2026): yeah, this is `OLLAMA_MAX_LOADED_MODELS` for our future implementation where I have one chat and one embed model and wanted to keep both on VRAM. But let please give me more information on context controlling? and how model size and inferencing will impact this ?
Author
Owner

@rick-github commented on GitHub (Jan 27, 2026):

If you make the model available via the Ollama API, clients can send parameters in the request that will alter the context size. This will cause a model reload and a requirement for more VRAM if the context size is increased. If you make the model available via the OpenAI API, clients cannot change the context size and the model will use the default value or whatever you configure in OLLAMA_CONTEXT_LENGTH.

<!-- gh-comment-id:3804002221 --> @rick-github commented on GitHub (Jan 27, 2026): If you make the model available via the Ollama API, clients can send parameters in the request that will alter the context size. This will cause a model reload and a requirement for more VRAM if the context size is increased. If you make the model available via the OpenAI API, clients cannot change the context size and the model will use the default value or whatever you configure in `OLLAMA_CONTEXT_LENGTH`.
Author
Owner

@ajmal-yazdani commented on GitHub (Jan 27, 2026):

aah. Interesting one. I don't have any restrictions and both Open API and non APT's are available via NGINX proxy.

By the way what's the default OLLAMA_CONTEXT_LENGTH ?

<!-- gh-comment-id:3804097175 --> @ajmal-yazdani commented on GitHub (Jan 27, 2026): aah. Interesting one. I don't have any restrictions and both Open API and non APT's are available via NGINX proxy. By the way what's the default `OLLAMA_CONTEXT_LENGTH` ?
Author
Owner
<!-- gh-comment-id:3804339314 --> @rick-github commented on GitHub (Jan 27, 2026): https://github.com/ollama/ollama/blob/main/docs/faq.mdx#how-can-i-specify-the-context-window-size
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#34874