[GH-ISSUE #7041] Variable OLLAMA_MAX_LOADED_MODELS is being ignored #50979

Closed
opened 2026-04-28 17:44:11 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @hdnh2006 on GitHub (Sep 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7041

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hello!

Thanks for this fantastic framework you have created.

I'm using Ollama to deploy my private LLMs, but the environment variable seems to be ignored. I'm not sure if I'm setting it incorrectly, and I'd appreciate any help troubleshooting the issue.

I am setting `Environment="OLLAMA_MAX_LOADED_MODELS=2"` (or 3 or 4; the number doesn't matter) in the config file `/etc/systemd/system/ollama.service`, and after restarting the service with `sudo systemctl daemon-reload` and `sudo systemctl restart ollama`, only one model is loaded.
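For reference, a minimal sketch of applying that override as a systemd drop-in instead of editing the unit file directly (the variable and value mirror the setup above; `systemctl show` is just one way to confirm the variable actually reached the service):

```sh
# Hypothetical drop-in; equivalent to Environment=... in ollama.service itself.
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama

# Confirm the variable is set on the running service:
sudo systemctl show ollama --property=Environment
```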

I have an RTX 4060 with 16 GB of VRAM, and I also have 64 GB of RAM available to run a second model on the CPU, but this doesn't seem to work.

Am I missing something here? Is it possible to load a model on both the CPU and GPU at the same time?
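For what it's worth, one way to check how loaded models are placed is `ollama ps` (assuming a client recent enough to have it):

```sh
# Lists loaded models; the PROCESSOR column reports the placement,
# e.g. "100% GPU" or "52%/48% CPU/GPU" for a partially offloaded model.
ollama ps
```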

Thanks in advance.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

ollama version is 0.3.12

GiteaMirror added the feature request label 2026-04-28 17:44:11 -05:00
Author
Owner

@rick-github commented on GitHub (Sep 30, 2024):

Ollama currently prioritizes loading models in the GPU. It will evict models from the GPU to load a new one if they won't all fit; if they do fit, it will load up to `OLLAMA_MAX_LOADED_MODELS` models in the GPU. To force loading a model into RAM, you need to set `num_gpu` to 0. See https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650 for details (and the other comments in that ticket for other thoughts on model management).
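A minimal sketch of forcing a model onto the CPU this way, assuming a local server on the default port (the model name `llama3.1` is a stand-in for whatever is installed):

```sh
# Request-level option: num_gpu=0 offloads no layers to the GPU,
# so this model loads entirely into RAM and runs on the CPU.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello",
  "options": { "num_gpu": 0 }
}'
```

The same option should also work baked into a model via `PARAMETER num_gpu 0` in a Modelfile, or set interactively with `/set parameter num_gpu 0` inside `ollama run`.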

Reference: github-starred/ollama#50979