[GH-ISSUE #4855] Environment variable OLLAMA_MAX_LOADED_MODELS does not seem to work #28831

Closed
opened 2026-04-22 07:23:15 -05:00 by GiteaMirror · 9 comments

Originally created by @troy256 on GitHub (Jun 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4855

What is the issue?

We are setting OLLAMA_MAX_LOADED_MODELS=4 in our systemd override file for the ollama service:
![image](https://github.com/ollama/ollama/assets/48829375/b09c1dda-a196-4b89-b349-a92bf2280b76)

We then run "systemctl daemon-reload" and restart the service. The other environment variables appear to be working. However, when I run "ollama ps" I only ever see one model loaded, even if I send API requests to different models in quick succession:

![image](https://github.com/ollama/ollama/assets/48829375/c911640b-54de-43b1-9381-4d1c4522c56f)

Maybe I am using the environment variable wrong or have a misunderstanding of how it works. Thank you for taking a look.
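
For context, the setup is roughly equivalent to the sketch below (the drop-in path and the `systemctl edit` workflow follow the usual systemd conventions and are not a copy of our exact file; the screenshot above shows the actual contents):

```
# Create/edit a drop-in for the ollama service (typically opens
# /etc/systemd/system/ollama.service.d/override.conf on a default install):
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_MAX_LOADED_MODELS=4"

# Apply the change and restart, then check what stays resident:
sudo systemctl daemon-reload
sudo systemctl restart ollama
ollama ps
```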

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.41

GiteaMirror added the bug, needs more info labels 2026-04-22 07:23:16 -05:00

@pdevine commented on GitHub (Jun 6, 2024):

What kind of GPU are you using, and how much VRAM does it have? If you don't have enough VRAM for both models, it will unload the one in memory and load the other one.
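
For example, a quick way to check (assuming the NVIDIA tooling is installed) is to compare free VRAM against the sizes `ollama ps` reports:

```
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
ollama ps   # the SIZE column is roughly how much memory each loaded model needs
```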


@troy256 commented on GitHub (Jun 7, 2024):

That could very well be the case; it is not a powerful GPU. Is it not possible for it to take advantage of system memory when GPU memory is exhausted, albeit at lower performance? I saw this in the docs at https://github.com/ollama/ollama/blob/main/docs/faq.md:

![image](https://github.com/ollama/ollama/assets/48829375/58c5ced1-b09e-4ee9-a2bd-1fb48aea8744)


@pdevine commented on GitHub (Jun 7, 2024):

@troy256 yes, it will do that; however, it's generally faster to load the other model back onto the GPU if it will fit, because hybrid CPU/GPU mode results in very slow output. You can test this on the model you're using with the `/set parameter num_gpu X` command, which lets you specify the number of layers to offload to the GPU.
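
For example, inside an interactive `ollama run` session (the layer count below is just an illustration; pick a value that fits your VRAM):

```
ollama run llama3
>>> /set parameter num_gpu 20    # offload 20 layers to the GPU; the rest run on the CPU
>>> /show parameters             # confirm the setting took effect
```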


@pdevine commented on GitHub (Jun 7, 2024):

I'll go ahead and close the issue since it's working as intended, but feel free to keep commenting.


@EJStaats commented on GitHub (Jun 14, 2024):

I came across a similar (or perhaps the same) problem. I have 12 GB of VRAM and am trying to run llama3 and phi3 mini simultaneously. It seemed like it wasn't working.

In the end it came down to a balancing act between `OLLAMA_MAX_LOADED_MODELS` and `OLLAMA_NUM_PARALLEL`. As you crank up `OLLAMA_NUM_PARALLEL` (at least in my testing), Ollama reserves a bit more memory for that model.
Example: if I set `OLLAMA_NUM_PARALLEL=4` and then run llama3, the output of `ollama ps` shows 6.6 GB as the llama3 "size". If I turn it down to `OLLAMA_NUM_PARALLEL=2` and then run llama3, `ollama ps` shows the size as 5.8 GB.

So, with `OLLAMA_NUM_PARALLEL=4` and `OLLAMA_MAX_LOADED_MODELS=2`, I was unable to load both models simultaneously because of the memory requirements.

The variables do seem to be working as expected.
I settled on `OLLAMA_MAX_LOADED_MODELS=2` and `OLLAMA_NUM_PARALLEL=2`, which works for my config. This lets me have one of each model loaded simultaneously, or two instances of the same model, but not more than that.
Perhaps the only question I have is whether `OLLAMA_NUM_PARALLEL` can be set on a per-model basis. That would allow running multiple small models in parallel while also keeping just one larger model loaded, e.g. 3x phi3 and 1x llama3.
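
A rough way to reproduce the test described above (the model names and the default local port are illustrative):

```
# Hit two different models nearly at once, then see what stays resident.
curl -s http://localhost:11434/api/generate \
     -d '{"model":"llama3","prompt":"hello","stream":false}' > /dev/null &
curl -s http://localhost:11434/api/generate \
     -d '{"model":"phi3","prompt":"hello","stream":false}' > /dev/null &
wait
ollama ps   # with enough free VRAM and OLLAMA_MAX_LOADED_MODELS >= 2, both should appear
```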


@sogawa-sps commented on GitHub (Jun 28, 2024):

Are `OLLAMA_MAX_LOADED_MODELS` and `OLLAMA_NUM_PARALLEL` supposed to work on Windows? That doesn't seem to be the case for me; however, maybe I just don't have enough GPU RAM.


@Terristen commented on GitHub (Jun 29, 2024):

I have Windows 11, a 4090 (24 GB VRAM), and 192 GB of system RAM, and I am trying to keep several 4-6 GB models loaded in memory. It is definitely swapping, despite having `OLLAMA_MAX_LOADED_MODELS` set to 3 in the system variables. These are Modelfiles with different base models, not just 3 Modelfiles pointing at the same model blob.

I've restarted Ollama after changing the environment variables, and the models are still not being kept in memory. I'd love clarification on these settings to know whether they're supposed to be working on Winblows.
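
Roughly what I set, expressed as PowerShell (a sketch, not my exact steps; the machine-wide scope needs an elevated shell, and Ollama has to be fully quit from the tray and started again before it picks the variable up):

```
[Environment]::SetEnvironmentVariable("OLLAMA_MAX_LOADED_MODELS", "3", "Machine")
# after restarting the app, check what stays loaded:
ollama ps
```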


@EJStaats commented on GitHub (Jun 29, 2024):

Sorry folks, I don't know the Windows-related answers. I should have mentioned in my comment that I'm running Ollama on Ubuntu under WSL2; I'd highly recommend this route for any Windows users. I avoided the Windows build of Ollama after seeing it was a "preview" build.


@sogawa-sps commented on GitHub (Jun 30, 2024):

We probably need to create a separate ticket for Windows, then. @Terristen, could you please report it?

Reference: github-starred/ollama#28831