[GH-ISSUE #976] Suggestion: Option to "Save / Cache model in RAM" for faster switching #62513

Closed
opened 2026-05-03 09:19:09 -05:00 by GiteaMirror · 4 comments

Originally created by @ziontee113 on GitHub (Nov 2, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/976

Hi, thank you all so much for the amazing project.

Today I was testing out using multiple models at the same time, and the switching speed is surprisingly acceptable.
I symlinked my models to my HDD. The initial load for each model is slow, but once it's loaded, I can use & switch back and forth between models with only a few seconds delay.

Throughout the usage & switching process I noticed that Ollama isn't using any of my RAM at all. Maybe if we had the option to cache frequently switched models in RAM, switching could be made even faster.

GiteaMirror added the feature request label 2026-05-03 09:19:09 -05:00

@nleve commented on GitHub (Jan 30, 2024):

Most OSes already do this automatically pretty well, as you noted. Wouldn't the preexisting OS-level disk caching be just as good as, and more robust than, a custom caching solution?

The models that I load from disk most often are already the ones that will be cached by the OS in RAM... and those are by definition the models that I use most often.

To manage the filesystem cache more actively, this looks like a neat tool: https://github.com/hoytech/vmtouch
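
As a minimal sketch (assuming the models live under the default `~/.ollama/models` store; locking pages may require raising the `memlock` ulimit or running as root):

```sh
# Show how much of the model store is currently resident in the page cache
vmtouch ~/.ollama/models/blobs

# Read every page into the cache ("touch" it)
vmtouch -t ~/.ollama/models/blobs

# Lock the pages in memory and daemonize, so they stay resident
# until the background vmtouch process is killed
vmtouch -dl ~/.ollama/models/blobs
```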


@19h commented on GitHub (Feb 7, 2024):

Llama.cpp has an mlock option, which I always use when working with Llama.cpp directly. Ollama should expose it to enable exactly this.
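
For reference, this is roughly how it looks when calling llama.cpp directly (the model path is just a placeholder; depending on the build, the binaries may be named `llama-cli` / `llama-server`):

```sh
# --mlock asks the OS to keep the whole model resident in RAM,
# so it can't be swapped out or dropped from the page cache
./main -m ./models/llama-2-7b.Q4_K_M.gguf --mlock -p "Hello"

# the same flag works for the llama.cpp HTTP server
./server -m ./models/llama-2-7b.Q4_K_M.gguf --mlock
```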


@NinjaPerson24119 commented on GitHub (Feb 7, 2024):

This is a problem for me too. Ollama is basically useless because whenever I want to make a query, it takes at least a minute to load everything into memory / init.

The behavior I need is for it to keep the last model(s) in memory until I try to use a different one that there's no space for; at that point it should evict the old one.


@pdevine commented on GitHub (May 14, 2024):

With concurrency you can do this now. Set the `OLLAMA_MAX_LOADED_MODELS` env variable for `ollama serve` to something greater than one. Set the `OLLAMA_KEEP_ALIVE` flag either to a negative number or some large amount.

The new version of ollama will also include a new `ollama run --keepalive` setting to make this easier. It will also have `ollama ps`, which will easily let you see what models are currently loaded.
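
As a rough sketch of how that fits together (the model name is just an example; exact flags may differ between versions):

```sh
# Allow several models to stay loaded at once and disable the idle unload timer
# (a negative OLLAMA_KEEP_ALIVE keeps models loaded indefinitely)
OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_KEEP_ALIVE=-1 ollama serve

# Or keep a specific model warm for 24 hours from the client side
ollama run --keepalive 24h llama2

# See which models are currently loaded and how long they will stay
ollama ps
```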
