[GH-ISSUE #1307] multiple models at once #676

Closed
opened 2026-04-12 10:21:38 -05:00 by GiteaMirror · 7 comments

Originally created by @iplayfast on GitHub (Nov 28, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1307

I've found that some models are good at coding, while others are good for speaking and others are good for logic.
Some of these models are actually quite small, so two or three could possibly fit into the GPU at the same time (given a high-end GPU).

As an enhancement, it would be good to keep models in memory if possible. That way, if I have several processes running that expect different models, there is no delay while swapping them in and out.

I suggest `/set priority=0-9`,
where priority 0 (the highest) means the model always stays in memory unless it absolutely cannot (i.e. another model needs the space), and
priorities 1-9 mean the model stays in memory unless another model has a higher priority (i.e. a lower number). A hypothetical interaction is sketched below.
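
A hypothetical REPL interaction with such a setting (nothing here exists in Ollama today; `priority` is an invented key, shown only to spell out the proposed semantics):

```sh
# Hypothetical sketch of the proposal: the `priority` key does not exist
# in Ollama; /set is used only to illustrate the idea.
$ ollama run codellama
>>> /set priority=0   # pin: evict only if another model absolutely needs the space
>>> /bye
$ ollama run mistral
>>> /set priority=5   # evictable whenever a lower-numbered (higher-priority) model needs room
```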


@igorschlum commented on GitHub (Nov 29, 2023):

I purchased a 192 GB RAM MacStation to run Ollama with various LLMs. I concur that it would be advantageous to keep them resident in memory if possible.


@gatepoet commented on GitHub (Nov 29, 2023):

My current workaround is to start several ollama servers, then use litellm as a proxy, configuring specific models to specific ollama instances. I also modified `routes.go` line 60 to prevent the model from getting killed too often:

```go
var defaultSessionDuration = 30 * time.Minute
```

It's a bit messy, but it sort of works for now.

But it would be really nice to have ollama manage it.
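
For reference, the multi-server part of this needs no code changes if each server is bound to its own port (a minimal sketch, assuming the standard `OLLAMA_HOST` environment variable; the ports are illustrative):

```sh
# Two independent ollama servers, each holding its own model in memory.
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
# A proxy such as litellm can then route each model name to one instance.
```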


@easp commented on GitHub (Dec 1, 2023):

@igorschlum The model data should remain in RAM in the OS file cache, so switching between models will be relatively fast as long as you have enough RAM.

I just checked with a 7.7 GB model on my 32 GB machine. The first load took ~10 s. I restarted the Ollama app (to kill the ollama-runner) and then did `ollama run` again and got the interactive prompt in ~1 s.
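
That check is straightforward to reproduce (a sketch; the model name is a placeholder):

```sh
# Cold start: the weights are read from disk into the OS page cache.
time ollama run mistral "hello"
# Restart the Ollama server to unload the model, then run again:
# the weights are still in the file cache, so the load is much faster.
time ollama run mistral "hello"
```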


@Lissanro commented on GitHub (Dec 2, 2023):

> My current workaround is to start several ollama servers, then use litellm as a proxy, configuring specific models to specific ollama instances. I also modified `routes.go` line 60 to prevent the model from getting killed too often:
>
> ```go
> var defaultSessionDuration = 30 * time.Minute
> ```
>
> It's a bit messy, but it sort of works for now.
>
> But it would be really nice to have ollama manage it.

Thank you for sharing your workaround! Running multiple ollama servers works to achieve this. The main issue with this workaround is that it does not work with frontends, which usually only use one ollama server; that is why I agree it would be better if ollama managed this itself. For custom scripts, though, using multiple ollama servers works just fine.

For modifying the session duration, there is PR https://github.com/jmorganca/ollama/pull/1257, which allows controlling it with an environment variable. Hopefully it gets accepted, so there would be no need to modify the source code to adjust this setting.
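
A minimal sketch of what such a variable looks like in use (current Ollama builds ship a similar setting named `OLLAMA_KEEP_ALIVE`; whether it traces back to this exact PR is an assumption):

```sh
# Assumes a build that honors OLLAMA_KEEP_ALIVE; it accepts a duration
# string, and -1 keeps loaded models in memory indefinitely.
OLLAMA_KEEP_ALIVE=30m ollama serve
```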


@ishaan-jaff commented on GitHub (Dec 2, 2023):

@gatepoet we provide a config for starting the proxy if you want to serve multiple ollama models too:
https://docs.litellm.ai/docs/simple_proxy#proxy-configs

```yaml
model_list:
  - model_name: zephyr-alpha # the 1st model is the default on the proxy
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-alpha
      api_base: http://0.0.0.0:8001
  - model_name: gpt-4
    litellm_params:
      model: ollama/code-llama
      api_base:
  - model_name: claude-2
    litellm_params:
      model: ollama/llama2
      api_base:
```
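
Starting the proxy then just points at that file (a sketch assuming the CLI shown in the docs linked above; `config.yaml` is a placeholder filename):

```sh
# Launch the litellm proxy with the model routing config above.
litellm --config config.yaml
```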

@shroominic commented on GitHub (Dec 24, 2023):

something native in ollama would be very helpful, as it would enable deploying with this functionality without the client having to do manual work


@CallMeLaNN commented on GitHub (Jun 20, 2024):

I just noticed that `ollama serve` already has this, but it defaults to 1:

```sh
> ollama serve --help
...
Environment Variables:
    ...
    OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models (default 1)
```
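
So keeping several models resident should just be a matter of raising that limit when the server starts (a sketch; the value 3 is illustrative):

```sh
# Allow up to three models to stay loaded at the same time.
OLLAMA_MAX_LOADED_MODELS=3 ollama serve
```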
Reference: github-starred/ollama#676