[GH-ISSUE #10605] Load the same base model multiple times #32736

Closed
opened 2026-04-22 14:34:27 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @louistiti on GitHub (May 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10605

Hi friends, first of all, thanks for this masterpiece! It really helps abstract away the different layers of running local LLMs, and the community is active.

Regarding this request, I looked around and found various env variables, but I don't think they fit my needs. Of course, please correct me if I'm wrong; I'm new to Ollama.

So here is the thing:

I use Qwen3:4b. I'm trying to load the same base model twice, under different identifiers. I created the two models from a simple Modelfile via:

ollama create "qwen3:4b-model1" -f ./qwen3-4b
ollama create "qwen3:4b-model2" -f ./qwen3-4b

The Modelfile content is:

```
FROM qwen3:4b
```

I need to run a completion on model1 and another completion on model2. However, since they share the same base model, it seems that after the completion on model1, Ollama unloads it and then loads model2.

The completions differ between model1 and model2: different context sizes, temperatures, prompts, etc. Hence, I want to make use of the KV cache for each of these models.

How can I achieve this? Thanks! 😃

GiteaMirror added the feature request label 2026-04-22 14:34:27 -05:00
Author
Owner

@rick-github commented on GitHub (May 7, 2025):

Ollama doesn't support loading the same model more than once (#3902). In your config, set the context size ([`OLLAMA_CONTEXT_LENGTH`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size)) to the largest of the required values and then just send the requests without an explicit `num_ctx`. Differences in temperature, prompts, etc. will be handled as individual requests and will be cached accordingly.
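
For illustration, a minimal sketch of that setup using the JavaScript SDK (the context value, prompts, and temperatures are placeholders, not from the thread). It assumes the server was started with the largest required context, e.g. `OLLAMA_CONTEXT_LENGTH=8192 ollama serve`:

```js
// Both requests omit num_ctx, so they run against the single loaded
// instance; per-request options such as temperature are applied
// individually, and requests sharing a prompt prefix can reuse the KV cache.
import ollama from 'ollama'

await ollama.generate({
  model: 'qwen3:4b',
  prompt: 'First question...',
  options: { temperature: 0 },   // no num_ctx here
})
await ollama.generate({
  model: 'qwen3:4b',
  prompt: 'Second question...',
  options: { temperature: 0.7 }, // still no num_ctx
})
```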

Author
Owner

@louistiti commented on GitHub (May 8, 2025):

Hi @rick-github,

Thanks for your reply. I tried what you described, but with no success so far. Maybe I missed something?

I modified my Modelfile as below:

```
FROM qwen3:4b

PARAMETER num_ctx 4096
```

I created the model via:

```bash
ollama create "qwen3:4b-custom" -f ./modelFile
```

Then I ran two completions using the JavaScript SDK:

```js
import ollama from 'ollama'

console.time('first completion')
await ollama.generate({
  model: 'qwen3:4b-custom',
  system: 'A long prompt...',
  prompt: `User Query: "${userQuery}"`,
  keep_alive: -1,
  stream: false,
  options: {
    temperature: 0,
    num_predict: 32
  }
})
console.timeEnd('first completion')

console.time('second completion')
await ollama.generate({
  model: 'qwen3:4b-custom',
  system: 'A smaller prompt...',
  prompt: `User Query: "${userQuery}"`,
  keep_alive: -1,
  stream: false,
  options: {
    temperature: 0,
    num_predict: 8
  }
})
console.timeEnd('second completion')
```

As we can see below, when I run these completions a second time, they do not seem to hit the KV cache:

![Image](https://github.com/user-attachments/assets/825f4aee-cd1d-4ef1-8fc3-f08aefe38dea)

However, if I only run the first completion via the code below:

```js
import ollama from 'ollama'

console.time('first completion')
await ollama.generate({
  model: 'qwen3:4b-custom',
  system: 'A long prompt...',
  prompt: `User Query: "${userQuery}"`,
  keep_alive: -1,
  stream: false,
  options: {
    temperature: 0,
    num_predict: 32
  }
})
console.timeEnd('first completion')
```

In this case it hits the KV cache on the second run:

![Image](https://github.com/user-attachments/assets/a0e835d0-52d9-41c9-9c15-c5ff748967c0)

Between runs, the system prompt remains unchanged; only the prompt may vary with the userQuery.

So why does the first case not hit the KV cache, and how can I make it do so? 😃

**Edit**: the reason is that the system prompt differs between the first and second completions. But I need to keep these system prompts separate rather than merging them into one very large system prompt and losing accuracy. That's why I'd like to load two models. So how can I achieve this result, or a similar one?

Author
Owner

@louistiti commented on GitHub (May 8, 2025):

Hi @rick-github, I'm bumping this. Do you know anybody else who could help with this? 🙏

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Run two ollama servers.
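
That works because each server process loads its own copy of the model. As a minimal sketch with the JavaScript SDK, assuming a second server started on another port (e.g. `OLLAMA_HOST=127.0.0.1:11435 ollama serve`; the host and port values here are illustrative):

```js
// Two clients, one per server; each server holds its own loaded model,
// and therefore its own KV cache for its (unchanging) system prompt.
import { Ollama } from 'ollama'

const serverA = new Ollama({ host: 'http://127.0.0.1:11434' }) // default server
const serverB = new Ollama({ host: 'http://127.0.0.1:11435' }) // second server

await serverA.generate({
  model: 'qwen3:4b-custom',
  system: 'A long prompt...',
  prompt: 'User Query: "..."',
  keep_alive: -1, // keep the model (and its cache) resident
})
await serverB.generate({
  model: 'qwen3:4b-custom',
  system: 'A smaller prompt...',
  prompt: 'User Query: "..."',
  keep_alive: -1,
})
```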

Author
Owner

@louistiti commented on GitHub (May 8, 2025):

> Run two ollama servers.

That's a smart move! It works well this way.

So does that mean that, in Ollama's current state, this is not possible without launching a second server? I wonder if that's something the team is considering 🤔

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

#3902

Author
Owner

@louistiti commented on GitHub (May 8, 2025):

Alright, we can close this then. Thanks for your help.
