[GH-ISSUE #12783] Allow a loaded model to be used with less context size than what it was initialized for. #8480

Open
opened 2026-04-12 21:10:39 -05:00 by GiteaMirror · 2 comments

Originally created by @Aelentel on GitHub (Oct 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12783

Here's the issue we are encountering in our production system.

Let's say we're using the GPT-OSS model (but any model will do):

  1. Application A makes a chat call to the GPT-OSS model with a context size of 20k tokens.
    1.1. The model is loaded, everything is fine.
  2. Application B makes a chat call to the GPT-OSS model with a context size of 8k tokens.
    2.1. The GPT-OSS/20k-context instance is unloaded and a GPT-OSS/8k-context instance is loaded (and bad things happen on the performance side); see the reproduction sketch below.
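
For reference, a minimal reproduction of the scenario, assuming a local Ollama server on its default port; the model tag and prompt are placeholders:

```python
# Reproduction sketch: two clients hitting the same model with different
# num_ctx values. Assumes a local Ollama server on the default port and a
# placeholder model tag.
import time
import requests

OLLAMA = "http://localhost:11434"

def chat(num_ctx: int) -> float:
    """Send one /api/chat request and return the elapsed time in seconds."""
    start = time.monotonic()
    requests.post(f"{OLLAMA}/api/chat", json={
        "model": "gpt-oss",  # placeholder model tag
        "messages": [{"role": "user", "content": "hello"}],
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }, timeout=600).raise_for_status()
    return time.monotonic() - start

print(f"app A (20k ctx): {chat(20480):.1f}s")  # loads the model
print(f"app B (8k ctx):  {chat(8192):.1f}s")   # forces an unload/reload
print(f"app A (20k ctx): {chat(20480):.1f}s")  # reloads yet again
```

Every alternation pays the full load latency even though the weights are identical; only the KV cache allocation differs.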

So here's the feature request; either:

1- Allow a model to be used with a context size smaller than the one it was initialized with (sketched after this list),

or

2- Allow the same model to be loaded multiple times with different context sizes.
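
To make option 1 concrete, the scheduler-side check it implies could look roughly like this. This is a hypothetical sketch with made-up names, not Ollama's actual scheduler code:

```python
# Hypothetical sketch of option 1: reuse an already-loaded instance when
# the requested context fits inside the one it was initialized with.
# Names and structure are illustrative, not Ollama's real scheduler.
from dataclasses import dataclass

@dataclass
class LoadedModel:
    name: str
    num_ctx: int  # context size the KV cache was allocated for

def can_reuse(loaded: LoadedModel, model: str, requested_ctx: int) -> bool:
    # Today the check is effectively equality; option 1 relaxes it to <=,
    # since a request needing 8k tokens fits in a 20k-token KV cache.
    return loaded.name == model and requested_ctx <= loaded.num_ctx

gpt_oss = LoadedModel("gpt-oss", num_ctx=20480)
assert can_reuse(gpt_oss, "gpt-oss", 8192)       # no reload needed
assert not can_reuse(gpt_oss, "gpt-oss", 32768)  # still needs a reload
```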

Why this feature request:

1- the unload/free and reload/alloc cycle can take a few seconds, during which all applications using that model are stuck waiting for it;
2- this leads to load/unload thrashing and a very low performance level;
3- in some organizations the GPU servers are shared, and we may NOT be able to align all the parameters across applications.

As a side note, I really hope option 1 is feasible, as it'll be the more memory- and performance-efficient of the two, but option 2 would also be a very good contender against load/unload latency when multiple apps use the same model.

I know we can alias a model by changing its ID and/or creating multiple copies of the same model with a Modelfile, but that's effectively the same as option 2 and needs a lot of communication across the organization. A sketch of that workaround follows.
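
For completeness, the aliasing workaround looks roughly like this (model names are placeholders): one Modelfile per context size, each creating a separate tag over the same weights.

```
# Modelfile.8k -- aliases the same weights with a pinned context size.
# Create the alias with: ollama create gpt-oss-8k -f Modelfile.8k
FROM gpt-oss
PARAMETER num_ctx 8192
```

Each application then targets its own alias (e.g. gpt-oss-8k vs. gpt-oss-20k). That avoids the mismatch-triggered reloads but, as noted, behaves like option 2: each alias gets its own instance in memory, and every client has to be told which alias to use.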

GiteaMirror added the feature request label 2026-04-12 21:10:39 -05:00

@rick-github commented on GitHub (Oct 26, 2025):

Related: https://github.com/ollama/ollama/pull/10003


@ordex commented on GitHub (Jan 9, 2026):

Just an upvote from me :)
I've been hitting this "perpetual model reload" myself and couldn't wrap my head around it.

After reading various issues, I realized that the culprit is the `num_ctx` parameter changing value from time to time.
In my setup this happens because Home Assistant uses `num_ctx=8192` and open-webui uses `num_ctx=4096`.

I tried increasing `num_ctx` to 8192 in the account settings, but it seems that open-webui still makes an extra request to `/api/chat` with the default value (4096), which triggers a reload again.

I don't have the knowledge to judge whether this PR takes the right approach or not, but this is definitely hurting when different applications use the same model with different params.
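
As a quick diagnostic, you can confirm what the server currently has loaded by querying the `/api/ps` endpoint (the exact fields in the response vary across Ollama versions, so this just dumps the raw JSON):

```python
# Dump what the local Ollama server currently has loaded. Assumes the
# default port; exact response fields vary across Ollama versions.
import json
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
# Run this before and after each client's request: if the loaded entry
# is replaced between calls, a num_ctx mismatch forced a reload.
```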
