[GH-ISSUE #1005] Improved context window size management #488

Open
opened 2026-04-12 10:09:59 -05:00 by GiteaMirror · 9 comments
Owner

Originally created by @jmorganca on GitHub (Nov 4, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1005

Context window size is largely manual right now – it can be specified via `{"options": {"num_ctx": 32768}}` in the API or via `PARAMETER num_ctx 32768` in the Modelfile. Otherwise the default value is `2048` (some models in the [library](https://ollama.ai/) will use a larger context window size by default).

Context size should be determined dynamically at runtime based on the amount of memory available.
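
For reference, a minimal sketch of the per-request path described above, using the `requests` library and assuming a local Ollama server on the default port; the model name and prompt are examples only. The persistent alternative is `PARAMETER num_ctx 32768` in a Modelfile, as noted above.

```python
import requests

# Per-request override: pass num_ctx in the options object of /api/generate.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",                  # example model name
        "prompt": "Summarize this document...",
        "stream": False,
        "options": {"num_ctx": 32768},      # overrides the 2048-token default
    },
)
print(response.json()["response"])
```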

GiteaMirror added the memory, feature request, and performance labels 2026-04-12 10:09:59 -05:00
Author
Owner

@nevakrien commented on GitHub (Jan 30, 2024):

Is there a way to run with no limitation? I am aware this is probably a bad idea, but I need to run one prompt with no limitation. This is for a science project, so I need to be 100% sure there is no hidden truncation.

This would be done with Mixtral... I am probably crashing my PC or mapping memory to disk space, but it's just for a few runs, so I can run it on a cluster.
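
For that use case, the closest thing to "no limit" today is pinning `num_ctx` to the model's full training context for the one request and then verifying that nothing was dropped. A sketch with the `requests` library, assuming a local server on the default port and assuming 32768 is the training context of the Mixtral variant in use:

```python
import requests

# One-off request with num_ctx raised to the model's training context, so the
# 2048-token default can't silently truncate the prompt.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral",
        "prompt": open("big_prompt.txt").read(),
        "stream": False,
        "options": {"num_ctx": 32768},  # assumed training context for this model
    },
).json()

# The final non-streaming response reports how many prompt tokens were actually
# evaluated; if this is far below the prompt's token count, truncation happened.
print(resp["prompt_eval_count"], "prompt tokens evaluated")
```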

Author
Owner

@tomdavenport commented on GitHub (Apr 3, 2024):

Upvote!

Author
Owner

@robertvazan commented on GitHub (May 25, 2024):

Llama3 will readily write responses to simple questions that are 700 tokens long, so a 2048-token context can be exhausted by the third turn. People rarely tinker with Modelfiles to fix this. Ollama is intended to work out of the box. It needs a default context length that does not cripple models. I suggest implementing the following algorithm:

  1. Start with context size the model was trained on. I believe this is already in model metadata.
  2. Clamp context to model size (e.g. max 4GB of context for 4GB model) to deal with context-extended models that declare huge training context. The idea is that users most likely want to balance model size and context size.
  3. If possible, fit model+context in GPU VRAM minus some 2GB for the desktop. Otherwise fit model+context in 60% of system RAM. The idea is to avoid costly overflow from VRAM to RAM and from RAM to SSD/swap.
  4. Always prefer longer context over loading multiple models or keeping around multiple contexts. The idea is that Ollama must work well with single model/context before trying to run concurrent chats.
  5. If the context is too small after applying the above rules, set it to some reasonable minimum that nobody would consider excessive, for example 10% of model size.

What do you think? Would this work for most people?

PS: This reminds me that `num_predict` probably shouldn't default to 128, which cripples models as well.
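
A rough sketch of this heuristic, with the 2 GiB desktop reserve, the 60% RAM share, and the 10%-of-model-size floor taken from the proposal above; the per-token KV-cache cost (`kv_bytes_per_token`) and the argument names are hypothetical inputs the caller would have to supply:

```python
def pick_num_ctx(training_ctx, model_bytes, kv_bytes_per_token,
                 vram_free_bytes, ram_total_bytes):
    """Pick a default context length along the lines of the proposal above."""
    GiB = 1 << 30

    # 1. Start from the context length the model was trained on (model metadata).
    ctx = training_ctx

    # 2. Clamp the context's memory footprint to roughly the model's own size,
    #    so context-extended models don't demand huge caches by default.
    ctx = min(ctx, model_bytes // kv_bytes_per_token)

    # 3. Prefer fitting model + context in VRAM minus ~2 GiB for the desktop;
    #    otherwise fit them in 60% of system RAM.
    if model_bytes < vram_free_bytes - 2 * GiB:
        budget = vram_free_bytes - 2 * GiB
    else:
        budget = int(0.6 * ram_total_bytes)
    ctx = min(ctx, max(budget - model_bytes, 0) // kv_bytes_per_token)

    # 5. Never go below a floor worth ~10% of the model size in cache.
    floor = (model_bytes // 10) // kv_bytes_per_token
    return max(ctx, floor)
```

Step 4 (preferring one long context over several concurrent models or contexts) is a scheduling policy rather than a per-model calculation, so it is not reflected in this single-model sketch.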

Author
Owner

@StarPet commented on GitHub (Jun 16, 2024):

Can someone explain to me why I need to set the context window size? It is a property of the model. Hence it should be part of the GGUF's content.

Author
Owner

@robertvazan commented on GitHub (Jun 16, 2024):

@StarPet A large context window costs a lot of memory, which is severely limited on GPUs and to some degree also on CPUs. You don't want the large context to cause your model to overflow from GPU to system RAM or even to SSD (via paging). We need something smarter.
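
For a sense of scale, a back-of-the-envelope KV-cache estimate; the model shape used here (32 layers, 8 KV heads, head dimension 128, fp16 values, roughly a Llama-3-8B-class model) is an assumption for illustration only:

```python
def kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size: a key and a value per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx

for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
# ~0.25 GiB at 2048 tokens, ~1 GiB at 8192, ~4 GiB at 32768
```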

Author
Owner

@StarPet commented on GitHub (Jun 17, 2024):

> @StarPet A large context window costs a lot of memory, which is severely limited on GPUs and to some degree also on CPUs. You don't want the large context to cause your model to overflow from GPU to system RAM or even to SSD (via paging). We need something smarter.

Thanks Robert. Though it may be reasonable to reduce the context window size from the model's max in those cases where it would consume too much memory and slow things down, it should still be a property of the model itself which Ollama could provide as a reference point. AFAIK, I currently cannot query the model's max using the list() API. Having this information would be interesting for use cases such as creating a summary of a larger text chunk.
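
For what it's worth, newer Ollama builds do expose the trained context length through the `/api/show` endpoint (under `model_info`, keyed by architecture); a sketch, assuming a build where that field is present and using `llama3` purely as an example model name:

```python
import requests

info = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "llama3"},
).json()

# The trained context length is reported per architecture,
# e.g. "llama.context_length" for Llama-family models.
arch = info["model_info"]["general.architecture"]
print("max trained context:", info["model_info"][f"{arch}.context_length"])
```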

Author
Owner

@alaeddine-hash commented on GitHub (Jul 12, 2024):

How can I fix this without restarting the Ollama service? I am using gemma:2b.
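
If the goal is only to change the context size without touching the service, passing `num_ctx` in the request options avoids editing the Modelfile and does not require restarting the daemon; a sketch using the `ollama` Python client (the values are examples, and 8192 is an assumed trained context for gemma:2b):

```python
import ollama

# Per-request num_ctx: Ollama reloads the model with the new setting on demand,
# so there is no need to restart the service or edit the Modelfile.
resp = ollama.generate(
    model="gemma:2b",
    prompt="Hello",
    options={"num_ctx": 8192},  # example value; assumed trained context for gemma:2b
)
print(resp["response"])
```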

Author
Owner

@nikhil-swamix commented on GitHub (Oct 22, 2024):

A simple binary search to find the maximum number of tokens would be sufficient: let the model make full use of memory, with an upper limit set as a percentage, e.g. something like -max_mem allowing 95% of GPU memory for context expansion. This deadly issue (which I come back to every week to check whether it has been fixed) has been absolutely ignored... #5949 #2927 #2442 (cross-linking)

Experiences:

  • an educational institute faced consistently poor results because a parsed PDF overflowed the context! (I had suggested the Gemini free tier with its 1M-token context)
  • not every application that uses Ollama can set this itself, so a Python-based proxy that modifies chat-completion options is being developed, similar to LiteLLM... (a side effect of many things being non-configurable / insane defaults)
  • if you want some extra horsepower, post the engineering challenges in Discussions; people will pick them up, rather than waiting for them to open issues. (first principles)

yours painfully,
Swamix.
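
A sketch of the binary-search idea: it assumes a probe `fits_in_memory(num_ctx)` that loads (or estimates) the model with that context and checks usage against the 95%-of-GPU-memory cap; the probe itself is the hard part and is left hypothetical here.

```python
def max_num_ctx(fits_in_memory, lo=2048, hi=131072):
    """Largest num_ctx in [lo, hi] that still fits, assuming fits_in_memory
    is monotone (once a context stops fitting, larger ones never fit either)."""
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits_in_memory(mid):
            best, lo = mid, mid + 1   # fits: try a larger context
        else:
            hi = mid - 1              # too big: try a smaller context
    return best

# Example probe with hypothetical numbers: model weights plus KV cache must
# stay under 95% of total GPU memory.
gpu_total, model_bytes, kv_per_token = 24 << 30, 8 << 30, 128 << 10
print(max_num_ctx(lambda ctx: model_bytes + ctx * kv_per_token <= 0.95 * gpu_total))
```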

Author
Owner

@montvid commented on GitHub (Nov 27, 2024):

One year later and this is still not fixed :P Could someone at least update the FAQ with this instruction so people can find out how to configure it? https://github.com/ollama/ollama/issues/5965#issuecomment-2252354726

Reference: github-starred/ollama#488