[GH-ISSUE #10183] Understanding context length #6681

Closed
opened 2026-04-12 18:24:40 -05:00 by GiteaMirror · 1 comment

Originally created by @Bardo-Konrad on GitHub (Apr 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10183

How do you deal with context length issues?

Let's say a model has a context length of 16k tokens. Do you divide the length of the system prompt + user prompt by 4 (or by 3 for good measure) to estimate tokens, subtract that from the 16k, and assign the result as RequestOptions.NumCtx?

I did, and it makes some models ignore the system prompt, so it doesn't seem to be enough. Or is RequestOptions.NumCtx the number of tokens for input?

What does good pseudocode for using the context length look like, so that it's properly used and managed?

GiteaMirror added the question label 2026-04-12 18:24:40 -05:00

@rick-github commented on GitHub (Apr 8, 2025):

Don't manage the context length. Just set it (in the Modelfile, via OLLAMA_CONTEXT_LENGTH, or in the API call) to the maximum input length you expect plus room for some output tokens, and then make the API calls you need for your inference. Each time you change NumCtx you cause a model reload, which slows the clients down.
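As a rough illustration of "set it once", here is a minimal Python sketch that picks one fixed context size and sends it unchanged with every `/api/chat` request through `options.num_ctx`. The token budgets (12,000 input, 2,000 output), the default local URL, and the helper names `pick_num_ctx`, `build_chat_request`, and `chat` are assumptions for the example, not values from this thread.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server; adjust as needed.
OLLAMA_URL = "http://localhost:11434/api/chat"


def pick_num_ctx(max_input_tokens: int, output_reserve: int) -> int:
    """Fixed context size: largest expected input plus room for the reply."""
    return max_input_tokens + output_reserve


# Decide once at startup and reuse the value, so that requests never
# change num_ctx and therefore never trigger a model reload.
# Both budgets are placeholder numbers for illustration.
NUM_CTX = pick_num_ctx(max_input_tokens=12_000, output_reserve=2_000)


def build_chat_request(model: str, messages: list[dict]) -> dict:
    """Build an /api/chat body that always carries the same num_ctx."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": NUM_CTX},
    }


def chat(model: str, messages: list[dict]) -> dict:
    """Send the request to a running Ollama server and return the parsed reply."""
    body = json.dumps(build_chat_request(model, messages)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `chat("llama3.2", [...])` (the model name is a placeholder) needs a running server; `build_chat_request` can be inspected without one.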

Some notes on how context length affects prompt processing: let's say you have a system prompt S, user message U1, assistant message A1, and more messages U2, A2, U3, where U3 is the newest user message you want to send to the model. Let's also say you have set NumCtx to X, the length of S is 0.3X, and the length of U3 is 0.8X.

When ollama receives the request via /api/chat, it processes the messages through the templating system and then compares the result against the size of the context buffer. In this case, it will determine that the processed messages are longer than the available context, and will drop U1. It will then continue to compute the length and drop messages, trying to make them fit: it will drop A1, then U2, and then A2. Finally it will concatenate S and U3 into prompt P, and then remove enough tokens from the start of P to make it fit in the context buffer. S and U3 together are 1.1X, so 0.1X is cut from the front, and the final prompt used for inference will be the last 0.2X of S combined with U3.

As inference starts and tokens are generated, ollama will see that it's going to run off the end of the context buffer, so it will shift the contents to make room for new tokens. As this proceeds, less and less of S (and eventually U3) is going to be left in the context buffer, so if it contains instructions essential to the output, the quality of the output may decrease. This is also why a model will sometimes lose coherence and start to produce a stream of tokens without generating an end-of-sequence token: shifting the context buffer to make room causes earlier tokens that might have provided guidance to be lost.
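The arithmetic in the scenario above can be replayed with a small Python sketch. It is a simplified model of the behaviour described in this comment, not ollama's actual code: it works on token counts only, ignores template overhead, and the function name `fit_to_context` and the 50-token lengths for U1, A1, U2, and A2 are assumptions made for the example.

```python
def fit_to_context(lengths: dict[str, int], order: list[str], num_ctx: int) -> dict[str, int]:
    """Return how many tokens of each message survive truncation.

    lengths: token count per message name.
    order: message names from oldest to newest, system prompt first.
    """
    kept = list(order)
    # Drop the oldest non-system message one at a time until the rest fits
    # or only the system prompt and the newest message remain.
    while sum(lengths[m] for m in kept) > num_ctx and len(kept) > 2:
        del kept[1]  # index 0 is the system prompt
    surviving = {m: lengths[m] for m in kept}
    # Still too long: cut the overflow from the start of the joined prompt.
    overflow = sum(surviving.values()) - num_ctx
    for m in kept:
        if overflow <= 0:
            break
        cut = min(surviving[m], overflow)
        surviving[m] -= cut
        overflow -= cut
    return surviving


X = 1000  # stands in for NumCtx
lengths = {"S": 300, "U1": 50, "A1": 50, "U2": 50, "A2": 50, "U3": 800}
result = fit_to_context(lengths, ["S", "U1", "A1", "U2", "A2", "U3"], X)
print(result)  # {'S': 200, 'U3': 800}
```

With X = 1000, only 200 tokens of S (0.2X) survive alongside the full U3, and every earlier turn is gone before generation even starts.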

Reference: github-starred/ollama#6681