[GH-ISSUE #2714] Misunderstanding of ollama num_ctx parameter and context window #48139

Closed
opened 2026-04-28 06:49:35 -05:00 by GiteaMirror · 27 comments

Originally created by @PhilipAmadasun on GitHub (Feb 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2714

I'm trying to understand the relationship between the context window and the num_ctx parameter. Let's say I'm using mistral, and mistral's max context (according to Google) is 8000, and its "attention span" (according to Google) is 128000. If I have a 27000-token user query, what exactly happens? If I set num_ctx: 4096, does mistral just grab the last 4096-token sequence from the 27K user query, and then process that 4096-token sequence along with the 128K window it grabs from the previously established overall context (in the case of the RESTful API, I'm talking about that body['context'] thing)?

@jmorganca commented on GitHub (Feb 23, 2024):

Hi there,

Two things happen,

  1. If you are using the Chat API, it will only send as many messages as can fit in the context window.
  2. If it's still too big (e.g. a huge user message), then the prompt will roughly be split in half, opening up another 1/2 of the context window for new token generations (and it will continue doing this as tokens are generated)

There's a lot of work to do to improve this further - would love any feedback

Hope this helps
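
To make point 2 above concrete, here is a rough sketch of the idea in Python. It is an illustration only, not Ollama's actual implementation: the token list, the num_keep value, and the helper name are all made up for the example.

```python
def fit_prompt(tokens, num_ctx, num_keep=4):
    """Illustrative only: if the prompt alone overflows num_ctx, keep the
    first num_keep tokens (e.g. a system prefix) plus roughly the last half
    of the window, freeing the other half for newly generated tokens."""
    if len(tokens) <= num_ctx:
        return tokens
    half = (num_ctx - num_keep) // 2
    return tokens[:num_keep] + tokens[-half:]

# A 27,000-token prompt with num_ctx=4096 keeps only ~2k trailing tokens.
kept = fit_prompt(list(range(27_000)), num_ctx=4096)
print(len(kept))  # -> 2050
```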

@PhilipAmadasun commented on GitHub (Feb 23, 2024):

@jmorganca So if the user query is 27K tokens, and mistral's max tokens it can take as input from the current user query is 8K, the 27K will be split to 14K and then to 7K? If so, then we have 4 sets of 7K tokens. Then each set goes in as input to the model one at a time? I'm sorry for my confusion; if possible please use numbers in your explanation so maybe it can be clearer to me.

Just to make sure, when you say "context window" do you mean "attention span"? As in how much of the previous query and answer pairs the model can take in for context?

Or do you mean "context window" as in the maximum amount of tokens from the current user query that the model can take in as input?

I ask in this way because according to the mistral doc (https://huggingface.co/docs/transformers/main/en/model_doc/mistral), mistral has an "8k context length and fixed cache size, with a theoretical attention span of 128K tokens". Not sure what the difference between "context length" and "attention span" means according to the docs.

@luc99hen commented on GitHub (Apr 9, 2024):

I also have this question. There is ambiguity between num_ctx for ollama and context window for a model. Or in other words, could you give some advice on how to set this parameter for a new model? @jmorganca

@fgenie commented on GitHub (May 19, 2024):

Why is this closed w/o any conclusion?

@mitar commented on GitHub (May 23, 2024):

We also had to debug this for days, until we found a hidden num_ctx in the model documentation (not the API documentation) which was artificially lowering the allowed context from the model's 8k to the 2k default, making outputs really bad for our 4k prompt. Why is num_ctx not set to the model's max context by default?
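
For anyone else debugging this, one way to see whether a model tag ships with a baked-in num_ctx is to ask the server for its Modelfile and parameters. A minimal sketch with the Python requests library, assuming a local server on the default port and that the /api/show response includes the parameters and modelfile fields (the model tag is just an example):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/show",
    json={"name": "mistral"},  # example model tag
)
info = resp.json()

# If the Modelfile sets num_ctx, it shows up here, e.g. "num_ctx 2048".
print(info.get("parameters", ""))
print(info.get("modelfile", "")[:500])
```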

@FellowTraveler commented on GitHub (May 26, 2024):

I think I have to create a custom modelfile whenever I want to be able to load up any model with a different num_ctx than the default of 2048.
This leaves me wondering if Ollama is making a copy of every model that I use in this way, or if it just references the original model without having to copy it on the hard drive.
Also it seems strange that I have to create a custom modelfile at all, and otherwise be stuck with a 2048 context window for every single model by default.

@mitar commented on GitHub (May 26, 2024):

You can provide num_ctx in the API call.
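
For example, with the native /api/chat endpoint the context size goes into the options object of the request body. A minimal sketch using the Python requests library (model name and message are placeholders; stream is disabled so the reply comes back as one JSON object):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Summarize this long document ..."}],
        "options": {"num_ctx": 8192},  # raise the context window for this request
        "stream": False,
    },
)
body = resp.json()
# prompt_eval_count / eval_count report how many tokens were actually processed.
print(body.get("prompt_eval_count"), body.get("eval_count"))
```

The same options object works with /api/generate.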

@andreashappe commented on GitHub (May 27, 2024):

You can provide num_ctx in the API call.

I tried this yesterday, but when using llama3 the used token count (according to the HTTP response) always stayed below 2k

@mitar commented on GitHub (May 28, 2024):

But is your prompt larger than 2k? Does your response require more than 2k?

@andreashappe commented on GitHub (May 28, 2024):

But is your prompt larger than 2k? Does your response require more than 2k?

Yes, I am using it within https://github.com/ipa-lab/hackingBuddyGPT to compare llama3 and OpenAI LLMs, and when looking at the stats reported by the HTTP response I can see that OpenAI uses 8k tokens for the request while llama3 always caps at <= 2k, even when setting the num_ctx parameter.

@mitar commented on GitHub (May 28, 2024):

Maybe it is a bug in the tool you are using.

@andreashappe commented on GitHub (May 28, 2024):

Quite sure it's not (I wrote that tool myself).

If I am using the OpenAI API directly, it uses (depending upon the use-case) 8-100k context size.

When I start llama3 with ollama and use its OpenAI-compatible API (and add the options -> num_ctx parameter; setting it to 4096 or 8192 does not matter) and keep all other things identical, the used context size is hard limited to 2k. I am using the token counts reported by the ollama OpenAI-compatible API, so I am not counting them myself.

So I am quite sure that it is not the tool itself.

@MarkoSagadin commented on GitHub (May 28, 2024):

@andreashappe I think I know what might be your issue with the token counting.

If you are using the OpenAI-compatible API, that means your text passes through the OpenAI SDK logic, gets to the Ollama server, which generates a response and passes it back again through the OpenAI SDK logic.

The number of input and generated tokens that you see in the response object is calculated by the OpenAI SDK, not by Ollama.

How can the OpenAI SDK know which model you are using, so that it can pick the correct tokenizer? I think it can't; it just uses the tiktoken library to calculate the number of tokens in strings.

The only way to count the tokens going in or out of Ollama (AFAIK) is with Hugging Face's Tokenizer class in the Transformers library.

See this function that I wrote to count the tokens: https://gitlab.com/peerdb/llm/-/blob/main/llm-tester/src/llm_tester/modules/llm_clients/ollama_token_counter.py?ref_type=heads#L48

One caveat: I was comparing this to the token count returned by the Groq API (specifically for the llama3:70b and llama3:8b models). I am 8 tokens short compared to it, but I don't know why exactly.
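
A minimal sketch of that counting approach with Hugging Face's transformers library; the repo id below is just an example and must match the model you are actually running (gated repos also need a Hub token):

```python
from transformers import AutoTokenizer

# Count tokens with the model's own tokenizer; tiktoken will not match
# Llama-family tokenization exactly.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

print(count_tokens("Please read the following scientific text ..."))
```

The handful of tokens you end up short by compared to a hosted API may well be the special tokens the chat template wraps around each message, which a plain encode() call does not add.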

@andreashappe commented on GitHub (May 28, 2024):

@MarkoSagadin With API I meant the ollama OpenAI-compatible API (https://github.com/ollama/ollama/blob/main/docs/api.md) with the model set to 'llama3' (so that it will use the correct one). I myself am only using the HTTP interface with a direct HTTP call (through the Python requests library). I never mentioned the SDK.

And the token count is the count that is returned by ollama's HTTP response, so I am quite sure that it knows what it is dealing with. Or am I getting the architecture totally wrong (which can always be the case)?
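
For reference, Ollama's OpenAI-compatible endpoint fills in the usage object on the server side (as far as I can tell it is mapped from Ollama's own prompt_eval_count / eval_count), so those numbers come from Ollama rather than from a client-side tokenizer. A minimal sketch with plain requests, assuming the default port and a pulled llama3 tag:

```python
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "hello"}],
    },
).json()

# usage is reported by the Ollama server itself, not computed by the client.
print(resp["usage"]["prompt_tokens"], resp["usage"]["completion_tokens"])
```

Note that, as far as I know, at the time of this thread the OpenAI-compatible endpoint ignored Ollama-specific options such as num_ctx, which would explain a hard 2k cap even when the parameter is sent; raising the context then required a Modelfile or the native /api/chat endpoint.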

@MarkoSagadin commented on GitHub (May 28, 2024):

Aha, I thought that you were talking about this OpenAI-compatible API (https://github.com/ollama/ollama/blob/main/docs/openai.md).

I agree with you; Ollama should know how to count tokens.

I need to check how the values in prompt_eval_count and eval_count compare against Groq.

@MarkoSagadin commented on GitHub (May 28, 2024):

They compare badly.

The table below is from the tool that I linked earlier.

(screenshot: token-count comparison table, https://github.com/ollama/ollama/assets/41839945/acded338-5c6c-4a7d-a987-661f10b0918e)

  • "Max context" is the context window size.
  • "Combined" is the combined token count of the system message and chat history (a list of user/assistant messages).
  • Each column after the "Combined" column shows the number of tokens for the combined content of the system message, chat history and the input file that was tested. The values for Ollama models use prompt_eval_count.

For the first file there is only one token of difference between the Groq and Ollama models.
After that the difference increases significantly.

Something is off with token counting...

@andreashappe commented on GitHub (May 28, 2024):

@MarkoSagadin I am also trying this locally right now (using llama3-8b). I am using the following answer fields for the token counts: response['usage']['prompt_tokens'] and response['usage']['completion_tokens'].

I log the prompts and the corresponding answer fields. When I look into the output, I see the following:

  • prompt: 2526 characters -> 436 request tokens, 21 completion tokens
  • prompt: 25325 characters -> 1660 request tokens, 98 completion tokens
  • prompt: 26090 characters -> 203 request tokens, 12 completion tokens

I am not sure why a larger prompt now creates a smaller request token count :-/ this is with the default 2048 context size.

When I now add the num_ctx option (https://github.com/ollama/ollama/blob/main/docs/faq.md):

  • prompt: 2399 characters -> 62 request tokens, 16 completion tokens
  • prompt: 25764 characters -> 1977 request tokens, 153 completion tokens (this is why I believe that there is a 2k limit in place)
  • prompt: 25772 characters -> 1980 request tokens, 113 completion tokens

So in the first run I do not understand how the request token count goes down while the prompt itself is more or less the same. In the second run, the request token count stays high, but seems to be cut off at roughly 2k.

I am not sure if this is a token counting problem or if I have set up the num_ctx parameter wrong (or if the setting is ignored).

@mitar commented on GitHub (May 28, 2024):

Possibly related: https://github.com/ollama/ollama/issues/3427

@andreashappe commented on GitHub (May 28, 2024):

Okay, I just switched the API URL from ollama (localhost:10434) to groq (and removed the options from the HTTP request) and the context size looks as expected (grows to around 8000). So I am quite sure that it might be something with the ollama HTTP OpenAI API.

@andreashappe commented on GitHub (May 28, 2024):

Possibly related: #3427

Good point. I think this might explain my first test run (where the token count gets lower)... but would this also explain the second run, where the context size seems to hit 2k and is then never increased (to the 8k that I am passing in the HTTP request)? Can I check somehow what is used as the context size? Maybe I am setting it wrong (but I don't get any error -.-)
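
One way to check from the client side (a sketch under assumptions: the tokenizer repo id is an example, and the native /api/generate endpoint is used so that options are honored) is to compare your own token estimate against the prompt_eval_count that comes back; the server log printed when a model loads is another place to look for the context size actually in effect.

```python
import requests
from transformers import AutoTokenizer

# Example tokenizer; use the one matching the model you run under Ollama.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

prompt = open("long_prompt.txt").read()          # placeholder input
estimated = len(tokenizer.encode(prompt))

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": prompt,
          "options": {"num_ctx": 8192}, "stream": False},
).json()

# If prompt_eval_count is far below the estimate, the prompt was truncated,
# or part of it was already in the prompt cache from an earlier call (#3427).
print("estimated:", estimated, "evaluated:", resp.get("prompt_eval_count"))
```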

@FellowTraveler commented on GitHub (May 29, 2024):

You can provide num_ctx in the API call.

Thank you, I found the global setting after you said this. I was looking for a way to configure it per-model, but this is better than nothing for sure.

I'm still curious if it actually makes a separate copy of the model weights when I "copy" a model to make a custom modelfile. Hopefully not.

I'm using "Ollama Web UI" FYI. Aka Open-WebUI

@itsPreto commented on GitHub (Jul 16, 2024):

Why is num_ctx not set to model's max context by default?

any updates on this?

@anrgct commented on GitHub (Jul 26, 2024):

When using open-webui, I've noticed that long contextual messages sent to ollama consistently result in poor responses. After investigating the issue, it appears that the /api/chat and /v1/chat/completions endpoints are defaulting to a 1k context limit. This means that when the content exceeds this length, the system automatically discards the earlier portions, leading to subpar answers. What follows is the captured network request data for open-webui version 0.3.8.

curl 'http://localhost:11434/api/chat' \
-X POST \
-H 'Host: localhost:11434' \
-H 'Accept: */*' \
-H 'User-Agent: Python/3.11 aiohttp/3.9.5' \
-H 'Content-Type: text/plain; charset=utf-8' \
--data-raw '{"model": "qwen1_5-4b-chat-q4_k_m", "messages": [{"role": "user", "content": "<long context>"}], "options": {}, "stream": true}' 

@chris-31337 commented on GitHub (Jul 28, 2024):

When using open-webui, I've noticed that long contextual messages sent to ollama consistently result in poor responses. After investigating the issue, it appears that the /api/chat and /v1/chat/completions endpoints are defaulting to a 1k context limit. This means that when the content exceeds this length, the system automatically discards the earlier portions, leading to subpar answers. What follows is the captured network request data for open-webui version 0.3.8.

curl 'http://localhost:11434/api/chat' \
-X POST \
-H 'Host: localhost:11434' \
-H 'Accept: */*' \
-H 'User-Agent: Python/3.11 aiohttp/3.9.5' \
-H 'Content-Type: text/plain; charset=utf-8' \
--data-raw '{"model": "qwen1_5-4b-chat-q4_k_m", "messages": [{"role": "user", "content": "<long context>"}], "options": {}, "stream": true}' 

@anrgct I'm experiencing the same in open-webui 0.3.10. Maybe you could help me to file a bug report in the open-webui repository based on your advanced analysis of the issue? How did you figure out the 1k context limit imposed on the chat api?

In my tests, I've compared the performance supplying the following test prompt to mistral-nemo either directly to ollama via console (with '/set parameter num_ctx 128000') or by using the webui (with "Context Length set to 128000"). The prompt is 'Please read the following scientific text and be prepared to answer questions. Do not summarize the text, just wait for my questions and confirm if you've read the entire text.', followed by a 96k character scientific text (estimated to be 20k tokens according to openai tokenizer).

Using ollama via webui gives an unsolicited summary and answers most questions wrong, indicating that it had not comprehended the entire text. Using ollama in console leads to 'I have read the scientific text provided. Please ask your questions' and correct answers on the text. I obtained similar results and differences between console and webui in other large context models, e.g. llama3.1, so it is not an issue of the model.

@anrgct commented on GitHub (Jul 28, 2024):

I opened a new issue about this bug. @chris-31337 https://github.com/ollama/ollama/issues/6026

@shenhai-ran commented on GitHub (Dec 9, 2024):

I found that under each model's page there is something called context_length (see the image below). I am wondering: is this JUST a piece of information listed here, or is it a parameter related to the model?

(screenshot: model page showing the context_length field, https://github.com/user-attachments/assets/96ee51a6-fab6-4cce-9241-511c83203df5)

Thanks!
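
On recent Ollama versions you can read that same value programmatically. A minimal sketch, under the assumption that /api/show returns a model_info map whose context-length key is prefixed with the model architecture (the tag is an example):

```python
import requests

info = requests.post(
    "http://localhost:11434/api/show",
    json={"name": "llama3.1"},  # example tag
).json()

model_info = info.get("model_info", {})
# The key is architecture-prefixed, e.g. "llama.context_length".
for key, value in model_info.items():
    if key.endswith("context_length"):
        print(key, "=", value)
```

That context_length describes what the model supports; at the time of this thread Ollama still ran with the num_ctx default (2048) unless it was raised per request or in a Modelfile.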

@neel6762 commented on GitHub (Oct 29, 2025):

  2. If it's still too big (e.g. a huge user message), then the prompt will roughly be split in half, opening up another 1/2 of the context window for new token generations (and it will continue doing this as tokens are generated)

@jmorganca I noticed this while playing with a larger input. For instance, if I set num_ctx to 50_000 and the input message exceeds this length, the chat method only reads 1/2 of the tokens. Is there a way to compute the number of tokens before passing the input to the model? This would simply mean using the model's tokenizer (but it's not accessible via Ollama) to get the size of the input in tokens before passing it to the model. Or am I missing something here...
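
Given that the model's tokenizer isn't accessible via Ollama (as noted above), one workaround discussed earlier in this thread is to count client-side with the matching Hugging Face tokenizer and leave headroom for the reply. A sketch, with the repo id and the reserve value as assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def fits(text: str, num_ctx: int, reserve_for_output: int = 1024) -> bool:
    """Return True if the input leaves at least reserve_for_output tokens
    of the context window free for generation."""
    return len(tokenizer.encode(text)) <= num_ctx - reserve_for_output

print(fits("some long input ...", num_ctx=50_000))
```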

Reference: github-starred/ollama#48139