[GH-ISSUE #2714] Misunderstanding of ollama num_ctx parameter and context window #48139

Closed
opened 2026-04-28 06:49:35 -05:00 by GiteaMirror · 27 comments

Originally created by @PhilipAmadasun on GitHub (Feb 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2714

I'm trying to understand the relationship between the context window and the num_ctx parameter. Let's say I'm using mistral, and mistral's max context (according to Google) is 8000, and its "attention span" (according to Google) is 128000. If I have a 27000-token user query, what exactly happens? If I set num_ctx: 4096, does mistral just grab the last 4096-token sequence from the 27K user query, and then process that 4096-token sequence along with the 128K window it grabs from the previously established overall context (in the case of the RESTful API, I'm talking about that body['context'] thing)?

@jmorganca commented on GitHub (Feb 23, 2024):

Hi there,

Two things happen,

  1. If you are using the Chat API, it will only send as many messages as can fit in the context window.
  2. If it's still too big (e.g. a huge user message), then the prompt will roughly be split in half, opening up another 1/2 of the context window for new token generations (and it will continue doing this as tokens are generated)

There's a lot of work to do to improve this further - would love any feedback

Hope this helps
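
To make point 2 above concrete, here is a rough sketch of the idea in Python. It is an illustration only, not Ollama's actual implementation: the token list, the num_keep value, and the helper name are all made up for the example.

```python
def fit_prompt(tokens, num_ctx, num_keep=4):
    """Illustrative only: if the prompt alone overflows num_ctx, keep the
    first num_keep tokens (e.g. a system prefix) plus roughly the last half
    of the window, freeing the other half for newly generated tokens."""
    if len(tokens) <= num_ctx:
        return tokens
    half = (num_ctx - num_keep) // 2
    return tokens[:num_keep] + tokens[-half:]

# A 27,000-token prompt with num_ctx=4096 keeps only ~2k trailing tokens.
kept = fit_prompt(list(range(27_000)), num_ctx=4096)
print(len(kept))  # -> 2050
```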

@PhilipAmadasun commented on GitHub (Feb 23, 2024):

@jmorganca So if the user query is 27K tokens, and mistral's max tokens it can take as input from the current user query is 8K, the 27K will be split to 14K and then to 7K? If so, then we have 4 sets of 7K tokens. Then each set goes in as input to the model one at a time? I'm sorry for my confusion; if possible please use numbers in your explanation so maybe it can be clearer to me.

Just to make sure, when you say "context window" do you mean "attention span"? As in how much of the previous query and answer pairs the model can take in for context?

Or do you mean "context window" as in the maximum amount of tokens from the current user query that the model can take in as input?

I ask in this way because according to the mistral doc (https://huggingface.co/docs/transformers/main/en/model_doc/mistral), mistral has an "8k context length and fixed cache size, with a theoretical attention span of 128K tokens". Not sure what the difference between "context length" and "attention span" means according to the docs.

@luc99hen commented on GitHub (Apr 9, 2024):

I also have this question. There is ambiguity between num_ctx for ollama and context window for a model. Or in other words, could you give some advice on how to set this parameter for a new model? @jmorganca

@fgenie commented on GitHub (May 19, 2024):

Why is this closed w/o any conclusion?

@mitar commented on GitHub (May 23, 2024):

We also had to debug this for days, until we found a hidden num_ctx in the model documentation (not the API documentation) which was artificially lowering the allowed context from the model's 8k to the 2k default, making outputs really bad for our 4k prompt. Why is num_ctx not set to the model's max context by default?
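
For anyone else debugging this, one way to see whether a model tag ships with a baked-in num_ctx is to ask the server for its Modelfile and parameters. A minimal sketch with the Python requests library, assuming a local server on the default port and that the /api/show response includes the parameters and modelfile fields (the model tag is just an example):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/show",
    json={"name": "mistral"},  # example model tag
)
info = resp.json()

# If the Modelfile sets num_ctx, it shows up here, e.g. "num_ctx 2048".
print(info.get("parameters", ""))
print(info.get("modelfile", "")[:500])
```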

@FellowTraveler commented on GitHub (May 26, 2024):

I think I have to create a custom modelfile whenever I want to be able to load up any model with a different num_ctx than the default of 2048.
This leaves me wondering if Ollama is making a copy of every model that I use in this way, or if it just references the original model without having to copy it on the hard drive.
Also it seems strange that I have to create a custom modelfile at all, and otherwise be stuck with a 2048 context window for every single model by default.

@mitar commented on GitHub (May 26, 2024):

You can provide num_ctx in the API call.
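
For example, with the native /api/chat endpoint the context size goes into the options object of the request body. A minimal sketch using the Python requests library (model name and message are placeholders; stream is disabled so the reply comes back as one JSON object):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Summarize this long document ..."}],
        "options": {"num_ctx": 8192},  # raise the context window for this request
        "stream": False,
    },
)
body = resp.json()
# prompt_eval_count / eval_count report how many tokens were actually processed.
print(body.get("prompt_eval_count"), body.get("eval_count"))
```

The same options object works with /api/generate.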

@andreashappe commented on GitHub (May 27, 2024):

You can provide num_ctx in the API call.

I tried this yesterday, but when using llama3 the used token count (according to the HTTP response) always stayed below 2k

@mitar commented on GitHub (May 28, 2024):

But is your prompt larger than 2k? Does your response require more than 2k?

@andreashappe commented on GitHub (May 28, 2024):

But is your prompt larger than 2k? Does your response require more than 2k?

Yes, I am using it within https://github.com/ipa-lab/hackingBuddyGPT to compare llama3 and OpenAI LLMs, and when looking at the stats reported by the HTTP response I can see that OpenAI uses 8k tokens for the request while llama3 always caps at <= 2k, even when setting the num_ctx parameter.

@mitar commented on GitHub (May 28, 2024):

Maybe it is a bug in the tool you are using.

@andreashappe commented on GitHub (May 28, 2024):

Quite sure it's not (I wrote that tool myself).

If I am using the OpenAI API directly, it uses (depending upon the use-case) 8-100k context size.

When I start llama3 with ollama and use its OpenAI-compatible API (and add the options -> num_ctx parameter; setting it to 4096 or 8192 does not matter) and keep all other things identical, the used context size is hard limited to 2k. I am using the token counts reported by the ollama OpenAI-compatible API, so I am not counting them myself.

So I am quite sure that it is not the tool itself.

@MarkoSagadin commented on GitHub (May 28, 2024):

@andreashappe I think I know what might be your issue with the token counting.

If you are using the OpenAI-compatible API, that means your text passes through the OpenAI SDK logic, gets to the Ollama server, which generates a response and passes it back again through the OpenAI SDK logic.

The number of input and generated tokens that you see in the response object is calculated by the OpenAI SDK, not by Ollama.

How can the OpenAI SDK know which model you are using, so that it can pick the correct tokenizer? I think it can't; it just uses the tiktoken library to calculate the number of tokens in strings.

The only way to count the tokens going in or out of Ollama (AFAIK) is with Hugging Face's Tokenizer class in the Transformers library.

See this function that I wrote to count the tokens: https://gitlab.com/peerdb/llm/-/blob/main/llm-tester/src/llm_tester/modules/llm_clients/ollama_token_counter.py?ref_type=heads#L48

One caveat: I was comparing this to the token count returned by the Groq API (specifically for the llama3:70b and llama3:8b models). I am 8 tokens short compared to it, but I don't know why exactly.
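
A minimal sketch of that counting approach with Hugging Face's transformers library; the repo id below is just an example and must match the model you are actually running (gated repos also need a Hub token):

```python
from transformers import AutoTokenizer

# Count tokens with the model's own tokenizer; tiktoken will not match
# Llama-family tokenization exactly.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

print(count_tokens("Please read the following scientific text ..."))
```

The handful of tokens you end up short by compared to a hosted API may well be the special tokens the chat template wraps around each message, which a plain encode() call does not add.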

@andreashappe commented on GitHub (May 28, 2024):

@MarkoSagadin With API I meant the ollama OpenAI-compatible API (https://github.com/ollama/ollama/blob/main/docs/api.md) with the model set to 'llama3' (so that it will use the correct one). I myself am only using the HTTP interface with a direct HTTP call (through the Python requests library). I never mentioned the SDK.

And the token count is the count that is returned by ollama's HTTP response, so I am quite sure that it knows what it is dealing with. Or am I getting the architecture totally wrong (which can always be the case)?
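
For reference, Ollama's OpenAI-compatible endpoint fills in the usage object on the server side (as far as I can tell it is mapped from Ollama's own prompt_eval_count / eval_count), so those numbers come from Ollama rather than from a client-side tokenizer. A minimal sketch with plain requests, assuming the default port and a pulled llama3 tag:

```python
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "hello"}],
    },
).json()

# usage is reported by the Ollama server itself, not computed by the client.
print(resp["usage"]["prompt_tokens"], resp["usage"]["completion_tokens"])
```

Note that, as far as I know, at the time of this thread the OpenAI-compatible endpoint ignored Ollama-specific options such as num_ctx, which would explain a hard 2k cap even when the parameter is sent; raising the context then required a Modelfile or the native /api/chat endpoint.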

@MarkoSagadin commented on GitHub (May 28, 2024):

Aha, I thought that you were talking about this OpenAI-compatible API (https://github.com/ollama/ollama/blob/main/docs/openai.md).

I agree with you; Ollama should know how to count tokens.

I need to check how the values in prompt_eval_count and eval_count compare against Groq.

@MarkoSagadin commented on GitHub (May 28, 2024):

They compare badly.

The table below is from the tool that I linked earlier.

(screenshot: token-count comparison table, https://github.com/ollama/ollama/assets/41839945/acded338-5c6c-4a7d-a987-661f10b0918e)

  • "Max context" is the context window size.
  • "Combined" is the combined token count of the system message and chat history (a list of user/assistant messages).
  • Each column after the "Combined" column shows the number of tokens for the combined content of the system message, chat history and the input file that was tested. The values for Ollama models use prompt_eval_count.

For the first file there is only one token of difference between the Groq and Ollama models.
After that the difference increases significantly.

Something is off with token counting...

@andreashappe commented on GitHub (May 28, 2024):

@MarkoSagadin I am also trying this locally right now (using llama3-8b). I am using the following answer fields for the token counts: response['usage']['prompt_tokens'] and response['usage']['completion_tokens'].

I log the prompts and the corresponding answer fields. When I look into the output, I see the following:

  • prompt: 2526 characters -> 436 request tokens, 21 completion tokens
  • prompt: 25325 characters -> 1660 request tokens, 98 completion tokens
  • prompt: 26090 characters -> 203 request tokens, 12 completion tokens

I am not sure why a larger prompt now creates a smaller request token count :-/ this is with the default 2048 context size.

When I now add the num_ctx option (https://github.com/ollama/ollama/blob/main/docs/faq.md):

  • prompt: 2399 characters -> 62 request tokens, 16 completion tokens
  • prompt: 25764 characters -> 1977 request tokens, 153 completion tokens (this is why I believe that there is a 2k limit in place)
  • prompt: 25772 characters -> 1980 request tokens, 113 completion tokens

So in the first run I do not understand how the request token count goes down while the prompt itself is more or less the same. In the second run, the request token count stays high, but seems to be cut off at roughly 2k.

I am not sure if this is a token counting problem or if I have set up the num_ctx parameter wrong (or if the setting is ignored).

@mitar commented on GitHub (May 28, 2024):

Possibly related: https://github.com/ollama/ollama/issues/3427

@andreashappe commented on GitHub (May 28, 2024):

Okay, I just switched the API URL from ollama (localhost:10434) to groq (and removed the options from the HTTP request) and the context size looks as expected (grows to around 8000). So I am quite sure that it might be something with the ollama HTTP OpenAI API.

@andreashappe commented on GitHub (May 28, 2024):

Possibly related: #3427

Good point. I think this might explain my first test run (where the token count gets lower)... but would this also explain the second run, where the context size seems to hit 2k and is then never increased (to the 8k that I am passing in the HTTP request)? Can I check somehow what is used as the context size? Maybe I am setting it wrong (but I don't get any error -.-)
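
One way to check from the client side (a sketch under assumptions: the tokenizer repo id is an example, and the native /api/generate endpoint is used so that options are honored) is to compare your own token estimate against the prompt_eval_count that comes back; the server log printed when a model loads is another place to look for the context size actually in effect.

```python
import requests
from transformers import AutoTokenizer

# Example tokenizer; use the one matching the model you run under Ollama.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

prompt = open("long_prompt.txt").read()          # placeholder input
estimated = len(tokenizer.encode(prompt))

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": prompt,
          "options": {"num_ctx": 8192}, "stream": False},
).json()

# If prompt_eval_count is far below the estimate, the prompt was truncated,
# or part of it was already in the prompt cache from an earlier call (#3427).
print("estimated:", estimated, "evaluated:", resp.get("prompt_eval_count"))
```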

@FellowTraveler commented on GitHub (May 29, 2024):

You can provide num_ctx in the API call.

Thank you, I found the global setting after you said this. I was looking for a way to configure it per-model, but this is better than nothing for sure.

I'm still curious if it actually makes a separate copy of the model weights when I "copy" a model to make a custom modelfile. Hopefully not.

I'm using "Ollama Web UI" FYI. Aka Open-WebUI

@itsPreto commented on GitHub (Jul 16, 2024):

Why is num_ctx not set to model's max context by default?

any updates on this?

@anrgct commented on GitHub (Jul 26, 2024):

When using open-webui, I've noticed that long contextual messages sent to ollama consistently result in poor responses. After investigating the issue, it appears that the /api/chat and /v1/chat/completions endpoints are defaulting to a 1k context limit. This means that when the content exceeds this length, the system automatically discards the earlier portions, leading to subpar answers. What follows is the captured network request data for open-webui version 0.3.8.

curl 'http://localhost:11434/api/chat' \
-X POST \
-H 'Host: localhost:11434' \
-H 'Accept: */*' \
-H 'User-Agent: Python/3.11 aiohttp/3.9.5' \
-H 'Content-Type: text/plain; charset=utf-8' \
--data-raw '{"model": "qwen1_5-4b-chat-q4_k_m", "messages": [{"role": "user", "content": "<long context>"}], "options": {}, "stream": true}' 

@chris-31337 commented on GitHub (Jul 28, 2024):

When using open-webui, I've noticed that long contextual messages sent to ollama consistently result in poor responses. After investigating the issue, it appears that the /api/chat and /v1/chat/completions endpoints are defaulting to a 1k context limit. This means that when the content exceeds this length, the system automatically discards the earlier portions, leading to subpar answers. What follows is the captured network request data for open-webui version 0.3.8.

curl 'http://localhost:11434/api/chat' \
-X POST \
-H 'Host: localhost:11434' \
-H 'Accept: */*' \
-H 'User-Agent: Python/3.11 aiohttp/3.9.5' \
-H 'Content-Type: text/plain; charset=utf-8' \
--data-raw '{"model": "qwen1_5-4b-chat-q4_k_m", "messages": [{"role": "user", "content": "<long context>"}], "options": {}, "stream": true}' 

@anrgct I'm experiencing the same in open-webui 0.3.10. Maybe you could help me to file a bug report in the open-webui repository based on your advanced analysis of the issue? How did you figure out the 1k context limit imposed on the chat api?

In my tests, I've compared the performance supplying the following test prompt to mistral-nemo either directly to ollama via console (with '/set parameter num_ctx 128000') or by using the webui (with "Context Length set to 128000"). The prompt is 'Please read the following scientific text and be prepared to answer questions. Do not summarize the text, just wait for my questions and confirm if you've read the entire text.', followed by a 96k character scientific text (estimated to be 20k tokens according to openai tokenizer).

Using ollama via webui gives an unsolicited summary and answers most questions wrong, indicating that it had not comprehended the entire text. Using ollama in console leads to 'I have read the scientific text provided. Please ask your questions' and correct answers on the text. I obtained similar results and differences between console and webui in other large context models, e.g. llama3.1, so it is not an issue of the model.

@anrgct commented on GitHub (Jul 28, 2024):

I opened a new issue about this bug. @chris-31337 https://github.com/ollama/ollama/issues/6026

@shenhai-ran commented on GitHub (Dec 9, 2024):

I found that under each model's page there is something called context_length (see the image below). I am wondering: is this JUST a piece of information listed here, or is it a parameter related to the model?

(screenshot: model page showing the context_length field, https://github.com/user-attachments/assets/96ee51a6-fab6-4cce-9241-511c83203df5)

Thanks!
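
On recent Ollama versions you can read that same value programmatically. A minimal sketch, under the assumption that /api/show returns a model_info map whose context-length key is prefixed with the model architecture (the tag is an example):

```python
import requests

info = requests.post(
    "http://localhost:11434/api/show",
    json={"name": "llama3.1"},  # example tag
).json()

model_info = info.get("model_info", {})
# The key is architecture-prefixed, e.g. "llama.context_length".
for key, value in model_info.items():
    if key.endswith("context_length"):
        print(key, "=", value)
```

That context_length describes what the model supports; at the time of this thread Ollama still ran with the num_ctx default (2048) unless it was raised per request or in a Modelfile.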

@neel6762 commented on GitHub (Oct 29, 2025):

  2. If it's still too big (e.g. a huge user message), then the prompt will roughly be split in half, opening up another 1/2 of the context window for new token generations (and it will continue doing this as tokens are generated)

@jmorganca I noticed this while playing with a larger input. For instance, if I set num_ctx to 50_000 and the input message exceeds this length, the chat method only reads 1/2 of the tokens. Is there a way to compute the number of tokens before passing the input to the model? This would simply mean using the model's tokenizer (but it's not accessible via Ollama) to get the size of the input in tokens before passing it to the model. Or am I missing something here...
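
Given that the model's tokenizer isn't accessible via Ollama (as noted above), one workaround discussed earlier in this thread is to count client-side with the matching Hugging Face tokenizer and leave headroom for the reply. A sketch, with the repo id and the reserve value as assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def fits(text: str, num_ctx: int, reserve_for_output: int = 1024) -> bool:
    """Return True if the input leaves at least reserve_for_output tokens
    of the context window free for generation."""
    return len(tokenizer.encode(text)) <= num_ctx - reserve_for_output

print(fits("some long input ...", num_ctx=50_000))
```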

Reference: github-starred/ollama#48139