[GH-ISSUE #2204] Questions about context size #1259

Closed
opened 2026-04-12 11:02:50 -05:00 by GiteaMirror · 9 comments

Originally created by @swip3798 on GitHub (Jan 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2204

Before I start, thank you for this amazing project! It's really great to run LLMs on my own hardware this easily.

I am currently building a small story-writing application that uses ollama to provide a "cowriter" AI that writes along with the user, similar to how AIDungeon or NovelAI work. Since the stories have no size limit, they will eventually become larger than the context size of the model. This has led me to multiple questions about how exactly ollama handles cases where the prompt is larger than the context size of the chosen model. Will it get trimmed, and if so, how exactly? Is the template always kept in the context and only the prompt trimmed, or will it be cut off too? Or do I understand this completely wrong?

Additionally, the users of my app should be able to add a "long term memory", essentially just more text that will be put at the beginning of the prompt, so that the AI keeps information about parts of the story that are already outside the context window. That of course makes it necessary that this memory text definitely stays in the model's context.

Now, all of this would be fairly simple to implement myself if there were a tokenize/detokenize endpoint. I have seen the issues regarding that, so maybe this can also be achieved using the chat endpoint? But then again, what happens when the context size is exceeded?

Sorry for all those questions at once; I would be really thankful if you could share some insights into how this works.
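
For reference, the "pin the memory, trim the story" part can be approximated client-side even without a tokenize endpoint. A minimal sketch, assuming roughly 4 characters per token (the helper names and the estimate are illustrative, not anything ollama provides):

```python
def approx_tokens(text: str) -> int:
    # Crude stand-in for the model's real tokenizer: ~4 characters per token.
    return max(1, len(text) // 4)

def build_prompt(memory: str, story: str, num_ctx: int = 2048, reserve: int = 256) -> str:
    """Keep `memory` verbatim at the top, then fill the remaining budget with the
    most recent story text. `reserve` leaves room for the template and the reply."""
    budget = num_ctx - reserve - approx_tokens(memory)
    kept = []
    used = 0
    for para in reversed(story.split("\n\n")):  # walk backwards: newest text first
        cost = approx_tokens(para)
        if used + cost > budget:
            break
        kept.insert(0, para)
        used += cost
    return memory + "\n\n" + "\n\n".join(kept)
```

Whatever this returns is what gets sent as the prompt, so the memory block is guaranteed to sit inside the configured window instead of relying on any server-side trimming.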


@Jurik-001 commented on GitHub (Jan 26, 2024):

I asked myself exactly the same question: "This has led me to multiple questions about how exactly ollama handles cases where the prompt is larger than the context size of the chosen model. Will it get trimmed, and if so, how exactly?"

I found the following: ollama uses, if I get it right, llama.cpp, so I searched for how llama.cpp handles exceeding the context size and found a post where someone said:
"By default llama.cpp limits it to 512, but you can use -c 2048 -n 2048 to get the full context window."
Post: https://news.ycombinator.com/item?id=35186185

Then I searched through the llama.cpp issues and found the following discussion: https://github.com/ggerganov/llama.cpp/discussions/1838. It is about the parameter `-c N, --ctx-size N: Set the size of the prompt context`. It also discusses a code path for infinite text generation through context swapping, which is not the same as a model that can take the full input. Quoting an answer to the question of what infinite text generation means there: "It allows you to keep generating tokens past the normal context limit (possibly infinitely) but it does that by overwriting part of the context with the prompt and generating new tokens into that context. It's not the same as having infinite context length."

So the question is whether ollama uses that.

UPDATE:
I found additional information in modelfile.md (https://github.com/ollama/ollama/blob/197e420a97167c702973243563b72eb70b0e6786/docs/modelfile.md):

| Parameter | Description | Value Type | Example Usage |
| --- | --- | --- | --- |
| num_predict | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) | int | num_predict 42 |

but if you execute, for example:
`ollama show llama2 --parameters`
you get something like:

stop "[INST]"
stop "[/INST]"
...
So it is still not specified how many tokens the model will predict.
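
For completeness, even though `ollama show ... --parameters` does not list it, `num_predict` can be set per request through the `options` field of the generate API. A sketch with example values only (model name and numbers are placeholders):

```python
import json
import urllib.request

payload = {
    "model": "llama2",
    "prompt": "Continue the story: the door creaked open and",
    "stream": False,
    "options": {
        # Cap on generated tokens; per the docs, -1 = infinite, -2 = fill context.
        "num_predict": 256,
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```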


@swip3798 commented on GitHub (Jan 26, 2024):

I also found #1963; there seems to be a pull request already related to trimming the prompt for the chat endpoint. If I understand this correctly, it would make sure that the template and system message are preserved completely.


@jukofyork commented on GitHub (Jan 28, 2024):

Yeah, I've just been thinking about this too as I'm sending large amounts of code and can quickly approach the 16k context window.

It would be nice to have a clear understanding of exactly how the chat completion API will handle this.

For my use case I'd most like to keep the system prompt and then do a "first in, first out" removal of the oldest messages (or user/assistant message pairs for the chat completion API) so as to never go over the context limit.

Just having an API endpoint to count tokens would be enough for me to do this myself (actually, I just found that somebody in the feature request thread mentioned you can get this from the embedding API call by counting the number of items returned!?).
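
A sketch of that strategy, done client-side before calling the chat endpoint. The character-based token estimate below is an assumption standing in for a real tokenize endpoint; none of this is behaviour ollama provides:

```python
def approx_tokens(messages) -> int:
    # Rough estimate: ~4 characters per token plus a little per-message overhead.
    return sum(len(m["content"]) // 4 + 4 for m in messages)

def trim_history(messages, num_ctx=16384, reserve=1024):
    """Keep the system message(s) and drop the oldest user/assistant pairs until
    the estimated size fits inside the context window, leaving `reserve` for the reply."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and approx_tokens(system + rest) > num_ctx - reserve:
        rest = rest[2:]  # drop the oldest user/assistant pair
    return system + rest
```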


@jmorganca commented on GitHub (May 10, 2024):

The context limit defaults to 2048; it can be made larger with the `num_ctx` parameter in the API. However, for large amounts of data, folks often use a workflow called RAG to store data outside of the context window and bring in chunks where required. Hope this helps!
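
As a concrete example, the `options` field on `/api/chat` is where `num_ctx` goes (model name and value are placeholders, not recommendations):

```python
import json
import urllib.request

payload = {
    "model": "llama2",
    "messages": [{"role": "user", "content": "Summarise this chapter for me."}],
    "stream": False,
    "options": {"num_ctx": 8192},  # raise the context window for this model load
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])
```

Note that requests asking for different `num_ctx` values cause the model to be reloaded, which is the load/unload churn described further down in this thread.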


@nikhil-swamix commented on GitHub (Oct 10, 2024):

Can't I do `ollama setdefault num_ctx=8192` or similar? Lacking this critical feature, do I have to create a proxy so the model does not re-initialize with requests of different `num_ctx`? I'm getting a migraine watching models load and unload as I make HTTP requests. Please suggest the best method.
@jmorganca


@homjay commented on GitHub (Nov 6, 2024):

> Can't I do `ollama setdefault num_ctx=8192` or similar? Lacking this critical feature, do I have to create a proxy so the model does not re-initialize with requests of different `num_ctx`? I'm getting a migraine watching models load and unload as I make HTTP requests. Please suggest the best method. @jmorganca

This is a really important feature.

Not all API proxies support Ollama's context parameters, especially those that only integrate the OpenAI API, which lacks a context-size parameter. A default context size of 2048 is not sufficient for most tasks, and the latest models already support context sizes over 32768.

Allowing custom context settings would free us from a lot of trouble.
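
One workaround that is available today, sketched below: bake `num_ctx` into a derived model tag with a Modelfile via `ollama create`, so clients that only speak the OpenAI API get the larger window without sending any Ollama-specific options. The base model, tag name, and value here are examples, not project recommendations:

```python
import os
import subprocess
import tempfile

# Modelfile that inherits a base model and pins a larger context window.
modelfile = """\
FROM llama3.1
PARAMETER num_ctx 32768
"""

with tempfile.NamedTemporaryFile("w", suffix=".Modelfile", delete=False) as f:
    f.write(modelfile)
    path = f.name

try:
    # Equivalent to running: ollama create llama3.1-32k -f <Modelfile>
    subprocess.run(["ollama", "create", "llama3.1-32k", "-f", path], check=True)
finally:
    os.unlink(path)
```

An OpenAI-compatible client or proxy can then simply request the `llama3.1-32k` tag and never needs to pass `num_ctx` itself.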


@nikhil-swamix commented on GitHub (Nov 6, 2024):

> The context limit defaults to 2048; it can be made larger with the `num_ctx` parameter in the API. However, for large amounts of data, folks often use a workflow called RAG to store data outside of the context window and bring in chunks where required. Hope this helps!

This is not true; chunked RAG is a poor fit here. In other words, at least 500 tokens fetched for each of the top 3 results would be roughly 75% of 2048, leaving a mere ~500 tokens to work with. And folks using ollama usually know their hardware requirements; especially in production they get bigger machines for safety and accommodate compute and memory. An article explains this:
![image](https://github.com/user-attachments/assets/285d006b-7264-4be7-a349-e733c2dab9cf)

As @homjay pointed out, people are ever hungry for longer context, since it compensates for fine-tuning and increases relevance. As big as 2048 may sound, it is really a number from 2022... and most such settings are better configured once on the server side rather than per API call. On a side note, I would like to raise a PR that adds enhanced settings for configuring defaults per model tag, e.g. `ollama setenv <TAG> num_ctx <VALUE>`, since there are other settings like GPU/layer splitting that are likewise set via the API today and lead to reloading in most cases.
Please guide me, @jmorganca; I'm not a Go expert, but I will try my best.


@homjay commented on GitHub (Nov 6, 2024):

I agree with @nikhil-swamix that chunking RAG isn't a universal solution and can overcomplicate simple tasks.
And my experience shows that RAG can struggle, especially compared to using longer contexts. Embeddings sometimes make errors, and the embedding process itself takes time. The key advantage of RAG is its ability to process very long documents, such as million-word novels or a large number of files. However, it isn't specifically designed to handle long contexts.

While some OpenAI API proxy projects, like one-api, may allow setting a fixed context size when using Ollama in the future, this isn't universally supported; Cline, for example, does not.
Other inference projects like vLLM and LocalAI allow setting the context size when the model is initialized. Because Ollama is designed to be open-source and user-friendly, adding context-size control would further enhance its accessibility, particularly for users new to LLMs.


@qiulang commented on GitHub (Dec 16, 2024):

"However for large amounts of data, folks often use a workflow called RAG to store data outside of the context window and bring in chunks where required. " I don't think this is the case.

The chunks RAG retrieves are put into the prompt, which takes up the context window.
