[GH-ISSUE #5480] How Does Llama3 Handle Dialogs Exceeding the Context Window? #29186

Closed
opened 2026-04-22 07:53:30 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @flyboss on GitHub (Jul 4, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5480

What is the issue?

I have noticed an issue with my dialogs where the total token count does not exceed 8k, but Llama ignores the earliest content, leading to completely incorrect responses. Here is my code:

import json
import requests

# c1, c2, c3 are all long strings. Their token counts are 5168, 453, and 782,
# which I counted by sending them individually to Ollama for statistics.
messages = [
    {
        "role": "user",
        "content": c1
    },
    {
        "role": "assistant",
        "content": c2
    },
    {
        "role": "user",
        "content": c3
    }
]
payload = json.dumps({
    "model": "llama3:70b",
    "messages": messages,
    "stream": False,
    "options": {
        "seed": 101,
        "temperature": 0,
        "num_ctx": 8192
    },
    "keep_alive": "15m"
})
response = requests.request("POST", "http://localhost:11434/api/chat",
                            headers={'Content-Type': 'application/json'}, data=payload)
response_obj = json.loads(response.text)
print(response_obj)

I find the response is completely wrong, and the result reports 'prompt_eval_count': 782. This indicates that c1 and c2 were ignored entirely.

The total token count of c1 + c2 + c3 is only 6,403, which is far below 8k. Why is this happening?
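
For reference, here is a minimal sketch of the counting method mentioned in the code comment above: each string is sent on its own and the prompt_eval_count reported in the response is read back. The use of /api/generate and the num_predict option here is just one way to do it, and the count includes a handful of template tokens, so treat the numbers as approximate.

```python
import json
import requests

def approx_token_count(text: str, model: str = "llama3:70b") -> int:
    # Send the string by itself and read prompt_eval_count from the response.
    # The count includes a few special/template tokens, so it is approximate.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        headers={"Content-Type": "application/json"},
        data=json.dumps({
            "model": model,
            "prompt": text,
            "stream": False,
            "options": {"num_predict": 1},  # we only need the prompt statistics
        }),
    )
    return resp.json().get("prompt_eval_count", 0)

# c1, c2, c3 are the long strings from the example above
for name, text in [("c1", c1), ("c2", c2), ("c3", c3)]:
    print(name, approx_token_count(text))
```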

Additionally, I found an interesting but strange phenomenon. When I change the request to the following:

messages = [
    {
        "role": "user",
        "content": c1 + c2 + c3
    }
]

The returned answer is correct (the model knows the information in c1 and c2), and the 'prompt_eval_count' is 6391.

I found that #2714 also discusses the context topic, but it didn't solve my problem.

Could anyone help me? Thanks a lot!

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

GiteaMirror added the bug label 2026-04-22 07:53:30 -05:00
Author
Owner

@jmorganca commented on GitHub (Jul 4, 2024):

Hi @flyboss. You may want to upgrade to 0.1.48, as it may have performance improvements around this.

However, by default Ollama limits the context window to 2048 tokens. It can be extended with either:

  1. the `num_ctx` option in the API
  2. `/set parameter num_ctx 8192` in `ollama run`

Otherwise the prompt will be truncated to the context window size. Hope this helps. Note: we're working on extending the default context window; it's just that the memory requirements are quite large for larger context windows, so we default to 2048 for now.
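
Building on that last point, one way an application can notice truncation after the fact is to compare the prompt_eval_count reported by the server against the token count it expected to send; a value far below the expectation suggests the prompt was cut down to fit num_ctx. A minimal sketch, where the expected_tokens value and the 90% threshold are arbitrary examples rather than anything defined by the Ollama API:

```python
import json
import requests

payload = json.dumps({
    "model": "llama3:70b",
    "messages": messages,          # the conversation defined earlier in this issue
    "stream": False,
    "options": {"num_ctx": 8192},  # example value; larger windows need more memory
})
resp = requests.post("http://localhost:11434/api/chat",
                     headers={"Content-Type": "application/json"}, data=payload)
result = resp.json()

expected_tokens = 6403  # whatever count you measured for the full conversation
# Note: prompt caching can also lower prompt_eval_count on repeated requests,
# so treat this check as a heuristic rather than a definitive signal.
if result.get("prompt_eval_count", 0) < expected_tokens * 0.9:
    print("warning: prompt_eval_count is far below the expected token count; "
          "the prompt was probably truncated, so consider raising num_ctx "
          "or shortening the input")
```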

Author
Owner

@g0t4 commented on GitHub (Jul 23, 2024):

@jmorganca how is truncation performed? Does it take the first N tokens, the last N, or something else?

Author
Owner

@MarcSchluperAtIntel commented on GitHub (Sep 19, 2024):

My own experiments strongly suggest that if I provide a prompt of 8K tokens with a context window of 2K, the first 6K of my prompt is simply silently ignored. So if this concerns a summary of a lengthy meeting, you get a nice summary in perfect English that does not include anything from the first 75% of the meeting, _without a warning_.
As a software application developer I consider this behavior a bug. Sure, it is a user error, but those need to be handled too.

So my conclusion is that we need to adapt `num_ctx` to the size of the prompt, accepting longer execution times for larger prompts, and yes, if the prompt is long we need to have enough memory available or else kindly reject the request. ("We" here being the developers of the applications that use Ollama.)
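
A minimal sketch of that approach, assuming a crude characters-per-token heuristic for sizing (a real application would count tokens properly) and an application-chosen ceiling beyond which the request is rejected rather than silently truncated; the helper names and the 512-token headroom are illustrative:

```python
import json
import requests

MAX_CTX = 8192  # ceiling the application is willing to pay for in memory and time


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return len(text) // 4 + 1


def chat(messages, model="llama3:70b"):
    prompt_text = "".join(m["content"] for m in messages)
    needed = estimate_tokens(prompt_text) + 512  # headroom for the reply
    if needed > MAX_CTX:
        raise ValueError(f"prompt needs roughly {needed} tokens, above the {MAX_CTX} "
                         "limit; rejecting instead of silently truncating")
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": needed},
    })
    resp = requests.post("http://localhost:11434/api/chat",
                         headers={"Content-Type": "application/json"}, data=payload)
    return resp.json()
```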

Author
Owner

@ZhengRui commented on GitHub (Nov 11, 2024):

I have a similar issue. The very weird thing is that the number of tokens kept at the end is much less than the context size, as @flyboss's example showed. I understand it will truncate the earliest tokens, but shouldn't the number of tokens kept at the end be around the context size? @jmorganca
