[GH-ISSUE #2595] Conversation context no longer taken into account? #1527

Closed
opened 2026-04-12 11:26:17 -05:00 by GiteaMirror · 14 comments

Originally created by @dictoon on GitHub (Feb 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2595

I'm running ollama version 0.1.25 on macOS.

It looks like the LLM is no longer taking earlier messages into account, even though they definitely fit in the context window of the models I'm using.

I'm having a conversation like this:

- User: Here is some text, please summarize it.
- Assistant: <outputs a summary>
- User: Now, please summarize what you just wrote.
- Assistant: <outputs a completely unrelated summary>

I've tried both the llama2 and mixtral models. I've tried with the Open WebUI interface, directly with ollama run --verbose llama2, and with the OpenAI API talking to my locally-running Ollama.

I'm always observing the same behavior: the model simply ignores all context in my second query.

This used to work just fine before I updated Ollama (I was using a version a few weeks old, but I don't recall which).


@dictoon commented on GitHub (Feb 19, 2024):

Here's ollama's verbose output, if it's of any use:

  • After the first user query (note: 1694 prompt tokens)

    total duration:       10.855821416s
    load duration:        1.128ms
    prompt eval count:    1694 token(s)
    prompt eval duration: 3.374573s
    prompt eval rate:     501.99 tokens/s
    eval count:           319 token(s)
    eval duration:        7.470252s
    eval rate:            42.70 tokens/s
    
  • After the second user query, to which the model responds with garbage (note: 147 prompt tokens)

    total duration:       1.263779041s
    load duration:        3.331875ms
    prompt eval count:    147 token(s)
    prompt eval duration: 538.146ms
    prompt eval rate:     273.16 tokens/s
    eval count:           42 token(s)
    eval duration:        705.7ms
    eval rate:            59.52 tokens/s
    

@dictoon commented on GitHub (Feb 19, 2024):

If I truncate the first user query to 5000 characters (not tokens), then I get a correct answer to the second user query. So it looks like I'm hitting some kind of context window size limit? I'm far from the 4K context window, and in any case, assuming the window is sliding, there's plenty of context in the assistant's answer that immediately precedes the second user query.


@dictoon commented on GitHub (Feb 19, 2024):

Maybe related? PSA: You can (and may want to) disable Mixtral's Sliding Window! (https://www.reddit.com/r/LocalLLaMA/comments/18k0fek/psa_you_can_and_may_want_to_disable_mixtrals/)


@jmorganca commented on GitHub (Feb 20, 2024):

Hi @dictoon thanks for the issue. It seems you're hitting the context limit size. To increase it: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size
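
For reference, Ollama's default context window (num_ctx) has historically been 2048 tokens; the first exchange above already uses 1694 prompt tokens plus 319 generated tokens, so the follow-up question would push past that default and earlier messages get truncated. Below is a minimal sketch of the approach described in that FAQ entry, assuming a llama2 base model and a made-up derived model name (llama2-8k) chosen purely for illustration:

    # Modelfile -- raise the context window for a derived model (name is illustrative)
    FROM llama2
    PARAMETER num_ctx 8192

    # then build and run the derived model:
    #   ollama create llama2-8k -f Modelfile
    #   ollama run llama2-8k --verbose

The same num_ctx value can also be passed per request through the native API's options field, as discussed further down in this thread.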


@dictoon commented on GitHub (Feb 20, 2024):

Thanks @jmorganca.

I'm invoking Ollama through OpenAI's API in Python. Do you know if there's documentation on passing additional options such as context size?

I've tried this, but it doesn't work:

options = dict(num_ctx=4096)
response = self.client.chat.completions.create(
    model=Plugin.LLM_MODEL, messages=conversation, extra_body={"options": options})

@dictoon commented on GitHub (Feb 20, 2024):

Another thing I'm not clear about, and the reason why initially I didn't suspect that I was hitting the token limit:

The assistant's answer (the - Assistant: <outputs a summary> step in the conversation outlined in my initial post) should be well within the token window, shouldn't it? Unless for some reason only the user's prompts are sent to the model, which would be surprising and unlike how, e.g., ChatGPT works.


@dictoon commented on GitHub (Feb 20, 2024):

Two more questions:

  • I thought the context window was defined by the model and couldn't be changed. Do I understand correctly that in the case of talking to Ollama via OpenAI's API, somehow the context window is shrunk? For performance perhaps?

  • I had zero such problems when using Ollama's native Python API.

[Edit: correction, I now have the exact same problem using Ollama's native Python API. I didn't have any problem before updating Ollama on my machine.]


@dictoon commented on GitHub (Feb 20, 2024):

Using Ollama's native Python API, it looks like this works:

response = ollama.chat(
    model=Plugin.OLLAMA_MODEL,
    messages=conversation,
    options={
        "num_ctx": 4096,
    })

Would still appreciate answers to my previous questions, especially since I would love being able to use one API (OpenAI's) to talk to both GPT-4 and Ollama.

Thanks!


@PhilipAmadasun commented on GitHub (Feb 22, 2024):

@jmorganca @dictoon If I have a user input of context length 27000 and use options={"num_ctx": 4096}, what specifically would this do? Will the input be broken into batches of size 4096 and sent all at once, or one at a time, or something else?


@dictoon commented on GitHub (Feb 22, 2024):

The context window is what the model can "pay attention to" while generating new tokens, so as far as I know it's not possible to send the context in batches: that wouldn't change the fact that the model would only consider the previous 4096 tokens while generating new ones.
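
A conceptual illustration of that point (this is not Ollama's actual truncation logic, which may differ by version): with a fixed window, whatever falls outside the last num_ctx tokens is simply no longer visible to the model.

    # Conceptual sketch only: with a fixed window of num_ctx tokens,
    # anything earlier than the last num_ctx tokens falls out of view.
    def visible_context(all_tokens, num_ctx=4096):
        """Return the slice of tokens the model can still attend to."""
        return all_tokens[-num_ctx:]

    tokens = list(range(27_000))          # stand-in for a 27,000-token prompt
    print(len(visible_context(tokens)))   # 4096 -- everything before that is effectively gone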


@PhilipAmadasun commented on GitHub (Feb 23, 2024):

@dictoon Thank you for the reply. Just so I make sure I understand: let's say I'm using mistral, and mistral's max context (according to Google) is 8000 and its "attention span" (according to Google) is 128000. If I have a 27000-token user query, what exactly happens? If I set num_ctx: 4096, does mistral just grab the last 4096-token sequence from the 27K user query, and then process that 4096-token sequence along with the 128K window it grabs from the previously established overall context (in the case of the RESTful API, I'm talking about that body['context'] thing)?


@dictoon commented on GitHub (Feb 23, 2024):

@PhilipAmadasun Excellent question: sadly, I have no idea :)

I'm afraid that comments on this issue aren't going to be seen since the issue is closed. Perhaps you could post your question in a new issue (and link it here, because I'd love to follow)?


@PhilipAmadasun commented on GitHub (Feb 23, 2024):

@dictoon Sure! Here's the link: https://github.com/ollama/ollama/issues/2714


@gaardhus commented on GitHub (Aug 28, 2024):

> Thanks @jmorganca.
>
> I'm invoking Ollama through OpenAI's API in Python. Do you know if there's documentation on passing additional options such as context size?
>
> I've tried this, but it doesn't work:
>
>     options = dict(num_ctx=4096)
>     response = self.client.chat.completions.create(
>         model=Plugin.LLM_MODEL, messages=conversation, extra_body={"options": options})

This does not work, since the extra_body payload is appended to the rest of the body through an extra_json field rather than being merged directly into it.
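
One possible workaround, sketched here under the assumption that falling back to Ollama's native /api/chat endpoint is acceptable: that endpoint accepts an options object (including num_ctx) directly, unlike the OpenAI-compatible route discussed above. The model name and message content are placeholders.

    # Hedged workaround sketch: POST to Ollama's native chat endpoint with requests.
    import requests

    conversation = [
        {"role": "user", "content": "Here is some text, please summarize it. ..."},
    ]

    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama2",
            "messages": conversation,        # same role/content dicts as the OpenAI client
            "options": {"num_ctx": 4096},    # honoured by the native API
            "stream": False,
        },
    )
    print(response.json()["message"]["content"])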
