[GH-ISSUE #7807] newer ollama versions: chat is slower #67049

Closed
opened 2026-05-04 09:21:13 -05:00 by GiteaMirror · 7 comments

Originally created by @krmao on GitHub (Nov 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7807

What is the issue?

With the same code on the same machine:
Apple M2 Pro
macOS 15.1.1 (24B91)

```python
import time
import ollama

# model, final_chat_messages, and TOOLS are defined elsewhere in the caller's code
# len(final_chat_messages) = 6107
start_time = time.perf_counter()
ai_response = ollama.chat(model=model, messages=final_chat_messages, tools=TOOLS)
print(f'time after ollama.chat: {((time.perf_counter() - start_time) * 1000):.0f}ms')
```

After a lot of testing:

0.3.14 needs only ~1s (first call: 4580ms; repeat calls with the same question: 1073ms)

0.4.2 needs 20s+

0.4.3-0.4.4 need ~10s

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.4.2

GiteaMirror added the bug label 2026-05-04 09:21:13 -05:00

@jmorganca commented on GitHub (Nov 23, 2024):

@krmao thanks for the issue. May I ask what the prompt was (or how long it was)? This will help debug the performance issue.


@krmao commented on GitHub (Nov 23, 2024):

It can be found in the code comment `#len(final_chat_messages)= 6107`: about 6107+ characters of `system` role content (including the full and short names of 193 countries), plus one short question as the `user` role content (such as 'Please turn map to United States?').

  • the model is `dwightfoster03/functionary-small-v3.1:latest`
  • `TOOLS` has two functions; one function returns a country's full name, nickname, or latitude/longitude ranges (a sketch of what such a definition might look like is below)
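
For illustration only, a minimal sketch of what such a tool definition might look like, assuming the OpenAI-style function schema that the ollama Python client accepts; the function name, description, and parameters here are hypothetical reconstructions, not the author's actual code:

```python
# Hypothetical reconstruction of one of the two tools described above.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_country_info",  # hypothetical name
            "description": "Return a country's full name, nickname, and "
                           "latitude/longitude ranges.",
            "parameters": {
                "type": "object",
                "properties": {
                    "country": {
                        "type": "string",
                        "description": "Country name, full or short form",
                    },
                },
                "required": ["country"],
            },
        },
    },
]
```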

@krmao commented on GitHub (Nov 23, 2024):

I have a new finding: on 0.3.14 the first call takes 4580ms, but repeat calls with the same question take only 1073ms.

If I change to a different question, it takes 4493ms, so the timing seems unstable. I had not noticed this weeks ago.
Everything worked OK before I upgraded to the latest version; I forget which ollama version was the last good one.

I tested again, with the following results:

```
0.4.4    8228ms different questions, 8045ms same questions
0.3.14   4513ms different questions, 1089ms same questions
0.3.13   4545ms different questions
0.3.12   4511ms different questions, 1103ms same questions
```

@rick-github commented on GitHub (Nov 23, 2024):

Have you adjusted the size of the context window (`num_ctx`)? 0.4.3+ handles long messages differently from 0.3.* when they exceed the context window.
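
For reference, a minimal sketch of how the context window can be raised from the Python client via the `options` parameter; the model name is taken from this thread, and the value 16384 is an arbitrary example:

```python
import ollama

# Raise num_ctx so long prompts fit in the context window; the value is an
# arbitrary example and must fit in available memory.
response = ollama.chat(
    model="dwightfoster03/functionary-small-v3.1:latest",
    messages=[{"role": "user", "content": "hello"}],
    options={"num_ctx": 16384},
)
```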


@krmao commented on GitHub (Nov 25, 2024):

> Have you adjusted the size of the context window (`num_ctx`)? 0.4.3+ handles long messages differently from 0.3.* when they exceed the context window.

Yes, I have code that appends the question-and-response history to the messages.


@rick-github commented on GitHub (Nov 25, 2024):

If you could provide a complete script to demonstrate the problem you are seeing, it would make debugging much easier.
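
A minimal self-contained repro script along these lines might look as follows; the model name is taken from the thread, while the oversized system prompt is simulated with filler text rather than the author's real country list:

```python
import time
import ollama

MODEL = "dwightfoster03/functionary-small-v3.1:latest"

# Simulate a ~6000-character system prompt (stand-in for the country list).
system_prompt = "You map country names to coordinates. " + "x" * 6000

def timed_chat(question: str) -> float:
    """Send one chat request and return the elapsed time in ms."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    start = time.perf_counter()
    ollama.chat(model=MODEL, messages=messages)
    return (time.perf_counter() - start) * 1000

# Same question twice (may hit the prompt cache), then a different one.
for q in ["Please turn map to United States?",
          "Please turn map to United States?",
          "Please turn map to France?"]:
    print(f"{timed_chat(q):.0f}ms  {q}")
```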


@krmao commented on GitHub (Nov 25, 2024):

In the process of creating a test script, I discovered that the system prompt had been added twice, resulting in duplication. After removing one of them, each inference took less than 1 second, even for different questions.

This revealed the root cause of the issue: the doubled prompt string exceeded 10,000 characters. It also highlights the differences in how versions 0.3.x and 0.4.x handle very long prompts (a sketch of a guard against this kind of duplication is below).

Apologies for taking up everyone's time. The issue has now been resolved. Thank you all!
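
As an illustration of the fix, a small guard that prevents appending a second system message when building the history; this is a hypothetical sketch, not the author's actual code:

```python
def add_message(messages: list[dict], role: str, content: str) -> None:
    """Append a message, but never allow a second system prompt."""
    if role == "system" and any(m["role"] == "system" for m in messages):
        return  # system prompt already present; skip the duplicate
    messages.append({"role": role, "content": content})

final_chat_messages: list[dict] = []
add_message(final_chat_messages, "system", "You map country names to coordinates.")
add_message(final_chat_messages, "system", "You map country names to coordinates.")  # ignored
add_message(final_chat_messages, "user", "Please turn map to United States?")
assert sum(m["role"] == "system" for m in final_chat_messages) == 1
```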

Reference: github-starred/ollama#67049