[GH-ISSUE #7740] Performance is decreasing #4942

Closed
opened 2026-04-12 16:00:13 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @murzein on GitHub (Nov 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7740

What is the issue?

Load any model, for example gemma2:27b
OLLAMA_KEEP_ALIVE: -1

command: ollama run gemma2:27b --verbose
message: tell me about amd

1 iteration
total duration: 12.9263992s
load duration: 32.8561ms
prompt eval count: 13 token(s) <----
prompt eval duration: 73ms
prompt eval rate: 178.08 tokens/s
eval count: 480 token(s)
eval duration: 12.81s
eval rate: 37.47 tokens/s <----

2 iteration
prompt eval count: 506 token(s)
eval rate: 36.64 tokens/s

3 iteration
prompt eval count: 882 token(s)
eval rate: 36.05 tokens/s

4 iteration
prompt eval count: 1244 token(s)
eval rate: 35.81 tokens/s

5 iteration
prompt eval count: 1584 token(s)
eval rate: 35.30 tokens/s

6 iteration
prompt eval count: 1860 token(s)
eval rate: 17.04 tokens/s

Once the prompt context has grown this large, the eval rate drops to 17.04 tokens/s.

Restart the session (ctrl+z) and run again:
command: ollama run gemma2:27b --verbose
message: tell me about amd

1 iteration
prompt eval count: 13 token(s)
eval rate: 18.01 tokens/s

The prompt is small, but the speed is still low.

I am experiencing the same issue when accessing Ollama via the API.
The context appears to be overflowing somewhere.
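The per-response metrics shown in the verbose output are also returned by the API, so the slowdown can be measured programmatically. A minimal sketch, assuming a local Ollama server on the default port 11434 and the standard non-streaming `/api/generate` response fields `eval_count` and `eval_duration` (the latter in nanoseconds):

```python
import json
import urllib.request


def eval_rate(resp: dict) -> float:
    """Tokens per second, from eval_count and eval_duration (nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> dict:
    """One non-streaming /api/generate call; keep_alive=-1 matches the repro above."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": -1,
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)
```

Calling `generate("gemma2:27b", "tell me about amd")` repeatedly and printing `eval_rate(...)` for each response should reproduce the degradation in the logs above; for the first iteration, `eval_rate` on `eval_count: 480` and `eval_duration: 12.81s` yields the logged 37.47 tokens/s.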

OS

Linux, Windows

GPU

Nvidia, AMD

CPU

Intel

Ollama version

0.4.2

GiteaMirror added the bug label 2026-04-12 16:00:13 -05:00

@rick-github commented on GitHub (Nov 19, 2024):

https://github.com/ollama/ollama/issues/7717


@jessegross commented on GitHub (Nov 19, 2024):

I agree that this is likely the same as #7717 so it is best to track it there. However, @murzein I think we have a fix in main if you are able to build and try it out.

Reference: github-starred/ollama#4942