[GH-ISSUE #12504] Prompt evaluation is MUCH slower using the new Ollama engine #34061

Closed
opened 2026-04-22 17:17:34 -05:00 by GiteaMirror · 1 comment

Originally created by @Maltz42 on GitHub (Oct 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12504

What is the issue?

Beginning with v0.12.2, when Qwen3 started using the new Ollama engine, prompt evaluation performance has plummeted. I suspect the new engine is doing prompt evaluation on the CPU instead of the GPU, whereas the old engine left space for the context window in VRAM. Here are my observations (a sketch for reproducing them follows this list):

  • VRAM is nearly completely full when a model is first loaded, regardless of context window size, whereas in the old engine a larger context window would leave more VRAM unused at the start.
  • System RAM usage now grows with the context window size, which was not the case in the old engine.
  • Only one CPU core was heavily used during prompt evaluation in the old engine, while the new engine maxes out all cores at 100%.
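
These observations can be reproduced with standard tools while a prompt is being evaluated; a minimal sketch, assuming nvidia-smi and a stock Linux userland:

```shell
# Watch per-GPU VRAM usage once per second while the model loads and
# the prompt is evaluated.
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# In a second terminal, watch system RAM grow with the context window.
watch -n 1 free -h

# top (press 1 for the per-core view) shows whether one core or all
# cores are saturated during prompt evaluation.
top
```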

Here is --verbose output using both the old engine and the new Ollama engine. This is qwen3:235b-a22b-instruct-2507-q8_0 with a 24k context window (full), split across three GPUs (192GB VRAM total) and spilling over into system RAM a bit. Inference speed does increase, since more of the model is in VRAM, but pushing prompt evaluation onto the CPU makes performance far, far worse overall: from about 4 minutes to 23 minutes between the time the prompt is entered and the time the reply is complete.
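
For reference, the timings below come from ollama run with the --verbose flag, which prints these statistics after each response; piping the prompt from a file is one way to drive it non-interactively (prompt.txt is a placeholder):

```shell
# Print timing statistics (total duration, prompt eval rate, etc.)
# after the response completes. prompt.txt stands in for the
# ~24k-token prompt used in the runs below.
ollama run qwen3:235b-a22b-instruct-2507-q8_0 --verbose < prompt.txt
```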

Relevant log output

```shell
0.12.4rc5:
total duration:       23m5.891375436s
load duration:        54.853551ms
prompt eval count:    24490 token(s)
prompt eval duration: 22m13.003072681s
prompt eval rate:     18.37 tokens/s
eval count:           248 token(s)
eval duration:        46.448428687s
eval rate:            5.34 tokens/s

0.12.1:
total duration:       4m18.250874851s
load duration:        51.181565ms
prompt eval count:    24483 token(s)
prompt eval duration: 2m44.736125262s
prompt eval rate:     148.62 tokens/s
eval count:           266 token(s)
eval duration:        1m17.38724237s
eval rate:            3.44 tokens/s
```
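
The same comparison can be driven through the HTTP API with the context window pinned explicitly; a minimal sketch, assuming the default server address and a placeholder prompt:

```shell
# POST to the generate endpoint with num_ctx pinned to 24k.
# The JSON response reports prompt_eval_count and prompt_eval_duration
# (in nanoseconds), corresponding to the --verbose figures above.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:235b-a22b-instruct-2507-q8_0",
  "prompt": "...a ~24k-token prompt goes here...",
  "stream": false,
  "options": { "num_ctx": 24576 }
}'
```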

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.12.4rc5

GiteaMirror added the bug label 2026-04-22 17:17:34 -05:00

@Maltz42 commented on GitHub (Oct 5, 2025):

As a workaround, is there a way to force the use of the old engine?
