[GH-ISSUE #10680] tried the new 0.7 and my t/s speed dropped from 2.9 to 0.5 on a 24b model #7019

Closed
opened 2026-04-12 18:54:54 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @AncientMystic on GitHub (May 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10680

What is the issue?

I just tried the new Ollama 0.7. I had been playing around with a 24B model (14 GB) that loads 75% CPU / 25% GPU on an Ubuntu install with 4 GB of VRAM via a vGPU profile.

On 0.6.8 I was seeing 2.5-2.9+ t/s, but on 0.7 it dropped to a maximum of 0.5 t/s (I tried generating 10 responses), generating very, very slowly. I went back to 0.6.8 and I am back to 2.9 t/s, so for some reason Ollama is 5-6x slower in the latest pre-release.
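A quick sanity check on the "5-6x" figure, using only the rates reported above:

```python
# Slowdown factor from the reported token rates: 2.9 t/s on 0.6.8 vs 0.5 t/s on 0.7.
old_tps = 2.9  # tokens/second on 0.6.8
new_tps = 0.5  # tokens/second on 0.7
slowdown = old_tps / new_tps
print(f"{slowdown:.1f}x slower")  # 5.8x, consistent with the "5-6x" estimate
```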

Hardware:

- CPU: i7-7820X
- GPU: Tesla P4 8GB using a 4GB vGPU profile
- Driver version: 550.144
- CUDA version: 12.4
- Compute capability: 6.1

OS

Linux

GPU

Nvidia

CPU

Intel

GiteaMirror added the bug label 2026-04-12 18:54:54 -05:00

@rick-github commented on GitHub (May 13, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) from both versions may aid in debugging.


@rick-github commented on GitHub (May 13, 2025):

Can confirm that models that spill to system RAM take a significant performance hit. Affects both old and new engines.

```console
$ for m in qwen2.5:14b-instruct-q4_K_M phi4-reasoning:14b-q4_K_M gemma3:12b-it-q4_K_M qwen3:14b-q4_K_M ; do
    for v in 0.6.8 0.7.0-rc0 ; do
      OLLAMA_DEBUG=1 OLLAMA_DOCKER_TAG=$v OLLAMA_KEEP_ALIVE=-1 OLLAMA_NUM_PARALLEL=1 \
        docker compose up -d ollama 2>&- >&-
      sleep 2
      for c in 4096 65536 ; do
        echo $v $m $c
        curl -s localhost:11434/api/generate \
          -d '{"model":"'$m'","prompt":"hello","stream":false,"options":{"num_ctx":'$c'}}' |
          jq -r '" prompt tps: \(.prompt_eval_count/(.prompt_eval_duration/1000000000))",
                 " eval tps  : \(.eval_count/(.eval_duration/1000000000))"'
        ollama ps | grep $m
        echo
      done
    done
  done
0.6.8 qwen2.5:14b-instruct-q4_K_M 4096
 prompt tps: 228.23505444398873
 eval tps  : 40.94157748537369
qwen2.5:14b-instruct-q4_K_M    7cdf5a0187d5    10 GB    100% GPU     Forever    

0.6.8 qwen2.5:14b-instruct-q4_K_M 65536
 prompt tps: 22.840987952794233
 eval tps  : 7.886128381153751
qwen2.5:14b-instruct-q4_K_M    7cdf5a0187d5    29 GB    44%/56% CPU/GPU    Forever    

0.7.0-rc0 qwen2.5:14b-instruct-q4_K_M 4096
 prompt tps: 220.25571394714981
 eval tps  : 40.93492084414357
qwen2.5:14b-instruct-q4_K_M    7cdf5a0187d5    10 GB    100% GPU     Forever    

0.7.0-rc0 qwen2.5:14b-instruct-q4_K_M 65536
 prompt tps: 1.7845719102765394
 eval tps  : 1.715184921837344
qwen2.5:14b-instruct-q4_K_M    7cdf5a0187d5    29 GB    44%/56% CPU/GPU    Forever    

...
```
| model | engine | 0.6.8-gpu | 0.7.0-gpu | 0.6.8-hybrid | 0.7.0-hybrid | 0.6.8-cpu | 0.7.0-cpu |
| -- | -- | -- | -- | -- | -- | -- | -- |
| qwen2.5:14b-instruct-q4_K_M | old | 40.9 | 40.9 | 7.8 | 1.7 | 6.5 | 1.1 |
| phi4-reasoning:14b-q4_K_M | old | 33.1 | 33.1 | 5.1 | 1.1 | 5.2 | 0.9 |
| qwen3:14b-q4_K_M | old | 40.8 | 40.8 | 6.6 | 1.3 | 6.1 | 0.9 |
| gemma3:12b-it-q4_K_M | new | 45.8 | 45.6 | 25.7 | 10.5 | 7.6 | 1.1 |
| gemma2:9b-instruct-q4_K_M | new | 59.1 | 58.0 | 11.5 | 2.5 | 9.7 | 1.6 |
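The per-token rates in the table are derived from the `prompt_eval_count`/`prompt_eval_duration` and `eval_count`/`eval_duration` fields of the `/api/generate` response, where durations are reported in nanoseconds. A minimal Python sketch of the same calculation the `jq` expression performs (the sample response values below are illustrative, not measured):

```python
# Compute tokens/second from an Ollama /api/generate response.
# Durations in the response are nanoseconds; counts are token counts.

def tps(count: int, duration_ns: int) -> float:
    """Tokens per second, mirroring the jq expression in the benchmark loop."""
    return count / (duration_ns / 1_000_000_000)

# Illustrative values only, standing in for a real response body.
response = {
    "prompt_eval_count": 26,
    "prompt_eval_duration": 130_000_000,   # 0.13 s
    "eval_count": 120,
    "eval_duration": 3_000_000_000,        # 3 s
}

prompt_tps = tps(response["prompt_eval_count"], response["prompt_eval_duration"])
eval_tps = tps(response["eval_count"], response["eval_duration"])
print(f" prompt tps: {prompt_tps}")
print(f" eval tps  : {eval_tps}")
```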

@AncientMystic commented on GitHub (May 13, 2025):

Thank you for adding all of this; it is very well summarized.


@rick-github commented on GitHub (May 14, 2025):

Looks like this is fixed in 0.7.0-rc1.

| model | 0.6.8-cpu | 0.7.0-rc0-cpu | 0.7.0-rc1-cpu |
| -- | -- | -- | -- |
| qwen2.5:14b-instruct-q4_K_M | 7.1 | 1.1 | 6.5 |
| phi4-reasoning:14b-q4_K_M | 5.1 | 0.9 | 5.1 |
| qwen3:14b-q4_K_M | 6.2 | 0.9 | 6.1 |
| gemma3:12b-it-q4_K_M | 7.6 | 1.1 | 7.6 |
| gemma2:9b-instruct-q4_K_M | 9.4 | 1.6 | 9.6 |