[GH-ISSUE #9306] Ollama performing better with fewer threads #6071

Open
opened 2026-04-12 17:24:03 -05:00 by GiteaMirror · 0 comments

Originally created by @AncientMystic on GitHub (Feb 24, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9306

I benchmarked my RAM with this rambenchmark tool:

![rambenchmark results](https://github.com/user-attachments/assets/9d540967-7ec7-4d01-8483-23566a82ca15)

https://github.com/rsusik/rambenchmark

I noticed that 7 threads hit higher bandwidth than any other count, so I switched Ollama to 7 threads instead of 16 (7 is one below the physical core count).

Results:
16 threads = 0.83 t/s
7 threads = 6.56 t/s
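
For anyone who wants to reproduce this: the thread count can be set per request with the `num_thread` option in Ollama's HTTP API, or persistently with `PARAMETER num_thread 7` in a Modelfile. A minimal sketch, assuming a stock install listening on localhost:11434 (the model tag is just an example; substitute whatever you have pulled):

```python
import json
import urllib.request

# Run one generation with an explicit thread count and report t/s.
# num_thread=7 matches the value that benchmarked best here
# (one below the physical core count of the i7-7820X).
payload = {
    "model": "phi4",            # example tag; substitute your own
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"num_thread": 7},
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

# eval_duration is reported in nanoseconds.
tps = body["eval_count"] / (body["eval_duration"] / 1e9)
print(f"{tps:.2f} t/s")
```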

Maybe this will help someone else too. I was wondering why my token rate was so low, and there it is: fewer threads and lower CPU usage gave roughly an 8x improvement (0.83 → 6.56 t/s). I guess setting the full thread count over-allocates the CPU for a workload that is really bound by memory bandwidth.
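
That interpretation fits a memory-bandwidth-bound workload: once the memory bus is saturated, extra SMT threads just add contention. Below is a rough, illustrative sketch of the same effect (not rambenchmark itself; the buffer size and thread counts are arbitrary). NumPy releases the GIL during reductions, so plain Python threads are enough to load the memory bus:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# 256 MB of float64 per worker, so each sum streams through RAM
# instead of fitting in cache. Needs ~4 GB free at 16 threads.
N = 32_000_000

def stream_sum(buf: np.ndarray) -> float:
    # Bandwidth-bound reduction; NumPy releases the GIL here.
    return float(buf.sum())

for threads in (1, 2, 4, 7, 8, 16):
    bufs = [np.ones(N) for _ in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        start = time.perf_counter()
        list(pool.map(stream_sum, bufs))
        elapsed = time.perf_counter() - start
    gbps = threads * N * 8 / 1e9 / elapsed
    print(f"{threads:2d} threads: {gbps:6.1f} GB/s")
```

If your machine behaves like mine, aggregate bandwidth should plateau (or dip) well before the SMT thread count.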

This was specifically testing phi4 14B Q4_K_M (8.4 GB), which consumed 10 GB of RAM and offloaded 36% to the CPU.

Testing other models that exceed VRAM:

| Model | File size | RAM/VRAM used | CPU/GPU split | Speed |
| --- | --- | --- | --- | --- |
| M-MOE-4X7B-Dark-MultiVerse-UC-E32-24B-D_AU-max-cpu-IQ4_XS | 12.5 GB | 15 GB | 58% / 42% | 2.7 t/s |
| DeepSeek-R1-Distill-Qwen-14B-IQ4_XS | 7.6 GB | 10 GB | 38% / 62% | 4.93 t/s |
| DeepSeek-R1-Distill-Llama-70B-abliterated.IQ4_XS | 35.6 GB | 41 GB | 84% / 16% | 0.93 t/s |

(I didn't even dare load a 70B before, since I was dipping to 0.x t/s on 14-24B models.)

Testing models that fit entirely into VRAM, the thread count has no measurable effect on token speed either way, although Ollama feels a bit snappier and seems to load models into RAM/VRAM faster.
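
For reference, the CPU/GPU splits quoted above can be read programmatically: recent Ollama builds expose a `/api/ps` endpoint that reports each loaded model's total size and the portion resident in VRAM (`size` and `size_vram`). A small sketch, again assuming a default install on localhost:11434:

```python
import json
import urllib.request

# List the currently loaded models and derive each one's CPU/GPU
# split from its total size vs. the portion held in VRAM.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.loads(resp.read())

for m in data.get("models", []):
    size, vram = m["size"], m["size_vram"]
    gpu = 100 * vram / size if size else 0
    print(f"{m['name']}: {100 - gpu:.0f}% CPU / {gpu:.0f}% GPU")
```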

Specs:
OS: Ubuntu 22.04 (kernel 6.0.0) in a VM running on Proxmox 8.3.4
CPU: i7-7820X (8c/16t)
RAM: 96 GB (8 sticks), 42 GB allocated to the VM
GPU: Nvidia Tesla P4 8 GB with a 7 GB vGPU profile in the VM
Ollama: 0.5.11 (flash attention enabled, KV cache = q8_0)


Reference: github-starred/ollama#6071