[GH-ISSUE #6901] High CPU and slow token generation #4365

Closed
opened 2026-04-12 15:18:18 -05:00 by GiteaMirror · 1 comment

Originally created by @maco6096 on GitHub (Sep 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6901

What is the issue?

I have 32 cores and 64 GB of memory, and the CPU supports AVX2. When I run qwen1.5-7B-chat.gguf, CPU load reaches 3000% and token generation is very slow.

This is my CPU configuration:
(base) [app@T-LSM-1 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 32
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Stepping: 2
CPU MHz: 2297.339
BogoMIPS: 4594.67
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-31

Top:
top - 11:29:21 up 21 days, 19:09, 2 users, load average: 28.07, 16.55, 7.01
Tasks: 528 total, 3 running, 525 sleeping, 0 stopped, 0 zombie
%Cpu(s): 88.3 us, 0.4 sy, 0.0 ni, 1.8 id, 0.0 wa, 9.2 hi, 0.2 si, 0.0 st
MiB Mem : 63857.0 total, 2803.0 free, 2671.4 used, 58382.7 buff/cache
MiB Swap: 8192.0 total, 8185.2 free, 6.8 used. 60476.7 avail Mem

PID    USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
163803 root      20   0 2488468   1.1g   7668 R  3025   1.8   1679:15 ollama_llama_se
1919   root      20   0 1121684 518260  21516 S   6.6   0.8 224:03.88 ds_agent

Is there a problem with my setup?

Can anyone help me?

Thank you!

OS

Linux

GPU

No response

CPU

Intel

Ollama version

0.3.11

GiteaMirror added the question label 2026-04-12 15:18:18 -05:00

@rick-github commented on GitHub (Sep 21, 2024):

top - 11:29:21 up 21 days, 19:09, 2 users, load average: 28.07, 16.55, 7.01
Tasks: 528 total, 3 running, 525 sleeping, 0 stopped, 0 zombie
%Cpu(s): 88.3 us, 0.4 sy, 0.0 ni, 1.8 id, 0.0 wa, 9.2 hi, 0.2 si, 0.0 st
MiB Mem : 63857.0 total, 2803.0 free, 2671.4 used, 58382.7 buff/cache
MiB Swap: 8192.0 total, 8185.2 free, 6.8 used. 60476.7 avail Mem
PID    USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
163803 root      20   0 2488468   1.1g   7668 R  3025   1.8   1679:15 ollama_llama_se
1919   root      20   0 1121684 518260  21516 S   6.6   0.8 224:03.88 ds_agent                                                               

You don't have a GPU, so inference is done on the CPU. CPU is slower than GPU, so token generation is slow. Without upgrading the hardware, there are limited options for increasing the rate of token generation. You can try a smaller model, which will run faster, but the results may be poorer. You can try a different quant of the model - you don't indicate which one you are using, but perhaps Q2_K or Q3_K might suit your needs. You can also try increasing the number of threads that ollama uses:

$ ollama run qwen2.5:7b-instruct-q4_K_M
>>> /set parameter num_thread 32
Set parameter 'num_thread' to '32'
>>> hello
Hello! How can I assist you today?

>>> Send a message (/? for help)

This will increase CPU load but may increase speed of token generation.
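If you want the thread setting to persist across sessions rather than re-entering it each time, one option is to bake it into a derived model with a Modelfile. This is a sketch: the model tag and the name `qwen-fast` are illustrative, and `num_thread` is the same parameter set interactively above.

```
# Modelfile (sketch): derive a model with a fixed thread count
FROM qwen2.5:7b-instruct-q4_K_M
PARAMETER num_thread 32
```

Then create and run it with `ollama create qwen-fast -f Modelfile` followed by `ollama run qwen-fast`. The same option can also be passed per request through the REST API via the `options` field, e.g. `"options": {"num_thread": 32}`.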

Reference: github-starred/ollama#4365