[GH-ISSUE #11221] ollama has problems getting threads parameters through /proc/cpuinfo file under arm64 architecture #33153

Open
opened 2026-04-22 15:34:23 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @yewen024 on GitHub (Jun 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11221

What is the issue?

When I ran the deepseek-r1:32b model with ollama on a Kunpeng + NVIDIA T4 machine, inference was very slow and every CPU thread was fully occupied. Comparing it with an Intel + NVIDIA machine, I found that the runner parameter --threads was different: on Intel, only half of the CPU threads were used.
Reading the ollama source code, I found that the arm64 architecture is not taken into account when /proc/cpuinfo is read to collect CPU information. This is the CPU information structure definition:
type linuxCpuInfo struct {
	ID         string `cpuinfo:"processor"`
	VendorID   string `cpuinfo:"vendor_id"`
	ModelName  string `cpuinfo:"model name"`
	PhysicalID string `cpuinfo:"physical id"`
	Siblings   string `cpuinfo:"siblings"`
	CoreID     string `cpuinfo:"core id"`
}
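However, /proc/cpuinfo has different fields on x86 and arm64: the arm64 output shown further down has no vendor_id, physical id, siblings, or core id entries.
These are the resulting CPU statistics: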
Kunpeng [{ID: VendorID: ModelName:Kunpeng-920 CoreCount:256 EfficiencyCoreCount:0 ThreadCount:256}]
Intel [{ID:0 VendorID:GenuineIntel ModelName:Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz CoreCount:14 EfficiencyCoreCount:0 ThreadCount:28} {ID:1 VendorID:GenuineIntel ModelName:Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz CoreCount:14 EfficiencyCoreCount:0 ThreadCount:28}]

x86:
processor : 55
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
stepping : 2
microcode : 0x43
cpu MHz : 3300.000
cache size : 35840 KB
physical id : 1
siblings : 28
core id : 14
cpu cores : 14
apicid : 61
initial apicid : 61
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts vnmi md_clear flush_l1d
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_stale_data
bogomips : 4599.92
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

arm64:
processor : 255
model name : Kunpeng-920
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint svei8mm svef32mm svef64mm svebf16 i8mm bf16 dgh rng bti ecv
CPU implementer : 0x48
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd02
CPU revision : 0

Currently I work around this problem by adding the num_thread parameter to the Modelfile and recreating the model.

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 15:34:23 -05:00

@PrincessKhanNZ commented on GitHub (Jun 30, 2025):

I agree. Performance on ARM64 has slowed to a crawl. Some previous versions had this bug, and it was then fixed. But now it's happening again with this new release.


Reference: github-starred/ollama#33153