[GH-ISSUE #4902] Performance issue with CPU-only inference starting at 0.1.39 through the latest version to date #65133

Closed
opened 2026-05-03 19:50:00 -05:00 by GiteaMirror · 7 comments

Originally created by @raymond-infinitecode on GitHub (Jun 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4902

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I was previously running Ollama on a 32-core Intel Xeon system (CPU only) and getting a high token generation rate with version 0.1.38.
However, once I migrated to the latest Ollama version, 0.1.41, I found that inference on pure CPU slowed to a crawl, even for a model like phi3.
Retesting earlier releases, I can reproduce the slowness starting with 0.1.39.

I am unable to provide log details, as there is no error, just pure slowness.
I didn't change any model or configuration.
Reverting to version 0.1.38 brings the performance back to high speed.

Using CentOS 8 Linux
Xeon Gold processor, 32 cores
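A minimal way to quantify the regression, assuming the ollama CLI's standard `--verbose` flag (it prints timing statistics after each response):

```bash
# Run the same prompt on 0.1.38 and on 0.1.39+ and compare the
# reported "eval rate" (tokens/s) printed after the response.
ollama run phi3 --verbose
```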

OS

Linux

GPU

Other

CPU

Intel

Ollama version

0.1.39

GiteaMirror added the performance and bug labels 2026-05-03 19:50:00 -05:00

@chrispaulm commented on GitHub (Jun 13, 2024):

I can confirm this issue on an HPE DL360 Gen10 with dual Xeon Gold 6138 CPUs (2 × 20 cores, i.e. 40 physical cores / 80 threads total). All versions of Ollama after 0.1.38 are extremely slow when running only on CPU.

(OS: Debian Linux, GPU: none, CPU: Intel, Ollama versions v0.1.39 through v0.1.43)


@pdevine commented on GitHub (Jun 14, 2024):

Can you post the results of `ollama ps`?


@chrispaulm commented on GitHub (Jun 14, 2024):

> Can you post the results of `ollama ps`?

Sure.

With 0.1.38:

NAME          ID              SIZE      PROCESSOR    UNTIL
aya:latest    7ef8c4942023    5.7 GB    100% CPU     4 minutes from now

With 0.1.44:

NAME          ID              SIZE      PROCESSOR    UNTIL
aya:latest    7ef8c4942023    5.7 GB    100% CPU     4 minutes from now

I also ran a short test on 0.1.44, setting num_thread to 40 and then to 80. With the threads set to 80 it gets very slow; with 40 it is fast. So the issue seems connected to the handling of cores and hyperthreading after version 0.1.38? (A sketch of how to pin the thread count follows after the link below.)

See: https://github.com/ollama/ollama/issues/2496#issuecomment-2151408867
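For anyone applying the workaround, a minimal sketch of pinning num_thread to the physical core count (40 on this machine; adjust for yours). Both mechanisms are standard Ollama options:

```bash
# Inside an interactive `ollama run` session:
#   /set parameter num_thread 40
# Or per request via the REST API:
curl http://localhost:11434/api/generate -d '{
  "model": "aya",
  "prompt": "Why is the sky blue?",
  "options": { "num_thread": 40 }
}'
```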


@pdevine commented on GitHub (Jun 16, 2024):

cc @dhiltgen


@dhiltgen commented on GitHub (Jun 18, 2024):

It does sound like this is likely a dup of #2496.

If you compare the server logs between 0.1.38 and the latest version, look for a line that looks something like this:

INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140660218521472" timestamp=1718729369 total_threads=8

If the thread settings are lower in the newer version, and forcing them to match the old version restores the expected performance, then I think we can dup this to #2496.
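One hedged way to pull that line from each version's logs for comparison (assuming a standard Linux install where Ollama runs as a systemd service; adjust the log source for your setup):

```bash
# Compare n_threads and total_threads between 0.1.38 and the newer build.
journalctl -u ollama --no-pager | grep "system info"
```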


@Lofanmi commented on GitHub (Jun 20, 2024):

I have also encountered the same issue, starting with version v0.1.39-rc1, but I can confirm that it is not a problem with the Go code; it is caused by the update of the llama.cpp submodule.

https://github.com/ollama/ollama/compare/v0.1.38...v0.1.39-rc1

I tried setting up a mixed development environment:

git checkout v0.1.38
go generate ./...   # builds the llama.cpp runners from the v0.1.38 submodule
git checkout v0.1.44
go build            # builds the v0.1.44 Go binary against those old runners

> I suspect this compiled the Go binary for v0.1.44 but used the llama.cpp from v0.1.38 (if not, then my guess is wrong).

The resulting build is extremely fast (with v0.1.39-rc1 it is very slow), so we can rule out the Go code.

However, there are many changes in llama.cpp, making it difficult to pinpoint the specific cause:

https://github.com/ggerganov/llama.cpp/compare/952d03dbead16e4dbdd1d3458486340673cc2465...614d3b914e1c3e02596f869649eb4f1d3b68614d
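To narrow down that range, a minimal git-bisect sketch (hypothetical: it assumes a local llama.cpp clone and some CPU benchmark to run at each step; the short hashes are the endpoints of the compare link above):

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git bisect start 614d3b914e 952d03dbea   # <bad> <good>
# At each step: build, run the CPU benchmark, then mark the commit
#   git bisect good    # fast, like v0.1.38
#   git bisect bad     # slow, like v0.1.39-rc1
# until git reports the first bad commit.
```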

OS: Manjaro Linux
GPU: none
CPU: Intel E5-2696 v3 × 2 + 128 GB DDR4 ECC
Model: ollama run qwen:0.5b


@dhiltgen commented on GitHub (Jul 3, 2024):

We'll track the incorrect default thread setting in #2496
