[GH-ISSUE #10022] CPU inference much slower than expected #32332

Open
opened 2026-04-22 13:30:00 -05:00 by GiteaMirror · 5 comments

Originally created by @rick-github on GitHub (Mar 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10022

Originally assigned to: @dhiltgen on GitHub.

Hey all.
I think this issue is the one most related to what I'm experiencing with gemma3:27b.

When running the model entirely on CPU, I'm unable to get any response at all. If I wait long enough in the UI, I get tokens, but it's EXTREMELY slow. My machine also goes flat out for seemingly little output. See the attached screenshot. I have tried turning off quantisation, flash attention, etc., but this has no effect.
Cheers.

![Image](https://github.com/user-attachments/assets/7e1e82a8-c1e0-4fd0-a1ba-38ca70e5dbe6)

_Originally posted by @Johnno1011 in [#9857](https://github.com/ollama/ollama/issues/9857#issuecomment-2758147993)_
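The report above mentions toggling quantisation and flash attention with no effect; when comparing such settings, it helps to pin them explicitly per request rather than relying on UI defaults. A minimal sketch against Ollama's documented `/api/generate` endpoint (the host, prompt, and option values are illustrative, not taken from this issue):

```python
# Pin CPU-relevant options on a single non-streaming request and report
# decode speed. Flash attention is a server-side setting, toggled via the
# OLLAMA_FLASH_ATTENTION environment variable before starting the server.
import json
import urllib.request

payload = {
    "model": "gemma3:27b",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {
        "num_gpu": 0,      # force CPU-only inference
        "num_thread": 16,  # generation threads; tune per machine
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# eval_count tokens over eval_duration nanoseconds -> tokens per second.
print(f"{body['eval_count'] / body['eval_duration'] * 1e9:.2f} tok/s")
```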

GiteaMirror added the performance label 2026-04-22 13:30:00 -05:00

@rick-github commented on GitHub (Mar 27, 2025):

@Johnno1011 please follow up here.


@ALLMI78 commented on GitHub (Mar 27, 2025):

"Think this issue is the most related to what I'm experiencing with gemma3:27b."

Does it make sense to test with a different model, for example qwen32b, for comparison?

I've also noticed something with how gemma3 runs. I don't think it's running smoothly yet. I was not able to run the 12b gemma3 on my 4060 Ti 16 GB before; since the new v0.6.3 pre-release, gemma3 runs quite fast on my GPU, but some strange things are happening. RAM/VRAM usage increases during inference, and while it's released afterward, I experienced a memory overflow after a few hours of runtime. Unfortunately, I haven't had the time to test this in detail or document it for you yet. I'm not sure if there are really any issues, so maybe testing with models we know run smoothly could be helpful?
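The memory growth described here could be documented by polling Ollama's documented `/api/ps` endpoint during a long run; a minimal monitoring sketch (the host and 60-second interval are arbitrary choices):

```python
# Log per-model memory use over time to capture the reported RAM/VRAM
# growth. /api/ps lists currently loaded models with their memory sizes.
import json
import time
import urllib.request

while True:
    with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
        models = json.load(resp).get("models", [])
    for m in models:
        # size is total resident bytes; size_vram is the GPU-resident share.
        print(time.strftime("%H:%M:%S"), m["name"],
              f"total={m['size'] / 2**30:.1f}GiB",
              f"vram={m['size_vram'] / 2**30:.1f}GiB")
    time.sleep(60)
```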


@Johnno1011 commented on GitHub (Mar 28, 2025):

Thanks for the discussion, guys, this has been super useful! I've played around with num_thread for a bit and produced this plot for you all to refer to; hopefully someone else finds it beneficial. It was interesting to see that setting it to 64 threads (the maximum) drastically reduced the tokens/sec. Cheers.

![Image](https://github.com/user-attachments/assets/11ff80aa-40b4-4810-85d0-10b35c902ff1)
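The drop at 64 threads is consistent with oversubscribing SMT siblings; llama.cpp-based runners typically default to the physical core count for this reason. For reference, a sketch of the kind of sweep that produces a plot like this, computing tokens/sec from the `eval_count` and `eval_duration` fields returned by `/api/generate` (the model, prompt, and thread counts are illustrative):

```python
# Sweep num_thread, run one fixed prompt per setting, and compute decode
# tokens/sec from the fields the API returns (eval_duration is nanoseconds).
import json
import urllib.request

URL = "http://localhost:11434/api/generate"  # assumed default Ollama port

def bench(threads: int) -> float:
    payload = {
        "model": "gemma3:27b",
        "prompt": "Write a haiku about CPUs.",
        "stream": False,
        "options": {"num_gpu": 0, "num_thread": threads},
    }
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["eval_count"] / body["eval_duration"] * 1e9

for t in (4, 8, 16, 32, 64):
    print(f"num_thread={t}: {bench(t):.2f} tok/s")
```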


@sol8712 commented on GitHub (Mar 30, 2025):

Since 0.6.3, my CPU-based use of gemma3 has been slower to start streaming text back (Raspberry Pi 5 with gemma3:1b).
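A slow start on CPU usually shows up as a long time to first token; one way to quantify a regression between versions is to time the first streamed chunk. A minimal probe, assuming the default Ollama port (the model and prompt are illustrative):

```python
# Time to first streamed token from /api/generate; note this includes
# model load time if the model is cold, which matters for slow-start reports.
import json
import time
import urllib.request

payload = {"model": "gemma3:1b", "prompt": "Hello", "stream": True}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
start = time.monotonic()
with urllib.request.urlopen(req) as resp:
    first = json.loads(resp.readline())  # first newline-delimited JSON chunk
print(f"time to first token: {time.monotonic() - start:.2f}s "
      f"(first chunk: {first.get('response', '')!r})")
```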


@rick-github commented on GitHub (Mar 30, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

Reference: github-starred/ollama#32332