[GH-ISSUE #2993] Ollama only runs off CPU in Ubuntu #63875

Closed
opened 2026-05-03 15:17:34 -05:00 by GiteaMirror · 3 comments

Originally created by @uansah on GitHub (Mar 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2993

I am trying to run Dolphin Mistral on CPU and RAM only (no GPU). I have 188 GB of RAM available, but Ollama pins the CPU at 100% of its capacity. Before this I was running it on a virtual machine with the same OS, and there it used RAM.
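For anyone hitting the same symptom, a minimal sketch for capping how many CPU threads one request uses, via the `num_thread` option of the generate API. It assumes a default local install listening on localhost:11434 and a model that is already pulled; the model name and thread count here are illustrative, not a recommendation from the maintainers:

```python
import requests

# Ask the local Ollama server to generate with a capped thread count,
# so a single request does not saturate every core.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "dolphin-mistral",              # illustrative model name
        "prompt": "what is the meaning of life",
        "stream": False,
        "options": {"num_thread": 8},            # limit CPU threads used for inference
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```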


@ArchitW commented on GitHub (Mar 8, 2024):

2nd to this.

neofetch

screenshot: https://github.com/ollama/ollama/assets/4264602/1ff547f8-fca5-4648-8874-bd07b6a9eb52

LLM: tinyllama:latest
prompt: what is the meaning of life

top
screenshot: https://github.com/ollama/ollama/assets/4264602/3564d946-6a9b-45a6-b9a7-a38952023932


@aosan commented on GitHub (Mar 11, 2024):

CPU would be the biggest performance limitation, even if the model can fit in RAM. In my case, any model fitting in the vRAM of my GPU is fast. Any model not fitting in the vRAM is considerably slower.

For example, a simple question with a small model with GPU and fitting in vRAM can output 50-60 tokens/s. The same question with large models fitting only in system RAM and using CPU can output only 2-3 tokens/s.
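To put numbers like these on your own machine: when `stream` is false, the generate response reports `eval_count` (tokens produced) and `eval_duration` (nanoseconds), so tokens/s can be computed directly. A minimal sketch, assuming a local server on the default port and an already-pulled model (the model name is illustrative):

```python
import requests

# Measure generation throughput (tokens/s) for a single prompt.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "tinyllama",                    # illustrative model name
        "prompt": "what is the meaning of life",
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
data = resp.json()

tokens = data["eval_count"]                      # tokens generated
seconds = data["eval_duration"] / 1e9            # eval_duration is in nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/s")
```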


@jmorganca commented on GitHub (Mar 11, 2024):

Hi there, CPU is definitely going to be the bottleneck here. I'm not sure there's much we can do regarding this issue other than continue making Ollama faster 😊. Thanks so much for the issue!

Reference: github-starred/ollama#63875