[GH-ISSUE #1578] Ollama order of magnitude slower on Apple M1 vs Llama.cpp #47379

Closed
opened 2026-04-28 03:39:09 -05:00 by GiteaMirror · 7 comments

Originally created by @svilupp on GitHub (Dec 18, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1578

First of all, thank you for the amazing app!

Observation: When I run the same prompt through the latest Ollama and through llama.cpp, generation with Ollama is an order of magnitude slower.

  • With Ollama, GPU usage during generation sits at 0% and only occasionally jumps to ~40%
  • With llama.cpp, GPU usage during generation sits constantly at ~99%

Setup:

  • Device: Apple M1 Pro, 32 GB RAM, raised the wired memory limit so Mixtral fits
  • System: Ventura 13.6
  • Model: dolphin-mixtral:8x7b-v2.5-q4_K_M

Prompt: "Count to 5 and say hi"

Ollama: ollama run dolphin-mixtral:8x7b-v2.5-q4_K_M "Count to 5 then say hi." --verbose

First, I will start by counting from 1 to 5.

  1. One
  2. Two
  3. Three
  4. Four
  5. Five

Now that I have counted to 5, let me say hi! Hi there!

total duration: 5m3.16583525s
load duration: 33.760953875s
prompt eval count: 35 token(s)
prompt eval duration: 24.710485s
prompt eval rate: 1.42 tokens/s
eval count: 54 token(s)
eval duration: 4m4.681389s
eval rate: 0.22 tokens/s

Llama.cpp: ./main -m .ollama/models/blobs/sha256:34855d29fd5901f6ed6fe8112a80dc137bafdeb135d89bf75f9b171e62980ac2 --prompt "[INST] Count to 5 and then say hi. [INST]"

1
2
3
4
5
Hi!
<...it goes on about something else for a bit...it has some stopping issues>

llama_print_timings: load time = 5242.30 ms
llama_print_timings: sample time = 38.25 ms / 425 runs ( 0.09 ms per token, 11109.95 tokens per second)
llama_print_timings: prompt eval time = 800.60 ms / 17 tokens ( 47.09 ms per token, 21.23 tokens per second)
llama_print_timings: eval time = 25695.06 ms / 424 runs ( 60.60 ms per token, 16.50 tokens per second)
llama_print_timings: total time = 26599.97 ms
ggml_metal_free: deallocating
Log end

Any idea what I could be doing wrong?


@heresyrj commented on GitHub (Dec 18, 2023):

Curious too. How do I turn on Apple Metal for Ollama inference?
I tried:
sudo sysctl iogpu.wired_limit_mb=26624
which helped somewhat.


@svilupp commented on GitHub (Dec 18, 2023):

Apologies, but it seems to be a duplicate of:

  • https://github.com/jmorganca/ollama/issues/1556
  • https://github.com/jmorganca/ollama/issues/1557

I searched for M1 and llama.cpp but clearly missed these :(

Closing!


@svilupp commented on GitHub (Dec 20, 2023):

So I've resolved the issue -- it's because Ollama by default offloads only one GPU layer (see: https://github.com/jmorganca/ollama/blob/23dc1793500c1e8d9709fb6ed57537f9010a0b84/docs/modelfile.md?plain=1#L141).

If you set num_gpu=99, you get performance similar to llama.cpp.

Not sure what the rationale is for this default.
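For illustration, a minimal way to try this from an interactive ollama run session, assuming the /set parameter command of the Ollama CLI (present in recent releases; older versions may differ):

ollama run dolphin-mixtral:8x7b-v2.5-q4_K_M
>>> /set parameter num_gpu 99
>>> Count to 5 then say hi.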


@vbcoach commented on GitHub (Dec 6, 2024):

@svilupp can you tell us where you set num_gpu=99?


@svilupp commented on GitHub (Dec 6, 2024):

I just passed it as a keyword argument with the request, e.g., https://siml.earth/PromptingTools.jl/dev/examples/working_with_ollama#Local-models-with-Ollama.ai

But it's quite old so maybe the API has changed!

EDIT: Sorry, as per the docs it was under the key "options"! Not directly at the top-level of the request.
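For reference, a minimal sketch of such a request against the local REST API, with num_gpu nested under "options" (endpoint and field names follow Ollama's /api/generate documentation; adjust if the API has changed since):

curl http://localhost:11434/api/generate -d '{
  "model": "dolphin-mixtral:8x7b-v2.5-q4_K_M",
  "prompt": "Count to 5 then say hi.",
  "options": { "num_gpu": 99 }
}'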


@vbcoach commented on GitHub (Dec 6, 2024):

Got it, thanks. I'm going to try rebuilding with a Modelfile that includes this setting and see if things are faster on my M3/64GB.
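A sketch of that Modelfile approach, assuming the PARAMETER syntax from the modelfile docs linked earlier (the qwen2.5-coder-gpu name is made up here for illustration):

# Modelfile
FROM qwen2.5-coder:32b
PARAMETER num_gpu 99

ollama create qwen2.5-coder-gpu -f Modelfile
ollama run qwen2.5-coder-gpu --verbose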


@vbcoach commented on GitHub (Dec 6, 2024):

Actually, running qwen2.5-coder:32b for a couple of simple questions used 94% of my GPU. So I think Ollama might be optimized already.
