[GH-ISSUE #2370] 36GB Macbook not using GPU for models that could fit #27136

Closed
opened 2026-04-22 04:06:37 -05:00 by GiteaMirror · 6 comments

Originally created by @WinnieP on GitHub (Feb 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2370

Originally assigned to: @dhiltgen on GitHub.

https://github.com/ollama/ollama/blob/27aa2d4a194c6daeafbd00391f475628deccce72/gpu/gpu_darwin.go#L24C1-L28C3

In older versions of Ollama, certain models would run on the GPU of a 36GB M3 MacBook Pro (specifically the q4_K_M quantization of mixtral). Now, it's running on CPU.
I believe macOS is allowing closer to ~75% of memory to be allocated to the GPU on this machine, not 66%.

```
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
```
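For context, the gpu_darwin.go lines linked above derive usable VRAM from total system memory with a fixed fraction. A minimal Go sketch, not Ollama's actual code, comparing the 2/3 fraction the issue refers to against the limit Metal reported in the log line above (names and structure are hypothetical):

```go
// Hypothetical sketch, not Ollama's actual code: compare a fixed
// 2/3-of-RAM VRAM estimate with the limit Metal reported above.
package main

import "fmt"

func main() {
	const totalRAMMB = 36 * 1024 // the 36GB M3 MacBook Pro from this report

	// Fixed-fraction heuristic (the ~66% the issue refers to).
	heuristicMB := totalRAMMB * 2 / 3

	// What Metal actually reported on this machine (log line above).
	const reportedMB = 28991.03 // recommendedMaxWorkingSetSize

	fmt.Printf("heuristic limit: %d MB (~%.0f%% of RAM)\n",
		heuristicMB, 100*float64(heuristicMB)/totalRAMMB)
	fmt.Printf("Metal reported:  %.0f MB (~%.0f%% of RAM)\n",
		reportedMB, 100*reportedMB/totalRAMMB)

	// A ~26GB q4_K_M Mixtral fits under the reported limit (~28991 MB)
	// but not under the 2/3 heuristic (24576 MB), hence the CPU fallback.
}
```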

GiteaMirror added the gpu label 2026-04-22 04:06:38 -05:00

@padok commented on GitHub (Feb 6, 2024):

I have a similar experience with my 32GB M1 Pro MacBook.

Previously, I was able to use the following model (or its similar sized predecessor) on the GPU:

```
dolphin-mixtral:latest        cfada4ba31c7            26 GB
```

Even though it took some time to load and macOS had to swap out nearly everything else in memory, it ran smoothly and quickly.

However, now that the model is being run on the CPU, the speed has significantly decreased, with performance dropping from 3-6 words/s to just ~0.25 words/s, making it unusable for me.

Given that I was able to run models of this size before, I would argue that even utilizing around 81% of the available memory (~26GB) may be possible.


I cannot remember making any changes to the memory limit using the command:

```
sudo sysctl iogpu.wired_limit_mb=
```

so this could potentially be a behavior specific to my system, rather than a general problem. But it worked!
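For anyone comparing their own machine, both values discussed in this thread can be read programmatically. A minimal Go sketch (an illustration, not part of Ollama) using golang.org/x/sys/unix; since the width of iogpu.wired_limit_mb isn't documented, it is read raw here:

```go
// Minimal macOS-only sketch: read the two sysctls discussed in this
// thread. iogpu.wired_limit_mb reads 0 when the macOS default applies.
package main

import (
	"encoding/binary"
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	memBytes, err := unix.SysctlUint64("hw.memsize") // total physical RAM in bytes
	if err != nil {
		log.Fatal(err)
	}

	// Width of iogpu.wired_limit_mb isn't documented, so read it raw
	// and decode 4 or 8 bytes (Apple Silicon is little-endian).
	raw, err := unix.SysctlRaw("iogpu.wired_limit_mb")
	if err != nil {
		log.Fatal(err) // sysctl only exists on recent macOS
	}
	var limitMB uint64
	switch len(raw) {
	case 4:
		limitMB = uint64(binary.LittleEndian.Uint32(raw))
	case 8:
		limitMB = binary.LittleEndian.Uint64(raw)
	default:
		log.Fatalf("unexpected sysctl size: %d bytes", len(raw))
	}

	fmt.Printf("hw.memsize:           %d MB\n", memBytes>>20)
	fmt.Printf("iogpu.wired_limit_mb: %d (0 = macOS default)\n", limitMB)
}
```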


@ChinChangYang commented on GitHub (Feb 16, 2024):

Could you specify which version of Ollama introduced this issue, where models such as the q4_K_M quantization of mixtral switched from running on the GPU to the CPU, as observed in the referenced code snippet?


@thony-p commented on GitHub (Feb 17, 2024):

I use this [patch](https://github.com/ollama/ollama/pull/2354) so Ollama won't ignore the limit set with:
Thanks to @peanut256!

```shell
sudo sysctl iogpu.wired_limit_mb=26624
```

It would be great if it were merged soon.
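For reference, 26624 MB is 26 × 1024 MB, i.e. a 26 GiB wired limit, roughly 81% of a 32GB machine and in line with the ~81% figure @padok mentions above.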


@peanut256 commented on GitHub (Feb 18, 2024):

#2354 now solves your issue without having to set iogpu.wired_limit_mb (if your system has enough available VRAM by default).


@dhiltgen commented on GitHub (Mar 12, 2024):

It looks like we can mark this closed. If you're still seeing it run on CPU mistakenly, let us know and I'll re-open.


@chigkim commented on GitHub (Apr 9, 2024):

@dhiltgen I'm not sure if this is the same issue, but I'm running Ollama 0.1.31 on a 16" M3 MacBook Pro with 64GB, running Sonoma 14.4.1.

I imported [miqu-1-70b.q5_K_M.gguf](https://huggingface.co/miqudev/miqu-1-70b/blob/main/miqu-1-70b.q5_K_M.gguf), which is 48.8GB.

When I use the model via the API, my memory usage only jumps from 6GB to 16GB, while cached files show 49GB.

Why does it run from cache even though it has enough memory to load the entire model?
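(One plausible explanation, not confirmed in this thread: llama.cpp-based runners memory-map GGUF weights by default, so macOS accounts for the 49GB file as page cache rather than process memory, and pages are wired in only as they are touched.)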

Reference: github-starred/ollama#27136