[GH-ISSUE #1986] Ollama Utilizing Only CPU Instead of GPU on MacBook Pro M1 Pro #26905

Closed
opened 2026-04-22 03:38:01 -05:00 by GiteaMirror · 9 comments

Originally created by @vidvudsc on GitHub (Jan 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1986

Description
I've encountered an issue where Ollama, when running any LLM, uses only the CPU instead of the GPU on my MacBook Pro with an M1 Pro chip. This results in lower model performance than expected.

Environment
MacBook Pro with M1 Pro chip
macOS version: Sonoma 14.2.1
Ollama version: 1.20

No specific error messages are observed.
All dependencies and drivers are up to date.
I would appreciate any guidance or updates regarding this issue. If there are any configurations or settings I might be missing, please let me know.

![Screenshot 2024-01-14 at 08 00 10](https://github.com/jmorganca/ollama/assets/77242455/ee3c0398-37e9-4473-af5b-a3b3253d1662)

P.S. The screenshot was taken while running dolphin-mixtral.

Thanks!

@easp commented on GitHub (Jan 14, 2024):

16GB isn't nearly enough to run dolphin-mixtral at any reasonable speed. The default download is 26GB in size. The computer will have to move more than 10GB of data from the SSD for every token generated.

This isn't really practical when using the GPU (or at all, really), so Ollama falls back to the CPU. Under these conditions the difference between using the CPU and the GPU is insignificant anyway, since most of the time is spent moving data from the SSD.

Because it spends most of its time waiting for data transfers from the SSD, the CPU is largely idle.
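
A back-of-the-envelope calculation shows the scale of the problem, assuming (as an illustrative figure, not a measurement from this machine) roughly 5 GB/s of sustained SSD read bandwidth:

```python
# Rough, illustrative estimate of why a 26GB model on a 16GB machine is SSD-bound.
# The 5 GB/s read bandwidth below is an assumed figure, not measured on this Mac.
model_size_gb = 26            # default dolphin-mixtral download (per the comment above)
ram_gb = 16                   # MacBook Pro in this report
reread_per_token_gb = model_size_gb - ram_gb   # >10GB must come off the SSD per token
ssd_read_gb_per_s = 5.0       # assumed sustained SSD read speed

seconds_per_token = reread_per_token_gb / ssd_read_gb_per_s
print(f"~{seconds_per_token:.1f} s/token, i.e. ~{1 / seconds_per_token:.2f} tokens/s")
# ~2.0 s/token (~0.5 tokens/s): the SSD, not the CPU or GPU, is the bottleneck
```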

The model data is memory mapped, so it isn't accounted for as normal process memory. It shows up instead as wired memory and/or file cache.

In short, your expectations are out of line with the realities of what your computer is capable of and how resource use is accounted for.

As for what you can do: for reasonable performance, run models that fit within the memory that macOS makes accessible to the GPU (about 66% of 16GB by default, roughly 10.5GB). That's not going to be enough for even a ~2-bit quantization of Mixtral.

@jmorganca commented on GitHub (Jan 14, 2024):

Hi there, what @easp mentioned is a great overview of why it uses the CPU right now. At the moment Ollama won't partially use the GPU; it falls back to the CPU entirely. That said, look out for improvements to this in the future. For your setup, smaller models (e.g. `llama2`, `mistral`) should run quite fast on the GPU.

@jmorganca commented on GitHub (Jan 14, 2024):

Also, thanks @easp !

@vidvudsc commented on GitHub (Jan 14, 2024):

@jmorganca @easp Thanks for the help! Really appreciated it.

@mdl054 commented on GitHub (Jan 31, 2024):

> 16GB isn't nearly enough to run dolphin-mixtral at any reasonable speed. The default download is 26GB in size. The computer will have to move more than 10GB of data from the SSD for every token generated.
>
> This isn't really practical when using the GPU (or at all, really) so Ollama falls back to CPU. Under these conditions the difference between using CPU and GPU is insignificant, anyway since most of the time is spent moving data from the SSD.
>
> Because it spends most of the time waiting for data transfer from the SSD, the CPU is largely idle.
>
> The model data is memory mapped and so it's not accounted for in normal process memory. It should be accounted for in wired memory and/or file cache.
>
> In short, your expectations are out of line with realities of what your computer is capable of and how resource use is accounted for.
>
> As for what you can do... For reasonable performance, run models that fit within the memory that MacOS makes accessible to the GPU (66% of 16GB by default, which is about 10.5GB). That's not going to be enough for even a ~2-bit quantization of Mixtral.

Sorry to hijack: does this mean that having more RAM lets you load larger models, or is 16GB a hard limit due to the memory the GPU has available? For example, a Mac with 96GB vs. 16GB.

@easp commented on GitHub (Jan 31, 2024):

@mdl054 If you have more RAM, you can load larger models and have them processed on the GPU. macOS gives the GPU access to two-thirds of system memory on Macs with 36GB or less, and three-quarters on machines with 48GB or more. A 96GB Mac therefore has 72GB available to the GPU. Some of that will be needed for things beyond the model data itself.
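
As a rough sketch of that rule in code (a hypothetical helper based only on the figures above; the behavior for machines between 36GB and 48GB is an assumption):

```python
def gpu_accessible_gb(total_ram_gb: float) -> float:
    """Approximate memory macOS makes available to the GPU on Apple Silicon,
    per the figures in this thread: ~2/3 of RAM at 36GB or less, ~3/4 at 48GB+.
    Machines between 36GB and 48GB are treated as 2/3 here (an assumption)."""
    fraction = 0.75 if total_ram_gb >= 48 else 2 / 3
    return total_ram_gb * fraction

for ram_gb in (16, 36, 48, 96):
    print(f"{ram_gb:>3} GB RAM -> ~{gpu_accessible_gb(ram_gb):.1f} GB GPU-accessible")
# 16 GB -> ~10.7 GB and 96 GB -> ~72.0 GB, matching the numbers in this thread
```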

There is a way to allocate more RAM to the GPU, but as of 0.1.22 Ollama doesn't take it into account.
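
(For reference: on Sonoma this is usually done with the `iogpu.wired_limit_mb` sysctl, e.g. `sudo sysctl iogpu.wired_limit_mb=14336` to allow roughly 14GB. The exact knob isn't named in this thread, so treat that as an assumption, and as noted Ollama 0.1.22 won't take the raised limit into account anyway.)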

@srbt1 commented on GitHub (Mar 8, 2024):

Thanks so much @easp: that answers my question too :)

@iamshreeram commented on GitHub (Jun 29, 2024):

Hi @easp, I'm using Ollama to run models on my older MacBook Pro with an Intel i9 (32GB RAM) and an AMD Radeon GPU (4GB).

Despite setting the environment variable `OLLAMA_NUM_GPU` to 999, inference mostly uses the CPU (around 60%) and not the GPU.

Model I'm trying to run: starcoder2:3b (1.7GB).

Since the model is small, how can I configure inference to use the GPU fully without relying on the CPU at all?

@easp commented on GitHub (Jun 29, 2024):

@iamshreeram It looks like [people are working on GPU support for Intel Macs](https://github.com/ollama/ollama/issues/1016#issuecomment-2178982244), but the official version of Ollama doesn't yet incorporate that support.

Reference: github-starred/ollama#26905