[GH-ISSUE #10163] qwen2.5:72b and llama3:70b not using GPU – extremely slow and consume 40GB+ RAM #32429

Closed
opened 2026-04-22 13:42:15 -05:00 by GiteaMirror · 3 comments

Originally created by @nth347 on GitHub (Apr 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10163

What is the issue?

Hi Ollama team,

I’m encountering an issue when running the qwen2.5:72b and llama3:70b models with Ollama. Instead of utilizing the GPU, both models are using upwards of 40 GB of system RAM and are extremely slow during inference. It appears they're running purely on CPU, despite a compatible GPU being available and working with other models (e.g., gemma3:27b).

I am using macOS 15.3.2 with:

  • CPU: Apple M2 Pro
  • GPU: Apple M2 Pro
  • Memory: 32 GB

Really slow response to simple "hello":

[screenshot](https://github.com/user-attachments/assets/04cbecd7-915d-4617-85d3-b1367c726879)

RAM usage:

[screenshot](https://github.com/user-attachments/assets/c92b1864-bddc-428b-9b02-70bb6472df42)

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.6.4

GiteaMirror added the bug label 2026-04-22 13:42:16 -05:00

@rick-github commented on GitHub (Apr 7, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
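
For reference, a minimal way to pull up those logs on macOS (log location as described in the troubleshooting guide linked above; adjust the path if your install differs):

```shell
# The macOS server log is typically written under ~/.ollama/logs/.
cat ~/.ollama/logs/server.log

# Lines mentioning "offload" or "memory" usually show how many model layers
# were placed on the GPU versus the CPU when the model was loaded.
grep -iE "offload|memory" ~/.ollama/logs/server.log | tail -n 20
```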


@igorschlum commented on GitHub (Apr 9, 2025):

Hi @nth347

Thanks for the detailed report and for testing large models with Ollama.

The behavior you’re seeing is expected given your hardware setup. On macOS, even if your machine has 32 GB of unified memory, not all of it is available to the GPU — the system and other apps also require memory, so in practice, only a portion can be used by models running through the GPU.

Running 70B+ models like qwen2.5:72b or llama3:70b requires significantly more memory than what’s available on a Mac with 32 GB — especially for GPU execution. When there’s not enough memory, Ollama gracefully falls back to CPU execution, which explains the extremely slow performance you’re experiencing.
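
For a rough sense of the numbers (a back-of-the-envelope sketch; the exact figures depend on the quantization used and on how much unified memory macOS leaves for the GPU):

```shell
# Assumption: ~4-bit quantized weights, i.e. roughly 0.5 bytes per parameter.
#   72B parameters * ~0.5 bytes  ->  40+ GB for the weights alone,
#   before adding the KV cache and runtime overhead.
# On a 32 GB machine the system and other apps also need memory, so the
# portion available to the GPU is well under 32 GB and the model cannot fit.

# Compare against the on-disk size of the pulled models:
ollama list
```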

To run these large models efficiently, you’d typically need either:
• A more powerful machine with significantly more memory (96+ GB), or
• To use a smaller model that’s optimized for local inference on your current hardware, like gemma:2b, llama3:8b, or mistral:7b.

Since this isn’t a bug but rather a limitation based on available resources, you might consider closing this issue to help keep the tracker focused on actionable problems.

Hope this helps clarify things — and feel free to ask the community on Discord for model recommendations tailored to your setup!
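
As a quick sanity check with a smaller model (the model name below is only an example; any model that fits comfortably in memory behaves the same way):

```shell
# Run a model that fits in 32 GB of unified memory.
ollama run llama3:8b "hello"

# In another terminal, check where the loaded model is running.
# A fully offloaded model shows "100% GPU" in the PROCESSOR column;
# a CPU fallback shows a CPU percentage instead.
ollama ps
```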


@nth347 commented on GitHub (Apr 11, 2025):

Hi @igorschlum ,

Thank you for your very detailed explanation; I am going to close the case now.
