[GH-ISSUE #9948] Default Num_GPU Causes Garbage Output On Apple Metal GPUs #32273

Open
opened 2026-04-22 13:22:49 -05:00 by GiteaMirror · 4 comments

Originally created by @jessatech on GitHub (Mar 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9948

What is the issue?

Problem

I have an M3 Max 36GB MacBook Pro running Sequoia 15.2 (24C101)

I've been troubled by garbage outputs from all models run in Ollama for a while, but I believe I just found the cause. I was trying to get Gemma 3 12B running and stumbled across issue #7402, which pointed me to setting num_gpu to 0. That works, but with horrible performance, so this is clearly a problem with offloading layers to the GPU.
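For reference, a minimal way to run that num_gpu 0 test is to pass it as a per-request option through the API (sketch only; the model tag and prompt below are just example values):

```bash
# Sketch: force CPU-only inference for one request by passing num_gpu: 0
# as a request option (model and prompt are example values).
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 0 }
}'
```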

When I saw this working with num_gpu 0, I realized I've seen this same problem in LM Studio, where I had to set num_gpu to one less than the max value (num_gpu = max - 1) to get correct output.

Possible Cause

I'm not sure of the underlying root cause, but poking at the logs I have a hypothesis.

In the logs for loading a model into Ollama I can see the layers being offloaded; a snippet of these logs is included below. For Gemma 3 12B Instruct, we can see that 48 repeating layers are offloaded to the GPU, followed by 1 output layer, for a total of 49 layers.

This means num_gpu = 49 but the model only has 48 repeating layers.

So it seems like offloading the output layer to the GPU (or at least including the output layer in the count for num_gpu) on M series Apple Metal GPUs results in garbage output for some/many models.

In LM Studio, the max offload for Gemma 3 12B is 48, so it looks like they are not counting the output layer in num_gpu for max offload; but when I set it to 48 in LM Studio I also get garbage output. My guess is they are adding the extra layer on the backend, because setting it to max - 1 there also fixes the problem.

Fix

Setting num_gpu = repeating_layers instead of num_gpu = repeating_layers + output_layer fixes the problem entirely. I get good outputs now, and performance is the same as a full offload (I assume because it effectively is a full offload and the output layer is not supposed to be counted), but I'm not sure how to set this programmatically for each model loaded until this is fixed.

Needing to manually look up the number of repeating layers for a specific model and then manually set num_gpu = repeating_layers is tedious. I brought this up with LM Studio as well here: https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/543

Relevant log output

```shell
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
```

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.6.2

GiteaMirror added the bug label 2026-04-22 13:22:49 -05:00

@rick-github commented on GitHub (Mar 23, 2025):

block_count in the model metadata tells you the repeating layer count. You can either get that from the ollama library (e.g. gemma3:12b: https://ollama.com/library/gemma3:12b/blobs/adca500fad9b) or, if using ollama 0.6.2+, with ollama show -v <model>. Unfortunately there's no option for setting a relative number for num_gpu. The best you can do at the moment is compute the best num_gpu for a model and then set it (https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) in the API or Modelfile.

There don't seem to be a lot of reports of this in ollama or llama.cpp issues, which may be a reflection of the user count. Hopefully your detailed bug report will help lead to a quick resolution, thanks.
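As a quick illustration of that lookup (a sketch, not part of the original comment; it assumes the metadata key printed by ollama show -v ends in .block_count, as the grep later in this thread also assumes):

```bash
# Sketch: print the repeating-layer (block) count line for a pulled model.
ollama show -v gemma3:12b | grep -m 1 block_count
```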


@rick-github commented on GitHub (Mar 23, 2025):

#9910 might be relevant.


@jessatech commented on GitHub (Mar 23, 2025):

> The best you can do at the moment is compute the best num_gpu for a model and then set it (https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) in the API or Modelfile.

This was perfect for showing how to set the model file and got me to a good workaround.

I copied the Modelfile from gemma3:12b and then updated the file to include the num_gpu parameter set to 48 to match this model's block_count, and boom, good output.
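A minimal version of that edited Modelfile looks roughly like this (sketch; the full dump also carries the template and other parameters, which should be inherited anyway when FROM points at the existing gemma3:12b model):

```
FROM gemma3:12b
PARAMETER num_gpu 48
```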

So this definitely seems to be something odd with how the default num_gpu is calculated for Apple silicon and Metal. Hopefully the report helps track down the problem, or maybe it's in llama.cpp if that's the base for Ollama.

Immediate Workaround - Gemma

For other models you'd need to use that model's name and set num_gpu to that model's block_count instead.

1. Open a terminal to your desired working folder.
2. Dump the Modelfile from the base model pulled from Ollama:

   ```bash
   ollama show --modelfile gemma3:12b > Modelfile
   ```
3. Find the block_count for the model:

   ```bash
   ollama show -v gemma3:12b \
     | grep -m 1 -E '\.block_count[[:space:]]+[0-9]+' \
     | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//; s/(\.block_count)[[:space:]]+/\1 /'
   ```
4. Open the Modelfile in a text editor and change FROM to `FROM gemma3:12b`.
5. Add the corrected num_gpu parameter to match the block_count: `PARAMETER num_gpu 48`.
6. Save and close the Modelfile.
7. Rebuild:

   ```bash
   ollama create gemma3:ngpu_fix -f Modelfile
   ```
8. Run:

   ```bash
   ollama run gemma3:ngpu_fix
   ```
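To avoid doing that lookup by hand for every model, the steps above can be scripted. The following is only a sketch, not something from this thread: it assumes the metadata key printed by ollama show -v ends in .block_count (as the grep above does), that a minimal Modelfile inheriting FROM the base model is sufficient, and the <name>:ngpu_fix tag is just an example.

```bash
#!/usr/bin/env bash
# Sketch: automate the workaround above for any locally pulled model.
# Assumptions (not from the original thread): the metadata key ends in
# ".block_count", and a minimal Modelfile that inherits FROM the base
# model is sufficient; the ":ngpu_fix" tag is just an example.
set -euo pipefail

MODEL="${1:?usage: $0 <model>, e.g. gemma3:12b}"

# Repeating-layer count from the model metadata (last field of the line).
BLOCKS=$(ollama show -v "$MODEL" \
  | grep -m 1 -E '\.block_count[[:space:]]+[0-9]+' \
  | awk '{print $NF}')

# Minimal Modelfile: inherit everything from the base model and pin
# num_gpu to the repeating-layer count (excluding the output layer).
TMPFILE=$(mktemp)
printf 'FROM %s\nPARAMETER num_gpu %s\n' "$MODEL" "$BLOCKS" > "$TMPFILE"

ollama create "${MODEL%%:*}:ngpu_fix" -f "$TMPFILE"
rm -f "$TMPFILE"
echo "Created ${MODEL%%:*}:ngpu_fix with num_gpu=$BLOCKS"
```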

@voidburn commented on GitHub (Apr 15, 2025):

The workaround still works and is still necessary on 0.6.5.

Reference: github-starred/ollama#32273