[PR #15257] [CLOSED] ggml-metal: fix tensor API probe and bf16/f16 type mismatch on Apple M5 #15096

Closed
opened 2026-04-13 01:10:16 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15257
Author: @AlexWorland
Created: 4/3/2026
Status: Closed

Base: `main` ← Head: `fix/m5-metal-tensor-bf16-mismatch`


📝 Commits (2)

  • b2bec1f ggml-metal: fix tensor API probe and bf16/f16 type mismatch on M5
  • d5d3e86 Merge branch 'main' into fix/m5-metal-tensor-bf16-mismatch

📊 Changes

3 files changed (+2 additions, -14 deletions)


📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-device.m (+2 -2)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal (+0 -6)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal.metal (+0 -6)

📄 Description

Summary

  • Fix tensor API probe tile sizes `(8,8)` → `(16,16)` to match upstream llama.cpp, preventing a false `has_tensor=false` on M5+ hardware
  • Remove mixed `bfloat`/`half` kernel instantiations (`kernel_mul_mm_bf16_f16`, `kernel_mul_mm_id_bf16_f16`) that trigger `static_assert` failures in Apple's MPP framework when the tensor API is enabled

Problem

On Apple M5 devices (MTLGPUFamilyMetal4), two cascading failures prevent the Metal tensor API from being used:

  1. Tensor probe failure: The runtime `matmul2d_descriptor(8, 8, dynamic_extent)` test shader fails to compile, incorrectly setting `has_tensor=false`. Upstream llama.cpp fixed this by using `(16, 16)` tile sizes; a sketch of the change follows this list.

  2. Main library crash: When the probe does pass (or with the tile fix applied), the embedded Metal library fails to compile because `kernel_mul_mm_bf16_f16` and `kernel_mul_mm_id_bf16_f16` mix `bfloat` and `half` operand types in `matmul2d::run()`. Apple's MPP headers enforce strict operand type matching via `static_assert`, causing a SIGABRT during model load; a minimal compile-time illustration follows the Changes table below.
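The probe fix itself is a two-line change, matching the +2/−2 diff stat for `ggml-metal-device.m`. The diff-style sketch below is hedged: the descriptor arguments and `dynamic_extent` come from this PR, while the surrounding comment and the `desc_f16`/`desc_bf16` names are illustrative stand-ins, not the actual source:

```diff
 // Test shader compiled at runtime by ggml_metal_device_init; if it fails
 // to compile, the backend logs "has tensor = false" and the tensor API
 // stays disabled even on M5-class hardware.
-constexpr auto desc_f16  = matmul2d_descriptor(8, 8, dynamic_extent);   // f16 probe
-constexpr auto desc_bf16 = matmul2d_descriptor(8, 8, dynamic_extent);   // bf16 probe
+constexpr auto desc_f16  = matmul2d_descriptor(16, 16, dynamic_extent); // f16 probe
+constexpr auto desc_bf16 = matmul2d_descriptor(16, 16, dynamic_extent); // bf16 probe
```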

Changes

| File | Change |
|------|--------|
| `ggml-metal-device.m` | `matmul2d_descriptor(8,8)` → `(16,16)` in f16 and bf16 tensor probes |
| `ggml-metal.metal` | Remove `kernel_mul_mm_bf16_f16` and `kernel_mul_mm_id_bf16_f16` |
| `ggml-metal-embed.metal` | Same removal in embedded library source |
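Why the mixed kernels are removed outright rather than patched: Apple's MPP headers reject mismatched operand element types at template instantiation, so a single mixed instantiation breaks compilation of the entire embedded library. The following self-contained C++ toy mimics that enforcement; `matmul2d_run`, `bf16_t`, and `f16_t` are hypothetical stand-ins for the real MPP types, not Apple's API:

```cpp
#include <type_traits>

struct bf16_t { unsigned short bits; }; // stand-in for Metal's bfloat
struct f16_t  { unsigned short bits; }; // stand-in for Metal's half

// Toy analogue of mpp::tensor_ops::matmul2d::run(): the real headers
// enforce matching A/B element types with a static_assert, which is what
// kernel_mul_mm_bf16_f16 tripped once the tensor API path was enabled.
template <typename TA, typename TB>
void matmul2d_run(const TA * /*A*/, const TB * /*B*/) {
    static_assert(std::is_same_v<TA, TB>,
                  "matmul2d: operand element types must match");
}

int main() {
    bf16_t a[4] = {}, b[4] = {};
    f16_t  c[4] = {};
    (void)c;                // only feeds the commented-out call below
    matmul2d_run(a, b);     // compiles: bf16 x bf16
    // matmul2d_run(a, c);  // fails to compile, like the removed
                            //   bf16/f16 instantiations
    return 0;
}
```

Because the `static_assert` fires when the embedded Metal library is compiled at model load, the observable symptom is the SIGABRT described above rather than a per-kernel fallback.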

Before / After (Apple M5 Max, gemma4:26b Q4_K_M, macOS 26)

Same prompt (lipsum.com full page, ~1968 tokens), same hardware, cold model load.

Before (stock Ollama 0.20.0)

```
ggml_metal_device_init: has tensor = false
```

```
prompt eval count:    1968 token(s)
prompt eval duration: 1.562097917s
prompt eval rate:     1259.84 tokens/s
eval count:           1072 token(s)
eval duration:        11.81003948s
eval rate:            90.77 tokens/s
```

After (patched build)

```
ggml_metal_device_init: has tensor = true
```

```
prompt eval count:    1968 token(s)
prompt eval duration: 639.683083ms
prompt eval rate:     3076.52 tokens/s
eval count:           1069 token(s)
eval duration:        11.755829738s
eval rate:            90.93 tokens/s
```

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Prompt eval (pp) | 1,260 tok/s | 3,077 tok/s | **+2.4x** |
| Token generation (tg) | 90.77 tok/s | 90.93 tok/s | ~same |
| Prompt eval time | 1.56s | 0.64s | **-59%** |

References

  • Upstream fix: llama.cpp#18456 (https://github.com/ggml-org/llama.cpp/pull/18456)
  • Related PRs: #14604, #14996, #13701
  • Fixes: #14432, #13460, #13867

Test plan

  • Build from source on Apple M5 Max (macOS 26)
  • Verify has tensor = true in debug output
  • Verify Metal library compiles without static_assert errors
  • Run model inference (gemma4:26b via ollama run --verbose)
  • Verify GPU acceleration is active (not CPU fallback)
  • Benchmark: pp 3077 tok/s, tg 91 tok/s (vs stock pp 1260, tg 91)

🤖 Generated with Claude Code


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/ollama#15096