[PR #15256] [CLOSED] ggml-metal: fix tensor API probe and bf16/f16 type mismatch on M5 #61792

Closed
opened 2026-04-29 16:48:33 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15256
Author: @AlexWorland
Created: 4/3/2026
Status: Closed

Base: main ← Head: fix/m5-metal-tensor-bf16-mismatch


📝 Commits (1)

  • b2bec1f ggml-metal: fix tensor API probe and bf16/f16 type mismatch on M5

📊 Changes

3 files changed (+2 additions, -14 deletions)


📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-device.m (+2 -2)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal (+0 -6)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal.metal (+0 -6)

📄 Description

Summary

  • Fix tensor API probe tile sizes (8,8) → (16,16) to match upstream llama.cpp, preventing a false has_tensor=false on M5+ hardware
  • Remove the mixed bfloat/half kernel instantiations (kernel_mul_mm_bf16_f16, kernel_mul_mm_id_bf16_f16) that trigger static_assert failures in Apple's MetalPerformancePrimitives (MPP) framework when the tensor API is enabled

Problem

On Apple M5 devices (MTLGPUFamilyMetal4), two cascading failures prevent GPU inference:

  1. Tensor probe failure: The runtime matmul2d_descriptor(8, 8, dynamic_extent) test shader fails to compile, incorrectly setting has_tensor=false. Upstream llama.cpp fixed this by using (16, 16) tile sizes.

  2. Main library crash: When the probe does pass (or with the tile fix applied), the embedded Metal library fails to compile because kernel_mul_mm_bf16_f16 and kernel_mul_mm_id_bf16_f16 mix bfloat and half operand types in matmul2d::run(). Apple's MPP headers enforce strict type matching via static_assert, causing a SIGABRT during model load (a minimal sketch of this type mismatch follows below).
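
The failure mode in (2) can be reproduced in miniature on the host. The sketch below is an analogue, not Apple's header code: matmul2d_run_like, half_like, and bfloat_like are hypothetical stand-ins for matmul2d::run(), Metal half, and Metal bfloat, and the static_assert here only mimics the type-matching constraint the MPP headers enforce.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the Metal element types (kept portable for the sketch).
using half_like   = float;   // plays the role of Metal `half`
using bfloat_like = double;  // plays the role of Metal `bfloat`

// Mimic of the MPP constraint: both matmul2d operands must share a single
// element type, checked at shader-compile time.
template <typename TA, typename TB>
void matmul2d_run_like(const TA*, const TB*) {
    static_assert(std::is_same_v<TA, TB>,
                  "matmul2d operands must use the same element type");
    // ... the tensor matmul would run here ...
}

int main() {
    half_like   h{};
    bfloat_like b{};
    matmul2d_run_like(&h, &h);    // OK: f16 x f16 (kernel_mul_mm_f16_f16 analogue)
    matmul2d_run_like(&b, &b);    // OK: bf16 x bf16 (kernel_mul_mm_bf16_bf16 analogue)
    // matmul2d_run_like(&b, &h); // fails to compile: the bf16 x f16 mix that
    //                            // kernel_mul_mm_bf16_f16 attempted; on device this
    //                            // surfaces as a library compile failure and SIGABRT
}
```

Removing the two mixed-type instantiations sidesteps the assert entirely; the matched-type bf16 and f16 paths are unaffected.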

Changes

| File | Change |
|------|--------|
| ggml-metal-device.m | matmul2d_descriptor(8,8) → (16,16) in the f16 and bf16 tensor probes |
| ggml-metal.metal | Remove kernel_mul_mm_bf16_f16 and kernel_mul_mm_id_bf16_f16 |
| ggml-metal-embed.metal | Same removal in the embedded library source |
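
For context, the probe change itself is just the tile pair. The fragment below only models the shape of the call: matmul2d_descriptor_like and kDynamicExtent are hypothetical stand-ins for mpp::tensor_ops::matmul2d_descriptor and dynamic_extent from Apple's headers, whose exact signatures may differ.

```cpp
#include <cstddef>

// Hypothetical stand-in for matmul2d_descriptor: tile M/N sizes plus a K
// extent, which the probe leaves dynamic.
struct matmul2d_descriptor_like {
    std::size_t m, n, k;
};
inline constexpr std::size_t kDynamicExtent = static_cast<std::size_t>(-1);

// Before (false has_tensor=false on M5):  {8, 8, kDynamicExtent}
// After (matches upstream llama.cpp):     {16, 16, kDynamicExtent}
inline constexpr matmul2d_descriptor_like probe_desc{16, 16, kDynamicExtent};

int main() {
    return (probe_desc.m == 16 && probe_desc.n == 16) ? 0 : 1;
}
```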

References

  • Upstream fix: llama.cpp#18456 (https://github.com/ggml-org/llama.cpp/pull/18456)
  • Related PRs: #14604, #14996, #13701
  • Fixes: #14432, #13460, #13867

Test plan

  • Build from source on Apple M5 Max (macOS 26)
  • Verify has tensor = true in debug output
  • Verify Metal library compiles without static_assert errors
  • Run model inference (e.g. ollama run gemma4:26b --verbose)
  • Verify GPU acceleration is active (not CPU fallback)

🤖 Generated with Claude Code


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:48:33 -05:00

Reference: github-starred/ollama#61792