[PR #15755] [MERGED] metal: harden for ggml initialization failures #77586

opened 2026-05-05 10:15:21 -05:00 by GiteaMirror

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15755
Author: @dhiltgen
Created: 4/22/2026
Status: Merged
Merged: 4/30/2026
Merged by: @dhiltgen

Base: `main` ← Head: `metal-harden`


📝 Commits (3)

📝 a0c7663 metal: harden for ggml initialization failures
📝 b76eeea review comments
📝 a0c6d1d review comments

📊 Changes

11 files changed (+450 additions, -109 deletions)


📝 discover/runner.go (+76 -25)
📝 discover/runner_test.go (+26 -0)
📝 llm/server.go (+61 -31)
➕ llm/server_wait_test.go (+31 -0)
📝 llm/status.go (+56 -6)
➕ llm/status_test.go (+44 -0)
📝 ml/backend/ggml/ggml.go (+11 -3)
📝 ml/device.go (+16 -3)
➕ ml/device_test.go (+60 -0)
📝 runner/ollamarunner/runner.go (+62 -29)
📝 x/mlxrunner/client.go (+7 -12)

📄 Description

ggml_metal_device_init performs a probe to verify that the tensor API compiles. On some systems the probe passes even though kernel coverage isn't complete, which leads to a crash later when the real kernels are compiled. This change adds a single retry that disables the tensor API when the error output matches this failure mode. It also hardens an error case in the Go initDevices path so device initialization failures are detected and panic immediately, instead of crashing later on a nil array entry.

Fixes #15734

On my test system the probe disables the feature, so the crash behavior isn't seen. To simulate the bug, I temporarily bypassed the probe so the API stayed enabled and verified the crash; the retry then kicked in properly and got models running. While running that repro, I uncovered some other rough edges and hardened those as well:

  • GPU discovery used to hammer the /info API in failure cases and trigger multiple dummy loads concurrently, which broke the backend and led to a 30s hang/timeout instead of failing fast. This timeout has been a long-running problem; I now understand why, and this change fixes it by synchronizing the dummy load inside the runner, so concurrent /info calls queue up and return the device list once the initial load finishes (or fails once). This fix will likely help many users with unsupported AMD GPUs.
  • StatusWriter was getting concurrent writes from goroutine copies, leading to odd interleaving of the stdout/stderr data. Switched to a common cmd.Stdout/Stderr object, which triggers os/exec to serialize writes.
  • StatusWriter only captured the last error, which was sometimes a generic log message while the more pertinent error had been detected earlier and then overwritten. Switched to an accumulation approach that keeps all matching patterns.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/ollama#77586