[PR #15206] fix(mlx): prevent Metal GPU watchdog crash on large models #77368

Open
opened 2026-05-05 10:02:48 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15206
Author: @iamkritika-official
Created: 4/2/2026
Status: 🔄 Open

Base: main ← Head: fix/metal-command-buffer-watchdog


📝 Commits (1)

  • 9f6e211 fix: reduce prefill chunk size and add GPU sync to prevent Metal watchdog crash

📊 Changes

1 file changed (+4 additions, -2 deletions)


📝 x/mlxrunner/pipeline.go (+4 -2)

📄 Description

Root Cause

The macOS GPU watchdog terminates Metal command buffers that take too long to complete (roughly 1-2 seconds). Large models with long contexts (for example, after web-search results are added to the prompt) were triggering this because:

  1. prefillChunkSize() was returning 2048 tokens — too large for a single Metal command buffer (see the sketch after this list)
  2. The generation loop was only syncing the GPU every 256 tokens — too infrequent, allowing command buffers to accumulate
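
To make the first point concrete, here is a minimal sketch of the chunked-prefill pattern it describes. Aside from the prefillChunkSize() name and the 2 << 10 value, the helpers are hypothetical stand-ins, not the actual x/mlxrunner code:

```go
// Hypothetical sketch of chunked prefill, not the real pipeline.go: each
// forward() call over a chunk is encoded into roughly one Metal command
// buffer, so the chunk size bounds how long a single command buffer runs.
func prefillChunkSize() int { return 2 << 10 } // pre-fix value: 2048 tokens

func prefill(tokens []int32, forward func([]int32)) {
	chunk := prefillChunkSize()
	for start := 0; start < len(tokens); start += chunk {
		end := min(start+chunk, len(tokens)) // min is a Go 1.21+ builtin
		// On a large model, a 2048-token chunk can keep the GPU busy past the
		// ~1-2 s watchdog limit, and macOS kills the command buffer.
		forward(tokens[start:end])
	}
}
```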

Fix

Two changes in x/mlxrunner/pipeline.go:

  1. Reduced prefill chunk size from 2 << 10 (2048) to 512 — smaller chunks mean each Metal command buffer completes faster, staying under the watchdog timeout

  2. Added mlx.Eval(sample) and reduced the cache clear interval from 256 to 64 tokens — forces periodic GPU synchronization during generation so command buffers don't accumulate (see the sketch after this list)
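
A rough sketch of how the two changes fit together, assuming the generation loop has roughly this shape. The 512 and 64 values come from the PR, while decodeNext, eval, clearCache, and Sample are illustrative stand-ins (in the real change the sync call is mlx.Eval(sample)):

```go
// Illustrative sketch of the fix, not the actual x/mlxrunner code.
func prefillChunkSize() int {
	return 512 // was 2 << 10 (2048): shorter chunks finish well under the watchdog limit
}

const syncEvery = 64 // GPU sync / cache clear interval, was 256

// generate decodes up to maxTokens tokens, forcing the GPU to drain pending
// Metal command buffers every syncEvery tokens so they cannot pile up.
func generate(maxTokens int, decodeNext func() Sample, eval func(Sample), clearCache func()) {
	for i := 1; i <= maxTokens; i++ {
		s := decodeNext()
		if i%syncEvery == 0 {
			eval(s)      // in the PR this is mlx.Eval(sample): block on outstanding GPU work
			clearCache() // the cache clear that previously ran every 256 tokens
		}
	}
}

// Sample is a placeholder for whatever type the sampled token has in pipeline.go.
type Sample struct{}
```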

Testing

Needs testing on macOS with Apple Silicon. To reproduce the original crash:

  1. Run ollama run qwen3.5:35b-a3b-coding-nvfp4
  2. Give it a prompt that triggers a web search
  3. Before fix: crash with Metal watchdog error after ~1 minute
  4. After fix: response should complete successfully

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:02:48 -05:00

Reference: github-starred/ollama#77368