[GH-ISSUE #11070] Ollama 0.9.0: ggml.go fails to check return status of C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph) #7303

Closed
opened 2026-04-12 19:21:12 -05:00 by GiteaMirror · 3 comments

Originally created by @stannenb on GitHub (Jun 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11070

Originally assigned to: @jessegross on GitHub.

What is the issue?

At line 605 of ollama/ml/backend/ggml/ggml.go
(https://github.com/ollama/ollama/blob/main/ml/backend/ggml/ggml.go#L605)

this code:

func (c *Context) Compute(tensors ...ml.Tensor) {
	C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph)
	C.ggml_backend_sched_reset(c.b.sched)

	needSync := true
	sync := func() {
		if needSync {
			C.ggml_backend_sched_synchronize(c.b.sched)
			needSync = false
		}
	}

	for _, t := range tensors {
		if C.ggml_nbytes(t.(*Tensor).t) > 0 {
			t.(*Tensor).sync = sync
		}
	}
}

This code seemingly fails to check the return status of C.ggml_backend_sched_graph_compute_async. Thus, if inference calculations fail, Ollama keeps going and provides nonsensical answers.

In the case I've discovered, the calculations fail as follows:

ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)
panic: GGML status: error (operation failed)

While that error is logged, ollama continues to process the request rather than note some kind of model failure.

See https://github.com/ollama/ollama/issues/10986 for full background, but the tl;dr is: Ollama seems to fail on (some) vision models (gemma3) when running on (at least) an Apple M2 Max CPU with Metal acceleration on. Turning off Metal acceleration with /set parameter num_gpu 0 allows the computation to complete.

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.9.0

GiteaMirror added the bug label 2026-04-12 19:21:12 -05:00

@MrSimonC commented on GitHub (Jun 16, 2025):

I also have this issue with my Mac Mini M2 Pro (but not my Mac Mini M4). I can't directly solve it, but have (with AI's help) illuminated what's going on:

Ollama Metal Backend Failure on Apple M2 Pro with Gemma3

Verified Hardware & Metal Family Support

I have tested the following configurations:

  • Mac M2 Pro:

    • MTLGPUFamilyApple8: Supported
    • MTLGPUFamilyApple9: Not Supported
  • Mac M4:

    • MTLGPUFamilyApple9: Supported

This confirms that the Apple M2 Pro lacks support for MTLGPUFamilyApple9, which may be required by certain Metal kernels used in Ollama's Metal backend.

Issue Summary

When running vision models like gemma3 on a Mac M2 Pro with Metal acceleration enabled, Ollama fails silently. The logs indicate:

ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)

Despite this error, Ollama continues processing and returns nonsensical outputs instead of halting execution. This behavior is due to the Compute function in ggml.go not checking the return status of C.ggml_backend_sched_graph_compute_async:

func (c *Context) Compute(tensors ...ml.Tensor) {
    C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph)
    C.ggml_backend_sched_reset(c.b.sched)

    needSync := true
    sync := func() {
        if needSync {
            C.ggml_backend_sched_synchronize(c.b.sched)
            needSync = false
        }
    }

    for _, t := range tensors {
        if C.ggml_nbytes(t.(*Tensor).t) > 0 {
            t.(*Tensor).sync = sync
        }
    }
}

The lack of error handling here allows the program to proceed despite the failure in GPU computation.

Proposed Fix for Maintainers

To address this issue, the Compute function should be updated to check the return status of C.ggml_backend_sched_graph_compute_async. If the function returns a non-zero value, indicating an error, the program should handle it appropriately, such as by logging the error and halting execution. Here’s a modified version of the function with error checking:

func (c *Context) Compute(tensors ...ml.Tensor) {
    if status := C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph); status != 0 {
        panic(fmt.Sprintf("ggml_backend_sched_graph_compute_async failed with status %d", status))
    }
    C.ggml_backend_sched_reset(c.b.sched)

    needSync := true
    sync := func() {
        if needSync {
            C.ggml_backend_sched_synchronize(c.b.sched)
            needSync = false
        }
    }

    for _, t := range tensors {
        if C.ggml_nbytes(t.(*Tensor).t) > 0 {
            t.(*Tensor).sync = sync
        }
    }
}

This change ensures that any failure in the asynchronous GPU computation is detected and handled, preventing the program from continuing with invalid results.
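As a design note, an alternative to panicking is to propagate an error to the caller, so the server can fail the single request cleanly instead of crashing the runner. A minimal pure-Go sketch of that pattern; ggmlStatus and computeGraph are illustrative stand-ins (assumptions, not the real cgo bindings):

```go
package main

import "fmt"

// ggmlStatus mirrors ggml's C status enum, where 0 means success.
// This type and the stub below are stand-ins for the cgo calls.
type ggmlStatus int

const statusSuccess ggmlStatus = 0

// computeGraph stands in for C.ggml_backend_sched_graph_compute_async.
func computeGraph(fail bool) ggmlStatus {
	if fail {
		return 5 // e.g. a Metal command-buffer failure status
	}
	return statusSuccess
}

// compute returns an error instead of panicking, so the caller can
// surface a model failure to the client rather than crash the process.
func compute(fail bool) error {
	if status := computeGraph(fail); status != statusSuccess {
		return fmt.Errorf("graph compute failed with status %d", status)
	}
	return nil
}

func main() {
	fmt.Println(compute(false)) // <nil>
	fmt.Println(compute(true))  // graph compute failed with status 5
}
```

Whether panic or error return is right depends on how the runner handles failures; an error lets Ollama report a model failure per request, while a panic at least stops nonsensical output.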

Why it blows up on family 8

  1. Tight thread-group scratch limits. Apple’s docs note that family 8 still caps total shared memory per compute kernel at ≤ 64 KiB.
  2. ggml_backend_sched_graph_compute_async() submits one monolithic graph whose peak scratch exceeds that limit; Metal aborts the command buffer with status .error (value 5).
  3. Family 9 GPUs add Dynamic Caching: the driver reallocates scratch on the fly, so the same graph fits without modification.

Workaround

As a temporary solution, you can disable Metal acceleration by setting the number of GPU layers to 0:

/set parameter num_gpu 0

This forces Ollama to use CPU computation, which avoids the Metal backend issue on M2 Pro machines.

Upcoming macOS Release

It’s anticipated that future macOS updates may enhance Metal support on Apple Silicon, potentially adding support for newer GPU families like MTLGPUFamilyApple9 on M2-class devices. This could resolve compatibility issues with certain Metal kernels used in Ollama. Users should keep their systems updated to benefit from these improvements.

📚 References
• Ollama Issue #3698


@jessegross commented on GitHub (Jun 19, 2025):

Thanks for tracking that down - I sent out a PR for it.


@bmachek commented on GitHub (Jul 28, 2025):

Hey there
I got a few questions concerning this.

  • How bad is the performance impact when using the CPU instead of GPU with vision models on an M2?
  • Why is it working on some M1 systems?
  • Can "/set parameter num_gpu 0" be issued via API? My application uses the OpenAI compatible endpoint.

Many thanks in advance.

Reference: github-starred/ollama#7303