[GH-ISSUE #10986] Ollama 0.9.0, macOS, gemma3:latest, and vision: Metal acceleration internal error produces inconsistent results #53756

Open
opened 2026-04-29 04:40:59 -05:00 by GiteaMirror · 10 comments

Originally created by @stannenb on GitHub (Jun 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10986

What is the issue?

When using Ollama 0.9.0 on a Mac Studio (M2 Max) to run gemma3:latest to describe an image, the server logs show an internal error, but Ollama continues processing and produces bogus results.

[GIN] 2025/06/05 - 13:03:33 | 200 |  2.125130792s |       127.0.0.1 | POST     "/api/generate"
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)

If one immediately issues a "/set parameter num_gpu 0" command, Ollama processes the image and produces a valid result (the same workaround can be applied per request through the REST API; see the sketch after the transcript below).

❯ ollama run gemma3:latest
>>> describe this image. /Users/xxxx/Downloads/IMG_2001@0.5x.png
Added image '/Users/xxxx/Downloads/IMG_2001@0.5x.png'
This is a screenshot of a text message that is saying "This is a screenshot of a text message that is saying "This is a screenshot of
a text message that is saying ".

The message is a self-referential joke!  It's a way of saying something similar.

>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> describe this image. /Users/xxxx/Downloads/IMG_2001@0.5x.png
Added image '/Users/xxxx/Downloads/IMG_2001@0.5x.png'
Okay, here’s a description of the image:

The image is a close-up portrait of a middle-aged man. He has a pale, somewhat weathered complexion. His most striking features are
his thick, full, and white, slightly unkempt beard and mustache. He is wearing dark, rectangular, aviator-style glasses. He’s looking
directly at the camera with a serious, perhaps slightly skeptical, expression. The background is a blurry, out-of-focus wall,
suggesting the photo was taken indoors. He is wearing a dark, likely gray or black, shirt. The lighting is fairly neutral.
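
The same workaround can be applied per request over the REST API, since num_gpu is accepted in the request options just as it is via /set. Below is a minimal sketch in Go against the default local server; the image path and prompt are placeholders:

```go
// A minimal sketch: force CPU-only inference for a single request by
// passing num_gpu 0 in the request options (placeholder path and prompt).
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	img, err := os.ReadFile("/Users/xxxx/Downloads/example.png") // placeholder
	if err != nil {
		panic(err)
	}
	body, _ := json.Marshal(map[string]any{
		"model":  "gemma3:latest",
		"prompt": "describe this image.",
		"images": []string{base64.StdEncoding.EncodeToString(img)},
		"stream": false,
		"options": map[string]any{"num_gpu": 0}, // mirrors "/set parameter num_gpu 0"
	})
	resp, err := http.Post("http://127.0.0.1:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```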

Relevant log output

ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
time=2025-06-05T13:03:32.861-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=Metal buffer_type=Metal size="1.1 GiB"
time=2025-06-05T13:03:32.861-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=BLAS buffer_type=CPU size="0 B"
time=2025-06-05T13:03:32.861-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T13:03:32.893-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=Metal buffer_type=Metal size="1.1 GiB"
time=2025-06-05T13:03:32.893-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=BLAS buffer_type=CPU size="5.0 MiB"
time=2025-06-05T13:03:32.893-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T13:03:33.844-04:00 level=INFO source=server.go:630 msg="llama runner started in 2.01 seconds"
[GIN] 2025/06/05 - 13:03:33 | 200 |  2.125130792s |       127.0.0.1 | POST     "/api/generate"
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)
[GIN] 2025/06/05 - 13:03:40 | 200 |  2.147018541s |       127.0.0.1 | POST     "/api/chat"
time=2025-06-05T13:06:10.840-04:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="signal: killed"
time=2025-06-05T13:06:10.927-04:00 level=INFO source=server.go:135 msg="system memory" total="64.0 GiB" free="29.8 GiB" free_swap="0 B"
time=2025-06-05T13:06:10.928-04:00 level=INFO source=server.go:168 msg=offload library=cpu layers.requested=0 layers.model=35 layers.offload=0 layers.split="" memory.available="[29.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.4 GiB" memory.required.partial="0 B" memory.required.kv="225.0 MiB" memory.required.allocations="[62.8 MiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="517.0 MiB" memory.graph.partial="1.0 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-06-05T13:06:10.928-04:00 level=WARN source=server.go:199 msg="flash attention enabled but not supported by gpu"
time=2025-06-05T13:06:10.928-04:00 level=WARN source=server.go:222 msg="quantized kv cache requested but flash attention disabled" type=q8_0
time=2025-06-05T13:06:10.970-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/opt/homebrew/Cellar/ollama/0.9.0/bin/ollama runner --ollama-engine --model /Users/saul/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 8192 --batch-size 512 --n-gpu-layers 0 --threads 8 --no-mmap --parallel 2 --port 50865"
time=2025-06-05T13:06:10.972-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T13:06:10.972-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T13:06:10.972-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-05T13:06:10.995-04:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-05T13:06:10.995-04:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:50865"
time=2025-06-05T13:06:11.034-04:00 level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
time=2025-06-05T13:06:11.034-04:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-06-05T13:06:11.051-04:00 level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="3.6 GiB"
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
time=2025-06-05T13:06:11.181-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=Metal buffer_type=Metal size="0 B"
time=2025-06-05T13:06:11.181-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=BLAS buffer_type=CPU size="1.1 GiB"
time=2025-06-05T13:06:11.181-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T13:06:11.215-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=Metal buffer_type=Metal size="0 B"
time=2025-06-05T13:06:11.215-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=BLAS buffer_type=CPU size="1.1 GiB"
time=2025-06-05T13:06:11.215-04:00 level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-05T13:06:11.224-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-06-05T13:06:11.977-04:00 level=INFO source=server.go:630 msg="llama runner started in 1.01 seconds"
[GIN] 2025/06/05 - 13:06:44 | 200 | 34.082673125s |       127.0.0.1 | POST     "/api/chat"

OS

macOS

GPU

M2 Max

CPU

M2 Max

Ollama version

0.9.0

GiteaMirror added the bug label 2026-04-29 04:40:59 -05:00

@cwallen commented on GitHub (Jun 6, 2025):

I'm seeing similar issues:

ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)
panic: failed to sample token: sample: logits sum to NaN, check model output

goroutine 11 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0x1400056b560, {0x10152ed50, 0x14000530640})
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:364 +0x70
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:960 +0x898
time=2025-06-06T12:20:21.615-04:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
[GIN] 2025/06/06 - 12:20:21 | 500 | 45.959320292s |       127.0.0.1 | POST     "/api/generate"

Also get the first 2 lines on their own, sometimes with buffer 0 instead of 1.
Seeing the same errors on qwen2.5vl as well as gemma3
I'm also on Apple M2 Max


@yarmoliq commented on GitHub (Jun 8, 2025):

I also can't get anything from any vision model. It hallucinates random stuff. I'm also on M2 Max, ollama v0.9.0


@cwallen commented on GitHub (Jun 8, 2025):

A bit of non-scientific experimentation this weekend:
The errors in the log that I was seeing for qwen2.5vl:7b seemed to go away entirely with qwen2.5vl:7b-fp16 (I thought they had, but I'm actually still seeing them).
gemma3:4b-it-fp16 might have a lower error rate than gemma3:4b, but it's hard to tell; it's definitely not zero.

@yarmoliq Are you using the CLI, the API, or something else?
If you are getting garbage for every image on every model, that sounds like a higher-level problem than what I'm seeing. Even the error-prone models work for me most of the time.
When I was first playing around with scripting against the API, I was getting garbage on everything; the problem was that my base64-encoded JPEGs were being read as PNGs. Garbage in, garbage out.
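
A minimal Go sketch of the kind of check that would have caught my mistake: sniff the real image type from the file bytes (not the extension) before base64-encoding it for the API. The file name is a placeholder:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/http"
	"os"
)

// encodeImage reads an image file, sniffs the actual content type from the
// leading bytes, and returns the base64 payload for Ollama's "images" field.
func encodeImage(path string) (payload, kind string, err error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	// http.DetectContentType looks at the bytes, so a JPEG misnamed as
	// .png is still reported as image/jpeg.
	kind = http.DetectContentType(data)
	return base64.StdEncoding.EncodeToString(data), kind, nil
}

func main() {
	payload, kind, err := encodeImage("receipt.png") // placeholder path
	if err != nil {
		panic(err)
	}
	fmt.Printf("detected %s, %d base64 chars\n", kind, len(payload))
}
```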


@yarmoliq commented on GitHub (Jun 8, 2025):

I first stumbled upon this issue using the local API. At first I thought that small vision models were just garbage (I fed an image of a receipt to gemma3:27b, asked what it sees, and got responses like "a cat" or "a lion head"). But then I realized they are not garbage and that something weird was going on, so I tried chatting in the CLI (ollama run), but still got garbage results. That is how I got here.

Using /set parameter num_gpu 0 seems to be helping.

Out of 30+ tries I randomly received 1 or 2 actually good results (1 from gemma, 1 from llama), but all the other tries were straight-up 100% hallucination (all tries used the same image).


@yarmoliq commented on GitHub (Jun 8, 2025):

I was also wondering whether the models do some weird cropping that messes everything up. I tried resizing the images myself, but that didn't change anything.


@stannenb commented on GitHub (Jun 10, 2025):

I think there are two issues here:

  1. Metal acceleration for (some) vision models on M2 Max chips is broken. You can work around this by disabling GPU processing with "/set parameter num_gpu 0" (a way to make that setting persistent is sketched below).
  2. Ollama doesn't notice that Metal-accelerated computation is failing and pulls a response out of, well, somewhere. The response has nothing to do with the image or the prompt.

How to debug this any further is beyond my current skill set.
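
As a hedged aside on workaround 1: assuming the Modelfile PARAMETER directive accepts num_gpu the same way /set does, the CPU-only setting can be baked into a named variant so it persists across sessions (gemma3-cpu is just an example name):

```
# Hypothetical persistent workaround; num_gpu as a Modelfile PARAMETER is an assumption.
cat > Modelfile <<'EOF'
FROM gemma3:latest
PARAMETER num_gpu 0
EOF
ollama create gemma3-cpu -f Modelfile
ollama run gemma3-cpu
```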


@smileyboy2019 commented on GitHub (Jun 10, 2025):

The image understanding is just completely wrong.


@cwallen commented on GitHub (Jun 10, 2025):

I've seen one other symptom that I'm not sure is related, but I notice I'm only seeing it on the same models. Most prompts return in a reasonable amount of time, but occasionally one just hangs until the fetch request times out at 5 minutes.

Some extra logging from qwen2.5vl:7b-fp16 with debug:

time=2025-06-10T00:54:13.764-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-06-10T00:54:13.857-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-06-10T00:54:13.857-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=1352 used=0 remaining=1352
time=2025-06-10T00:54:13.873-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Internal Error (0000000e:Internal Error)
time=2025-06-10T00:54:30.838-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=1 cache=0 prompt=1342 used=0 remaining=1342
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Internal Error (0000000e:Internal Error)
panic: failed to sample token: sample: logits sum to NaN, check model output

goroutine 39 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0x14000493560, {0x10605ed50, 0x140004cd7c0})
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:364 +0x70
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:960 +0x898
time=2025-06-10T00:54:37.735-04:00 level=DEBUG source=server.go:1023 msg="stopping llama server" pid=93162

One that didn't panic:

time=2025-06-10T01:13:08.774-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-06-10T01:13:08.774-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=1401 prompt=1349 used=80 remaining=1269
time=2025-06-10T01:13:08.778-04:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-06-10T01:13:30.400-04:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=1 cache=1394 prompt=1341 used=80 remaining=1261
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Internal Error (0000000e:Internal Error)
[GIN] 2025/06/10 - 01:13:45 | 200 | 37.164364417s |       127.0.0.1 | POST     "/api/chat"

Timeout:

time=2025-06-10T01:18:01.186-04:00 level=DEBUG source=cache.go:272 msg="context limit hit - shifting" id=1 limit=4096 input=4096 keep=4 discard=2046
[GIN] 2025/06/10 - 01:18:09 | 500 |          5m0s |       127.0.0.1 | POST     "/api/chat"

Happy to pull more logs if it helps, or figure out how to run a dev build to test.


@yarmoliq commented on GitHub (Aug 25, 2025):

any news?


@cwallen commented on GitHub (Aug 26, 2025):

@yarmoliq At least for me, the PR from #11070 means it no longer gives garbage responses; instead it just throws an error and crashes, so in my scripts I catch the error and retry, and that usually works (a sketch of that loop is below). It would be nice to have a full fix.
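
For anyone wanting the same stopgap, here is a minimal sketch of that catch-and-retry loop in Go, assuming the default local server; the attempt count and backoff are arbitrary:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

// generateWithRetry posts a prepared /api/generate body and retries when the
// runner crashes, which (after #11070) surfaces as an error response rather
// than garbage output.
func generateWithRetry(body []byte, attempts int) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Post("http://127.0.0.1:11434/api/generate",
			"application/json", bytes.NewReader(body))
		if err != nil {
			lastErr = err
		} else {
			out, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && resp.StatusCode == http.StatusOK {
				return out, nil
			}
			lastErr = fmt.Errorf("status %d: %s", resp.StatusCode, out)
		}
		time.Sleep(time.Duration(i+1) * time.Second) // crude linear backoff
	}
	return nil, errors.Join(errors.New("all attempts failed"), lastErr)
}

func main() {
	body := []byte(`{"model":"gemma3:latest","prompt":"describe this image.","stream":false}`)
	out, err := generateWithRetry(body, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```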
