[GH-ISSUE #14076] qwen3-coder-next: Metal GPU crash with 'tensor buffer is nil' on Apple M4 Max (0.15.5-rc2) #55706

Closed
opened 2026-04-29 09:36:52 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @antonelli182 on GitHub (Feb 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14076

What is the issue?

Running qwen3-coder-next on an Apple M4 Max with the Metal GPU backend crashes during inference with:

ggml_metal_buffer_get_id: error: tensor ' (view)' buffer is nil

The model loads successfully into VRAM (~43 layers on GPU, ~6 on CPU), but crashes when attempting to generate tokens.

Workaround: Setting num_gpu: 0 forces CPU inference and works correctly, but is significantly slower.
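
The workaround can also be applied per request; a minimal sketch, assuming the standard options field of /api/generate accepts num_gpu (as documented for Ollama model options):

curl http://localhost:11434/api/generate -d '{"model":"qwen3-coder-next","prompt":"hello","stream":false,"options":{"num_gpu":0}}'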

OS

macOS 15.3 (Sequoia), Apple M4 Max, 64 GB unified memory

GPU

Apple M4 Max (Metal)

  • 51.8 GB available VRAM detected
  • MTLGPUFamilyApple9 / MTLGPUFamilyMetal4

CPU

Apple M4 Max (12 cores)

Ollama version

0.15.5-rc2 (pre-release)

Steps to reproduce

  1. Install Ollama 0.15.5-rc2 pre-release
  2. Pull the model: ollama pull qwen3-coder-next
  3. Attempt any inference:
curl http://localhost:11434/api/generate -d '{"model":"qwen3-coder-next","prompt":"hello","stream":false}'

Relevant server log

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_device_init: GPU name:   Apple M4 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: has tensor            = false
...
time=2026-02-04T13:25:08.090-08:00 level=INFO source=server.go:1387 msg="llama runner started in 18.42 seconds"
ggml_metal_buffer_get_id: error: tensor ' (view)' buffer is nil
ggml_metal_buffer_get_id: error: tensor ' (view)' buffer is nil
[GIN] 2026/02/04 - 13:25:49 | 500 | 59.989722208s | 127.0.0.1 | POST "/api/generate"

Memory allocation (from logs)

  • System memory: 64.0 GiB total, 53.2 GiB free
  • GPU memory: 51.8 GiB available
  • Model weights (Metal): 42.8 GiB
  • Model weights (CPU): 5.4 GiB
  • KV cache (Metal): 7.7 GiB
  • KV cache (CPU): 788.4 MiB
  • Total: 57.4 GiB

Additional context

  • The issue may be related to qwen3next architecture handling in the ggml Metal backend
  • The log line 'tensor API disabled for pre-M5 and pre-A19 devices' suggests the M4 Max is being treated differently from newer devices
  • CPU inference (num_gpu: 0) works correctly at ~15 tok/s; a Modelfile sketch for persisting this workaround follows this list
  • Similar CUDA issues reported in #14068 (Windows)
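
To keep the CPU-only workaround across runs, a minimal Modelfile sketch (the qwen3-coder-next-cpu tag is only an example name; this assumes PARAMETER num_gpu is honored the same way as the API option):

# Modelfile
FROM qwen3-coder-next
PARAMETER num_gpu 0

# create and run the CPU-only variant
ollama create qwen3-coder-next-cpu -f Modelfile
ollama run qwen3-coder-next-cpu "hello"
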
Reference: github-starred/ollama#55706