[GH-ISSUE #15822] MLX runner failed with qwen3.6:35b-a3b-coding-bf16 format=json #56595

Open
opened 2026-04-29 11:04:27 -05:00 by GiteaMirror · 3 comments

Originally created by @dujeonglee on GitHub (Apr 26, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15822

What is the issue?

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.6:35b-a3b-coding-bf16",
  "messages": [{"role":"user","content":"hi"}],
  "format": "json",
  "stream": false
}'
```

Relevant log output

```shell
{"error":"mlx runner failed: time=2026-04-26T21:12:59.383+09:00 level=INFO source=pipeline.go:129 msg=\"Prompt processing progress\" processed=10 total=11"}
```

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

ollama version is 0.21.2

GiteaMirror added the bug label 2026-04-29 11:04:27 -05:00

@dujeonglee commented on GitHub (Apr 26, 2026):

This is 100% reproducible with the query above, and only with the `qwen3.6:35b-a3b-coding-bf16` model.


@andreinknv commented on GitHub (Apr 26, 2026):

Check whether https://github.com/ollama/ollama/pull/15793 fixes this for you, or whether it is a different bug.


@andreinknv commented on GitHub (Apr 28, 2026):

Following up: I dug into this further and isolated a separate root cause that affects all `qwen3.6:35b-a3b-*` (and `qwen3.5:35b-a3b-*`) variants on the MLX runner: the `gated_delta_step` Metal kernel writes the recurrent state back as **`InT`** (the input dtype, bf16 here) instead of **`StT`** (fp32). The reference `mlx_lm.models.gated_delta` uses a separate `StT` template arg and keeps the state in fp32 between recurrent steps; without that, every decode step round-trips the state through bf16, and the linear-attention layers' state degrades quickly.

Filed as a focused report with patch and before/after evidence here: **#15865**.

For the `format=json` case in this thread specifically: I expect json mode triggers a different sampling/grammar path and may be a separate bug on top, but it is worth retesting once the kernel patch lands.
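The degradation from round-tripping recurrent state through a low-precision dtype can be sketched numerically. This is a toy NumPy model, not Ollama's or MLX's actual code: the `s <- g*s + dv` update, the constants, and the function name are illustrative, and fp16 stands in for bf16 since NumPy has no bfloat16 dtype.

```python
import numpy as np

def recur_state_error(steps=500, dim=64, round_trip=False, seed=0):
    """Max abs error of a toy gated recurrent state s <- g*s + dv after
    `steps` decode steps, measured against a float64 reference.

    round_trip=True casts the state to fp16 after every step, a stand-in
    for a kernel that writes state back in the bf16 input dtype."""
    rng = np.random.default_rng(seed)
    s = np.zeros(dim, dtype=np.float32)    # state nominally kept in fp32
    ref = np.zeros(dim, dtype=np.float64)  # high-precision reference
    for _ in range(steps):
        g = rng.uniform(0.90, 0.999, dim)  # per-channel decay gate
        dv = rng.normal(0.0, 0.01, dim)    # delta-rule style update
        ref = g * ref + dv
        s = (g * s + dv).astype(np.float32)
        if round_trip:
            # The buggy pattern: state leaves every step in the
            # low-precision input dtype instead of staying fp32.
            s = s.astype(np.float16).astype(np.float32)
    return float(np.abs(s - ref).max())

err_fp32 = recur_state_error(round_trip=False)
err_lowp = recur_state_error(round_trip=True)
print(err_fp32, err_lowp)  # the round-tripped state drifts much further
assert err_lowp > err_fp32
```

The point of the sketch is only that per-step quantization error compounds across decode steps, which matches the "degrades quickly" behavior described above; the real fix is keeping the kernel's state in the `StT` (fp32) dtype.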


Reference: github-starred/ollama#56595