[GH-ISSUE #15266] Qwen 35b nvfp4 mlx infinite thinking loop #71824

Open
opened 2026-05-05 02:38:23 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @Urcherd on GitHub (Apr 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15266

What is the issue?

Command: ollama run qwen3.5:35b-a3b-coding-nvfp4 --verbose <<< "Write history in 500 lines"
Result: Error: mlx runner failed: time=2026-04-03T12:31:06.328+03:00 level=INFO source=cache.go:126 msg="cache hit" total=18 matched=18 cached=17 left=1
Env: Apple M4 Max | 64GB | Tahoe 26.3 (25D125)
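A possible client-side mitigation while the loop itself is investigated: cap the response length and raise the repetition penalty so a degenerate "(Wait…)" loop cannot run for the full 10-minute timeout. `num_predict` and `repeat_penalty` are documented Ollama generation options; the sketch below only builds the request body for the standard `/api/generate` endpoint (the specific values are arbitrary guesses, not a confirmed fix):

```python
import json

# Sketch of a /api/generate request body with anti-runaway defaults.
# num_predict hard-caps generated tokens; repeat_penalty discourages
# the repeating "(Wait, I'll write the lines)" pattern seen above.
payload = {
    "model": "qwen3.5:35b-a3b-coding-nvfp4",  # model from this report
    "prompt": "Write history in 500 lines",
    "options": {
        "num_predict": 2048,      # arbitrary cap, tune as needed
        "repeat_penalty": 1.15,   # arbitrary value, tune as needed
    },
    "stream": False,
}

# This body would be POSTed to http://localhost:11434/api/generate,
# e.g.: curl http://localhost:11434/api/generate -d "$(cat payload.json)"
print(json.dumps(payload, indent=2))
```

This does not fix the underlying sampling/template problem, but it bounds how long a looping request can hold the runner.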

Relevant log output

Server Log:
[GIN] 2026/04/03 - 12:31:06 | 200 |      53.625µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/04/03 - 12:31:06 | 200 |   85.233083ms |       127.0.0.1 | POST     "/api/show"
time=2026-04-03T12:31:06.327+03:00 level=INFO source=server.go:183 msg=ServeHTTP method=GET path=/v1/status took=14.833µs status="200 OK"
time=2026-04-03T12:31:06.328+03:00 level=INFO source=cache.go:126 msg="cache hit" total=18 matched=18 cached=17 left=1
[GIN] 2026/04/03 - 12:41:06 | 200 |         10m0s |       127.0.0.1 | POST     "/api/generate"
time=2026-04-03T12:41:06.335+03:00 level=INFO source=server.go:183 msg=ServeHTTP method=POST path=/v1/completions took=10m0.004317333s status="200 OK"
time=2026-04-03T12:41:06.419+03:00 level=INFO source=pipeline.go:55 msg="peak memory" size="21.20 GiB"
time=2026-04-03T12:41:06.419+03:00 level=INFO source=runner.go:149 msg="Request terminated" error="context canceled"

Output Log:
 *(Wait, I'll write the history now).*
    *(Okay).*
    *(Wait, I'll number them).*
    *(Okay).*
    *(Wait, I'll start).*
    *(Okay).*

    *(Wait, I need to be careful not to produce a wall of text that looks b
bad).*
    *(I'll make it readable).*
    *(Okay).*

    *(Wait, I'll start the output).*
    *(Okay).*
    *(Wait, I'll write the lines).*
    *(Okay).*
    *(Wait, I'll make sure I reach 500).*
    *(Okay).*

    *(Wait, I'll write the lines now).*
    *(Okay).*

    *(Wait, I'll start).*
    *(Okay).*

    *(Wait, I'll write the lines).*
    *(Okay).*

And the last 3000 lines are:
    *(Wait, I'll write the lines).*
    *(Okay).*

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.20.0
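One way to experiment with anti-repetition defaults without passing options on every request is a derived Modelfile. `FROM` and `PARAMETER` are standard Ollama Modelfile directives; the derived model name and the parameter values here are illustrative guesses, not a confirmed fix:

```
# Hypothetical Modelfile: derive a variant with anti-repetition defaults.
FROM qwen3.5:35b-a3b-coding-nvfp4
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 2048
```

Built with `ollama create qwen-coding-capped -f Modelfile`, then run as `ollama run qwen-coding-capped`.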

GiteaMirror added the bug label 2026-05-05 02:38:23 -05:00

@andrenaP commented on GitHub (Apr 13, 2026):

Having the same issue using qwen3.5:latest

Output Log

system-prompt="Keep your thinking process under 5 sentences."


Wait, I need to make sure I don't exceed 5 sentences in the thinking block.

Thinking block:

(doing thinking)

Okay, that's 5 sentences. I need to be under 5.

Let's combine.

Revised Thinking:

(doing thinking...)

That is 5 sentences. I need to be under 5 sentences (so max 4).

Let's merge 1 and 2.

(doing thinking...)

That's 3 sentences.

Wait, I need to output the code immediately after my initial plan.

The constraint says "Keep your thinking process under 5 sentences."

"Output the final code immediately after your initial plan."

So I write the thinking, then the code.

Okay, ready.

Wait, one more check: "Do not repeat logical checks."

Okay.

Wait, I need to make sure the code is valid.

(It keeps going until it loops always returning to `Wait, I need`)
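Until a model-side fix lands, a streaming client could detect this degenerate state and abort the request early. A minimal sketch of such a detector (the window size, n-gram length, and threshold are arbitrary heuristics of this sketch, not Ollama parameters):

```python
from collections import Counter

def is_looping(text: str, tail_chars: int = 2000,
               ngram: int = 6, threshold: int = 8) -> bool:
    """Heuristic: flag output whose recent tail repeats one word n-gram
    many times, as in the 'Wait, I'll write the lines' transcript above."""
    words = text[-tail_chars:].split()
    if len(words) < ngram:
        return False
    grams = Counter(tuple(words[i:i + ngram])
                    for i in range(len(words) - ngram + 1))
    return grams.most_common(1)[0][1] >= threshold

# The looped transcript repeats the same phrase pair endlessly:
looped = "*(Wait, I'll write the lines).* *(Okay).* " * 50
assert is_looping(looped)
assert not is_looping("A normal, varied paragraph of model output.")
```

A streaming caller would append each `response` chunk from `/api/generate` to a buffer, call `is_looping` on it, and close the connection when it returns true, instead of waiting out the 10-minute timeout.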

OS

Arch Linux

GPU

Nvidia (4 GB VRAM)

CPU

Intel (16 GB RAM)

Ollama version

0.20.5


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15266
Analyzed: 2026-04-18T18:22:46.612020

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


Reference: github-starred/ollama#71824