[GH-ISSUE #15746] mlx runner: SparseMoE.Forward panic on mlx-lm mixed-precision NVFP4 MoE imports #56549

Open
opened 2026-04-29 11:00:17 -05:00 by GiteaMirror · 1 comment

Originally created by @jodagreyhame on GitHub (Apr 22, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15746

What is the issue?

ollama create --experimental successfully imports an mlx-lm mixed-precision NVFP4 MoE model (e.g. Qwen 3.6 35B-A3B NVFP4), but the first inference request panics in SparseMoE.Forward with index out of range [0] with length 0. The model loads, tensors read fine, and the runner subprocess starts; the crash happens on the first forward pass.

Reproducing

  1. Obtain an mlx-lm NVFP4 MoE build of Qwen 3.6 35B-A3B (e.g. a community huihui/Qwen3.6-35B-A3B-abliterated-lineage NVFP4 conversion on HuggingFace). The config.json must contain a quantization block with per-path group_size/bits overrides on router gates — that's what mlx-lm emits for mixed-precision MoE (see the illustrative block after these steps).
  2. ollama create my-model --experimental -f Modelfile where Modelfile is just FROM /path/to/model-dir.
  3. ollama run my-model "hi" (or hit /api/generate).
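For reference, the quantization block in such an export looks roughly like this (an illustrative sketch of mlx-lm's mixed-precision convention: global group_size/bits plus per-path override objects; exact paths, values, and any extra keys vary by model):

    {
      "quantization": {
        "group_size": 16,
        "bits": 4,
        "model.layers.0.mlp.gate": { "group_size": 64, "bits": 8 }
      }
    }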

Expected

Model runs. (It runs fine under mlx-lm / mlx-vlm directly — this is Ollama-specific.)

Actual

panic: runtime error: index out of range [0] with length 0
goroutine N [running]:
    x/models/qwen3_5/qwen3_5.go: SparseMoE.Forward
    x/mlxrunner/pipeline.go: 
    x/mlxrunner/runner.go:
    golang.org/x/sync/errgroup

The HTTP response surfaces as:

Error: 500 Internal Server Error: mlx runner failed: .../errgroup.go:78 +0x90

Root cause (summary)

mlx-lm stores per-path quantization overrides in config.json's quantization block, not in the tensor blob __metadata__. Ollama's MLX runner only consults blob metadata, which ollama create populates from the global quant params. The MoE router gate (stored as affine 8-bit with BF16 scales and biases at group_size 64) is therefore fed to the NVFP4 dequant kernel at the global group_size 16, producing a zero-shape output; Argpartition on the zero-shape tensor then panics.
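For intuition, here is a minimal Go sketch of the failure mode (hypothetical and simplified, not Ollama's actual code): once the mismatched dequant yields a zero-shape router output, the first indexed access in expert selection reproduces the exact panic text.

    package main

    import "fmt"

    // topExpert stands in for the Argpartition-style expert selection in
    // SparseMoE.Forward (a hypothetical simplification, not the real code).
    func topExpert(scores []float32) int {
        best := 0
        bestScore := scores[0] // panics when dequant returned a zero-shape output
        for i, s := range scores {
            if s > bestScore {
                best, bestScore = i, s
            }
        }
        return best
    }

    func main() {
        var scores []float32 // zero-shape router output from the mismatched group_size
        fmt.Println(topExpert(scores))
        // panic: runtime error: index out of range [0] with length 0
    }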

A secondary contributor: some mlx-lm blobs emit the sibling-plural aux naming (<module>.scales / <module>.biases) rather than Ollama's dot-child singular form (<weight>.scale / <weight>.bias). ollama create rewrites most but occasionally leaves an orphan blob with the plural form, which downstream consumers then fail to look up.
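A minimal sketch of the kind of load-time normalization this calls for (assumed names and mapping, not the actual code from the fix proposed below):

    package main

    import (
        "fmt"
        "strings"
    )

    // normalizeAuxName folds mlx-lm's sibling-plural aux names
    // (<module>.scales / <module>.biases) into Ollama's dot-child singular
    // form (<module>.weight.scale / <module>.weight.bias).
    func normalizeAuxName(name string) string {
        switch {
        case strings.HasSuffix(name, ".scales"):
            return strings.TrimSuffix(name, ".scales") + ".weight.scale"
        case strings.HasSuffix(name, ".biases"):
            return strings.TrimSuffix(name, ".biases") + ".weight.bias"
        }
        return name
    }

    func main() {
        fmt.Println(normalizeAuxName("model.layers.0.mlp.gate.scales"))
        // Output: model.layers.0.mlp.gate.weight.scale
    }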

Environment

  • Ollama: 0.21 / main (reproduced on upstream at the time of filing).
  • OS: macOS, Apple Silicon.
  • Runtime path: ollama create --experimental (the MLX-native path, not the legacy GGUF path).

Proposed fix

Two small PRs, stacked:

  • #15743 — recognise mlx-lm plural aux naming at load time (the secondary contributor).
  • #15744 — read config.json's quantization block and apply per-path overrides in Root.Open (the primary fix; includes a regression guard); a sketch of the lookup follows this list.
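As a sketch of the override lookup the second PR describes (assumed struct shapes and key layout, not the PR's actual code), the quantization block can be decoded as a map whose entries are either global scalars or per-path override objects:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // quantParams mirrors the global group_size/bits pair; per-path entries
    // override it (layout assumed from mlx-lm's mixed-precision output).
    type quantParams struct {
        GroupSize int `json:"group_size"`
        Bits      int `json:"bits"`
    }

    // paramsFor returns the effective params for a tensor path: the per-path
    // override when the quantization block carries one, otherwise the global
    // params that ollama create already propagates.
    func paramsFor(block map[string]json.RawMessage, path string, global quantParams) quantParams {
        if raw, ok := block[path]; ok {
            p := global
            if err := json.Unmarshal(raw, &p); err == nil {
                return p
            }
        }
        return global
    }

    func main() {
        cfg := []byte(`{"group_size":16,"bits":4,"model.layers.0.mlp.gate":{"group_size":64,"bits":8}}`)
        var block map[string]json.RawMessage
        if err := json.Unmarshal(cfg, &block); err != nil {
            panic(err)
        }
        fmt.Println(paramsFor(block, "model.layers.0.mlp.gate", quantParams{GroupSize: 16, Bits: 4}))
        // Output: {64 8}
    }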

Verified: with both PRs applied, the repro model generates tokens cleanly and existing Ollama-registry-published NVFP4 models (e.g. qwen3.6:35b-a3b-nvfp4) continue to work unchanged.

Related prior art: #15409 put the per-tensor-quant-metadata machinery in place assuming overrides live in blob __metadata__. This issue is what happens when the overrides live in config.json instead.


@jodagreyhame commented on GitHub (Apr 23, 2026):

Reopened. The proposed-fix PR (#15744) was auto-closed when its head branch was transiently deleted from my fork; the new PR is #15760 (same branch, with Codex review feedback addressed).

Reference: github-starred/ollama#56549