[PR #15759] x/mlxrunner: recognise mlx-lm plural aux naming at load time (supersedes #15743) #61992

Open
opened 2026-04-29 16:57:05 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15759
Author: @jodagreyhame
Created: 4/23/2026
Status: 🔄 Open

Base: main ← Head: pr1/mlxrunner-mlx-lm-quant-naming


📝 Commits (1)

  • 620bcf8 x/mlxrunner: recognise mlx-lm plural aux naming at load time

📊 Changes

7 files changed (+310 additions, -27 deletions)

View changed files

📝 x/mlxrunner/model/embedding.go (+14 -3)
📝 x/mlxrunner/model/embedding_test.go (+32 -0)
📝 x/mlxrunner/model/linear.go (+18 -3)
➕ x/mlxrunner/model/linear_test.go (+44 -0)
📝 x/mlxrunner/model/root.go (+6 -1)
📝 x/mlxrunner/runner.go (+69 -20)
➕ x/mlxrunner/runner_test.go (+127 -0)

📄 Description

Supersedes #15743, which was auto-closed when its head branch was transiently deleted from my fork during local cleanup; GitHub does not allow reopening cross-fork PRs once their head ref is deleted, so this is a fresh PR pointing at the same (now restored) branch.

Branch contents are unchanged from the closed PR's last state, plus the Codex review feedback on runner.go:153 (P2: deterministic precedence when both naming conventions target the same canonical key — singular always wins).


Summary

Teaches the MLX runner's tensor loading path to accept mlx-lm's sibling-plural aux naming (<module>.scales / <module>.biases) alongside Ollama's existing dot-child singular naming (<weight>.scale / <weight>.bias). Quant-parameter resolution logic and forward-pass code are unchanged; only aux-name recognition is added. For blobs that use the plural form, the aux tensors are now found at load time (so the layer is constructed as quantised instead of falling through to a raw-float dense linear); for blobs that use the singular form, behaviour is identical to main.

Without this change, ollama create --experimental can produce a tensor map where scales exist on disk but under a key that downstream consumers (linear/embedding constructors) don't look up — the layer is constructed as unquantised, silently loading a U32-packed weight as raw float data. The inference-correctness consequences of that are out of scope here and are addressed in a follow-up PR.

Context

Extends #15409 ("mlx: mixed-precision quant and capability detection improvements") for the mlx-lm import case. That PR taught the runner to parse per-tensor quant metadata at model load time and record it during ollama create. Both paths assumed Ollama's dot-child singular aux naming (<weight>.scale / <weight>.bias). This PR extends the same paths to also accept mlx-lm's native sibling-plural convention.

mlx-lm, LM Studio, and any tool using mx.nn.quantize natively emit aux tensors with the sibling-plural convention:

foo.weight
foo.scales      <- NOT foo.weight.scale or foo.weight_scale
foo.biases

Ollama's internal format uses the dot-child singular convention:

foo.weight
foo.weight.scale
foo.weight.bias

ollama create --experimental normally rewrites plural to singular during import, but has been observed emitting occasional orphan blobs that retain the original plural key (reproducible on Qwen3.6 35B-A3B NVFP4 imports for a subset of layers: layers.31.mlp.switch_mlp.gate_proj.scales, layers.20.linear_attn.in_proj_qkv.scales). Normalising at load time makes the runner resilient to both forms and removes the orphan as a failure mode.
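The collapse of both conventions onto one canonical key can be sketched in Go (helper and canonical names here are illustrative, following the <weight>_scale / <weight>_qbias forms this PR describes; the real mapping lives in the runner):

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalKey maps both aux-naming conventions onto a single canonical
// key: Ollama's dot-child singular (.weight.scale / .weight.bias) and
// mlx-lm's sibling plural (.scales / .biases). Sketch only.
func canonicalKey(name string) string {
	for suffix, canon := range map[string]string{
		".weight.scale": ".weight_scale", // Ollama singular
		".scales":       ".weight_scale", // mlx-lm plural
		".weight.bias":  ".weight_qbias",
		".biases":       ".weight_qbias",
	} {
		if strings.HasSuffix(name, suffix) {
			return strings.TrimSuffix(name, suffix) + canon
		}
	}
	return name // neither convention matched: pass through unchanged
}

func main() {
	// The orphan plural key from the Qwen3.6 repro resolves the same
	// way as the singular form.
	fmt.Println(canonicalKey("layers.31.mlp.switch_mlp.gate_proj.scales"))
	fmt.Println(canonicalKey("foo.weight.scale"))
}
```

The pass-through on no match mirrors the PR's stated behaviour of preserving the original key when neither convention applies.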

What changed

Model constructors accept both aux names

  • x/mlxrunner/model/linear.go — MakeLinearLayer tries <path>.weight_scale first, then falls back to <path>.scales. Same for <path>.weight_qbias / <path>.biases.
  • x/mlxrunner/model/embedding.go — identical fallback in MakeEmbeddingLayer.
  • x/mlxrunner/model/root.go — mainTensorNames filters both singular and plural suffixes (so plural aux keys don't get treated as main tensors).
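The constructor-side fallback described above amounts to an ordered lookup, which can be sketched as (the helper name and tensor-map shape are hypothetical; the actual code is in MakeLinearLayer / MakeEmbeddingLayer):

```go
package main

import "fmt"

// lookupAux returns the first aux tensor found among the candidate
// names: the singular form is listed first, so it is preferred, with
// mlx-lm's sibling-plural form as the fallback. Illustrative sketch.
func lookupAux(tensors map[string][]float32, candidates ...string) ([]float32, string, bool) {
	for _, name := range candidates {
		if t, ok := tensors[name]; ok {
			return t, name, true
		}
	}
	return nil, "", false
}

func main() {
	// An mlx-lm-style blob: plural sibling keys, no singular form.
	tensors := map[string][]float32{
		"foo.weight": {0},
		"foo.scales": {0.5},
		"foo.biases": {0.1},
	}
	_, name, ok := lookupAux(tensors, "foo.weight_scale", "foo.scales")
	fmt.Println(ok, name) // true foo.scales
}
```

Because the fallback only fires when the preferred key is absent, blobs already using the singular convention are untouched.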

Runner normalises aux names at load time

  • x/mlxrunner/runner.go — loadTensorsFromManifest Phase 2 remaps both dot-child singular (<weight>.scale) and sibling-plural (<module>.scales) to the canonical <weight>_scale form. Analogous rule for .bias / .biases → <weight>_qbias. Singular wins deterministically when both target the same canonical key (per Codex P2 review).
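The singular-wins precedence from the Codex P2 review can be sketched as follows (function and variable names are hypothetical, not the actual loadTensorsFromManifest code):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeAuxName reports the canonical key for an aux tensor name and
// whether it came from the singular convention. Sketch only.
func normalizeAuxName(name string) (canonical string, singular, ok bool) {
	switch {
	case strings.HasSuffix(name, ".weight.scale"): // dot-child singular
		return strings.TrimSuffix(name, ".weight.scale") + ".weight_scale", true, true
	case strings.HasSuffix(name, ".weight.bias"):
		return strings.TrimSuffix(name, ".weight.bias") + ".weight_qbias", true, true
	case strings.HasSuffix(name, ".scales"): // mlx-lm sibling plural
		return strings.TrimSuffix(name, ".scales") + ".weight_scale", false, true
	case strings.HasSuffix(name, ".biases"):
		return strings.TrimSuffix(name, ".biases") + ".weight_qbias", false, true
	}
	return name, false, false
}

func main() {
	// fromSingular records, per canonical key, whether the singular
	// form claimed it; a plural form never overwrites a singular one.
	fromSingular := map[string]bool{}
	for _, name := range []string{"foo.scales", "foo.weight.scale"} {
		canon, sing, ok := normalizeAuxName(name)
		if !ok {
			continue
		}
		if sing || !fromSingular[canon] {
			fromSingular[canon] = sing
		}
	}
	fmt.Println(fromSingular["foo.weight_scale"]) // true: singular won
}
```

The outcome is order-independent: whichever form is seen first, the singular entry ends up owning the canonical key, matching the deterministic precedence the PR adds.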

Scope

This PR is strictly about name recognition at load time. It does not change:

  • Per-tensor quantisation parameter resolution (TensorQuantParams, ResolveLinearQuantParams).
  • config.json handling (the quantization block is still ignored).
  • Any model forward-pass code.

The inference-correctness consequence on Qwen3.6-class MoE models is addressed in a separate PR.

A note on model vs architecture names

This PR's motivation references Qwen 3.6 35B-A3B, which is the user-facing model version. The architecture class it's built on (the Python / HF transformer class) is still called Qwen3_5MoeForConditionalGeneration, inherited unchanged from Qwen 3.5. Ollama's source directory x/models/qwen3_5/ follows the architecture name. This PR does not touch that directory.

Tests

  • x/mlxrunner/model/linear_test.go (new) — TestMakeLinearLayer_MLXLMSiblingQuantized.
  • x/mlxrunner/model/embedding_test.go — added TestMakeEmbeddingLayerQuantizedMLXLMSibling.
  • x/mlxrunner/runner_test.go (new) — TestNormaliseAuxNames_PluralAndSingular covering singular dot-child, mlx-lm sibling plural, mixed conventions, dense bias passthrough, plural bias passthrough, and the deterministic singular-wins precedence (Codex review case).

Test plan

  • go test ./x/mlxrunner/... on macOS arm64 with MLX available — all pass.
  • go test ./x/mlxrunner/... on Linux / CI without MLX — MLX-dependent cases skip via skipIfNoMLX; pure-Go cases pass.
  • Manual: ollama run on a vanilla Ollama-registry-published NVFP4 model — unchanged behaviour, tokens generated.

Risk

Minimal. The fallback branches only fire when the singular key is absent, so Ollama-native models are unaffected. The runner remap preserves the original key as a final fallthrough when neither convention matches.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:57:05 -05:00

Reference: github-starred/ollama#61992