[PR #15743] [CLOSED] x/mlxrunner: recognise mlx-lm plural aux naming at load time #61984

Closed
opened 2026-04-29 16:56:44 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15743
Author: @jodagreyhame
Created: 4/22/2026
Status: Closed

Base: main ← Head: pr1/mlxrunner-mlx-lm-quant-naming


📝 Commits (1)

  • 3d9e540 x/mlxrunner: recognise mlx-lm plural aux naming at load time

📊 Changes

7 files changed (+294 additions, -28 deletions)

View changed files

📝 x/mlxrunner/model/embedding.go (+14 -3)
📝 x/mlxrunner/model/embedding_test.go (+32 -0)
📝 x/mlxrunner/model/linear.go (+18 -3)
➕ x/mlxrunner/model/linear_test.go (+44 -0)
📝 x/mlxrunner/model/root.go (+6 -1)
📝 x/mlxrunner/runner.go (+68 -21)
➕ x/mlxrunner/runner_test.go (+112 -0)

📄 Description

Summary

Teaches the MLX runner's tensor loading path to accept mlx-lm's sibling-plural aux naming (<module>.scales / <module>.biases) alongside Ollama's existing dot-child singular naming (<weight>.scale / <weight>.bias). Quant-parameter resolution logic and forward-pass code are unchanged; only aux-name recognition is added. For blobs that use the plural form, the aux tensors are now found at load time (so the layer is constructed as quantised instead of falling through to a raw-float dense linear); for blobs that use the singular form, behaviour is identical to main.

Without this change, ollama create --experimental can produce a tensor map where scales exist on disk but under a key that downstream consumers (linear/embedding constructors) don't look up — the layer is constructed as unquantised, silently loading a U32-packed weight as raw float data. The inference-correctness consequences of that are out of scope here and are addressed in a follow-up PR.

Context

Extends #15409 ("mlx: mixed-precision quant and capability detection improvements") for the mlx-lm import case. That PR taught the runner to parse per-tensor quant metadata at model load time and record it during ollama create. Both paths assumed Ollama's dot-child singular aux naming (<weight>.scale / <weight>.bias). This PR extends the same paths to also accept mlx-lm's native sibling-plural convention.

mlx-lm, LM Studio, and any tool using mx.nn.quantize natively emit aux tensors with the sibling-plural convention:

foo.weight
foo.scales      <- NOT foo.weight.scale or foo.weight_scale
foo.biases

Ollama's internal format uses the dot-child singular convention:

foo.weight
foo.weight.scale
foo.weight.bias

ollama create --experimental normally rewrites plural to singular during import, but has been observed emitting occasional orphan blobs that retain the original plural key (reproducible on Qwen3.6 35B-A3B NVFP4 imports for a subset of layers: layers.31.mlp.switch_mlp.gate_proj.scales, layers.20.linear_attn.in_proj_qkv.scales). Normalising at load time makes the runner resilient to both forms and removes the orphan as a failure mode.

What changed

Model constructors accept both aux names

  • x/mlxrunner/model/linear.go — MakeLinearLayer tries <path>.weight_scale first, then falls back to <path>.scales. Same for <path>.weight_qbias / <path>.biases.
  • x/mlxrunner/model/embedding.go — identical fallback in MakeEmbeddingLayer.
  • x/mlxrunner/model/root.go — mainTensorNames filters both singular and plural suffixes (so plural aux keys don't get treated as main tensors).

Runner normalises aux names at load time

  • x/mlxrunner/runner.go — loadTensorsFromManifest Phase 2 remaps both dot-child singular (<weight>.scale) and sibling-plural (<module>.scales) to the canonical <weight>_scale form. Analogous rule for .bias / .biases → <weight>_qbias.

Scope

This PR is strictly about name recognition at load time. It does not change:

  • Per-tensor quantisation parameter resolution (TensorQuantParams, ResolveLinearQuantParams).
  • config.json handling (the quantization block is still ignored).
  • Any model forward-pass code.

The inference-correctness consequence on Qwen3.6-class MoE models is addressed in a separate PR.

A note on model vs architecture names

This PR's motivation references Qwen 3.6 35B-A3B, which is the user-facing model version. The architecture class it's built on (the Python / HF transformer class) is still called Qwen3_5MoeForConditionalGeneration, inherited unchanged from Qwen 3.5. Ollama's source directory x/models/qwen3_5/ follows the architecture name. This PR does not touch that directory.

Tests

New targeted tests — added to the existing same-package test files where present, created where not:

  • x/mlxrunner/model/linear_test.go (new) — TestMakeLinearLayer_MLXLMSiblingQuantized: given {foo.weight, foo.scales, foo.biases} in the tensor map and nil tensorQuant, constructs a *nn.QuantizedLinear with the sibling scales/biases correctly wired.
  • x/mlxrunner/model/embedding_test.go — one added test case with the plural naming, mirroring the existing quantised-embedding test.
  • x/mlxrunner/runner_test.go (new) — TestLoadTensorsFromManifest_NormalisesPluralAux: a fabricated manifest with a blob containing a sibling-plural scales key gets normalised to <weight>_scale in the returned map. Covers both plural and singular input forms plus mixed within a single blob.

These tests exercise behaviour that this PR adds. They will fail on main before the PR is merged.

Test plan

  • go test ./x/mlxrunner/... on macOS arm64 with MLX available — all pass.
  • go test ./x/mlxrunner/... on Linux / CI without MLX — MLX-dependent cases skip via skipIfNoMLX; pure-Go cases pass.
  • Manual: ollama run on a vanilla Ollama-registry-published NVFP4 model — unchanged behaviour, tokens generated.

Files touched

x/mlxrunner/model/embedding.go           +8 lines
x/mlxrunner/model/embedding_test.go     +~30 lines (one new case + shared setup)
x/mlxrunner/model/linear.go              +8 lines
x/mlxrunner/model/linear_test.go        new, ~60 lines (targeted test only)
x/mlxrunner/model/root.go               +~10 lines (filter extension)
x/mlxrunner/runner.go                   +~40 lines
x/mlxrunner/runner_test.go              new, ~80 lines

Risk

Minimal. The fallback branches only fire when the singular key is absent, so Ollama-native models are unaffected. The runner remap preserves the original key as a final fallthrough when neither convention matches, so unknown suffixes pass through unchanged.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:56:44 -05:00

Reference: github-starred/ollama#61984