[PR #15951] mlx: unify MoE expert grouping across MLX model imports #77666

Open
opened 2026-05-05 10:20:31 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15951
Author: @dhiltgen
Created: 5/3/2026
Status: 🔄 Open

Base: main ← Head: gemma4-grouping


📝 Commits (1)

  • 6db30b2 mlx: unify MoE expert grouping across MLX model imports

📊 Changes

14 files changed (+624 additions, -435 deletions)

View changed files

📝 x/create/client/create.go (+8 -1)
📝 x/create/client/create_test.go (+0 -67)
📝 x/create/client/quantize.go (+7 -217)
📝 x/create/create.go (+118 -4)
📝 x/create/create_test.go (+109 -0)
📝 x/create/gemma4.go (+30 -76)
📝 x/create/gemma4_test.go (+55 -18)
📝 x/mlxrunner/mlx/io.go (+9 -0)
📝 x/models/gemma4/gemma4.go (+16 -48)
📝 x/models/glm4_moe_lite/glm4_moe_lite.go (+66 -2)
📝 x/models/laguna/laguna.go (+6 -0)
📝 x/models/qwen3_5/qwen3_5.go (+6 -0)
📝 x/safetensors/extractor.go (+84 -2)
📝 x/safetensors/extractor_test.go (+110 -0)

📄 Description

Make qwen3.5, gemma4, laguna, and glm4_moe_lite produce the same on-disk layout — 3-D switch_mlp stacked expert tensors — regardless of the HF source format or whether the import is quantized.
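
For context, a rough illustration of the unified layout (the exact HF source key names and shapes differ per model; these are only placeholders):

```
# HF-style per-expert source: one 2-D tensor per expert
model.layers.3.mlp.experts.0.gate_proj.weight    [ffn_dim, hidden_dim]
model.layers.3.mlp.experts.1.gate_proj.weight    [ffn_dim, hidden_dim]
...

# unified on-disk layout: one 3-D stacked tensor per projection
model.layers.3.mlp.switch_mlp.gate_proj.weight   [num_experts, ffn_dim, hidden_dim]
```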

Bugs fixed along the way:

  • gemma4's import transform split pre-stacked expert tensors per-expert along axis 0 instead of keeping them stacked. A re-stacking step in the quantizer silently undid the split for quantized imports, masking the bug; unquantized imports leaked the per-expert layout to disk. Fixing this, together with the tensor-ordering fix below, improves gemma4:26b-mlx-bf16 cold TTFT by 75%.

  • glm4_moe_lite's runtime only read per-expert tensor names, but the quantizer always re-stacked into switch_mlp 3-D form, so quantized glm4_moe_lite imports were likely unloadable.

  • gemma4's runtime carried a dead HF-direct-loading branch that read raw *.experts.gate_up_proj / *.moe.gate_proj keys.

  • mlx.SaveSafetensorsWithMetadata wrote tensors in MLX's internal unordered_map iteration order, so two creates of the same model could produce byte-different blobs. The output is now post-processed through safetensors.CanonicalizeFile so tensors land in sorted-name order, which improves gemma4:26b-nvfp4 cold TTFT by 25%.
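
For illustration only, a minimal sketch of the determinism idea; it is not the actual safetensors.CanonicalizeFile implementation, which post-processes the serialized file:

```go
package create

import "sort"

// writeSorted sketches the determinism fix: emit tensors in sorted-name
// order so the output no longer depends on map (or C++ unordered_map)
// iteration order. The emit callback stands in for the real serializer.
func writeSorted(tensors map[string][]byte, emit func(name string, data []byte)) {
	names := make([]string, 0, len(tensors))
	for name := range tensors {
		names = append(names, name)
	}
	sort.Strings(names) // fixed, sorted-name order
	for _, name := range names {
		emit(name, tensors[name])
	}
}
```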

create.StackPerExpertGroup is now the single normalization step: it runs once before packedCreator, using pure-Go byte concatenation, and the quantize-only parsePerExpertInputs / stackAndQuantizeExpertGroup workaround is removed. Per-expert loaders remain in each MoE runtime as backwards-compatible fallbacks for previously published blobs.
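
A hypothetical sketch of that stacking step follows; the real create.StackPerExpertGroup also parses tensor names, groups by layer and projection, and validates dtypes, and the type and function names here are placeholders:

```go
package create

import (
	"fmt"
	"sort"
)

// expertTensor is a hypothetical stand-in for one per-expert 2-D weight:
// raw bytes plus shape, with the expert index already parsed from the name.
type expertTensor struct {
	Index int    // expert index, e.g. from "...experts.17.gate_proj.weight"
	Shape []int  // e.g. [ffn_dim, hidden_dim]
	Data  []byte // raw tensor bytes
}

// stackExperts orders the per-expert tensors by index and concatenates
// their bytes, producing one 3-D tensor of shape [num_experts, rows, cols].
func stackExperts(experts []expertTensor) (shape []int, data []byte, err error) {
	if len(experts) == 0 {
		return nil, nil, fmt.Errorf("no experts to stack")
	}
	sort.Slice(experts, func(i, j int) bool { return experts[i].Index < experts[j].Index })
	first := experts[0]
	data = make([]byte, 0, len(first.Data)*len(experts))
	for _, e := range experts {
		if len(e.Data) != len(first.Data) {
			return nil, nil, fmt.Errorf("expert %d: size mismatch", e.Index)
		}
		data = append(data, e.Data...) // pure byte concatenation along a new leading axis
	}
	shape = append([]int{len(experts)}, first.Shape...)
	return shape, data, nil
}
```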

gemma4 keeps the prefix .moe.switch_mlp.&lt;proj&gt; instead of the .mlp.switch_mlp.&lt;proj&gt; shared by qwen3.5, laguna, and glm4_moe_lite. Switching it would force old clients to upgrade to run the republished model.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:20:31 -05:00

Reference: github-starred/ollama#77666