[PR #14884] [MERGED] mlx: quantized embeddings, fast SwiGLU, and runtime fixes #14893

Closed
opened 2026-04-13 01:05:05 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14884
Author: @pdevine
Created: 3/16/2026
Status: Merged
Merged: 3/17/2026
Merged by: @pdevine

Base: main ← Head: pdevine/qwen35-speedup


📝 Commits (1)

  • 71e2707 mlx: quantized embeddings, fast SwiGLU, and runtime fixes

📊 Changes

12 files changed (+405 additions, -37 deletions)


📝 x/mlxrunner/mlx/ops_extra.go (+6 -0)
➕ x/mlxrunner/model/embedding.go (+42 -0)
➕ x/mlxrunner/model/embedding_test.go (+78 -0)
📝 x/mlxrunner/server.go (+1 -1)
📝 x/models/gemma3/gemma3.go (+5 -5)
📝 x/models/glm4_moe_lite/glm4_moe_lite.go (+2 -4)
📝 x/models/llama/llama.go (+6 -6)
📝 x/models/nn/nn.go (+54 -1)
📝 x/models/qwen3/qwen3.go (+6 -6)
📝 x/models/qwen3_5/qwen3_5.go (+12 -14)
📝 x/models/qwen3_5/qwen3_5_test.go (+188 -0)
📝 x/tokenizer/tokenizer.go (+5 -0)

📄 Description

Add a QuantizedEmbedding type and an EmbeddingLayer interface so models can use quantized embedding weights and expose tied output projections. This change updates gemma3, glm4_moe_lite, llama, qwen3, and qwen3_5 to use the new interface.
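The PR body doesn't show the new interface, but the idea is that one weight matrix serves both as the token-embedding lookup and, transposed, as the output (logit) projection. As a rough sketch, with hypothetical names that are not the PR's actual API:

```go
package main

import "fmt"

// EmbeddingLayer is a hypothetical sketch of an interface letting a model
// look up token embeddings and reuse the same (possibly quantized) weight
// matrix as a tied output projection.
type EmbeddingLayer interface {
	Forward(tokens []int) [][]float32   // embed token IDs -> vectors
	AsLinear(hidden [][]float32) [][]float32 // tied projection: hidden @ W^T
}

// Embedding stores plain float32 weights; a QuantizedEmbedding would hold
// packed quantized weights and dequantize (or use a quantized matmul)
// inside these same methods.
type Embedding struct {
	weight [][]float32 // [vocab][dim]
}

func (e *Embedding) Forward(tokens []int) [][]float32 {
	out := make([][]float32, len(tokens))
	for i, t := range tokens {
		out[i] = e.weight[t]
	}
	return out
}

func (e *Embedding) AsLinear(hidden [][]float32) [][]float32 {
	out := make([][]float32, len(hidden))
	for i, h := range hidden {
		logits := make([]float32, len(e.weight))
		for v, row := range e.weight {
			var s float32
			for d := range h {
				s += h[d] * row[d]
			}
			logits[v] = s
		}
		out[i] = logits
	}
	return out
}

func main() {
	e := &Embedding{weight: [][]float32{{1, 0}, {0, 2}}}
	emb := e.Forward([]int{1})
	fmt.Println(emb)
	fmt.Println(e.AsLinear(emb))
}
```

Because both directions share one interface, a model such as qwen3_5 can accept either a float or quantized embedding without branching at each call site.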

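The commit title also mentions "fast SwiGLU". The PR doesn't describe the kernel, but SwiGLU itself is the gated activation silu(gate) ⊙ up, where silu(z) = z · sigmoid(z); a fused kernel computes it in one pass rather than as separate activation and multiply ops. A minimal elementwise sketch (illustrative, not the mlx implementation):

```go
package main

import (
	"fmt"
	"math"
)

// silu(z) = z * sigmoid(z)
func silu(z float64) float64 { return z / (1 + math.Exp(-z)) }

// swiGLU applies the gated activation elementwise: silu(gate) * up.
// A fused kernel would do this in a single pass over the two inputs.
func swiGLU(gate, up []float64) []float64 {
	out := make([]float64, len(gate))
	for i := range gate {
		out[i] = silu(gate[i]) * up[i]
	}
	return out
}

func main() {
	// silu(0) = 0, so the first output element is exactly 0.
	fmt.Println(swiGLU([]float64{0, 1}, []float64{3, 2}))
}
```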

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 01:05:05 -05:00

Reference: github-starred/ollama#14893