[PR #13810] [MERGED] model: add MLA absorption for glm4moelite #45648

Closed
opened 2026-04-25 01:18:23 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13810
Author: @jmorganca
Created: 1/21/2026
Status: Merged
Merged: 1/23/2026
Merged by: @jmorganca

Base: main ← Head: glm4moelite-mla-absorption


📝 Commits (3)

  • 64c3c10 model: add MLA absorption for glm4moelite
  • 846731b ggml: enable MLA flash attention for GLM-4.7-flash
  • f36efbe model: add compatibility validation for glm4moelite architecture

📊 Changes

16 files changed (+522 additions, -23 deletions)

View changed files

📝 convert/convert_glm4moelite.go (+114 -0)
➕ llama/patches/0032-ggml-enable-MLA-flash-attention-for-GLM-4.7-flash.patch (+248 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-mma-f16.cuh (+12 -3)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (+16 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (+8 -4)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_4.cu (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-device.m (+2 -6)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal (+1 -0)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-ops.cpp (+1 -1)
📝 ml/backend/ggml/ggml/src/ggml-metal/ggml-metal.metal (+1 -0)
📝 model/model.go (+14 -0)
📝 model/models/glm4moelite/model.go (+28 -9)
➕ model/models/glm4moelite/model_test.go (+73 -0)

📄 Description

Split the combined KV_B tensor into separate K_B and V_B tensors during conversion, enabling MLA (Multi-head Latent Attention) absorption, which compresses the KV cache to reduce its memory usage.

This PR carries a patch to enable faster execution until https://github.com/ollama/ollama/pull/13832 is merged.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:18:23 -05:00

Reference: github-starred/ollama#45648