[PR #14341] [MERGED] Avoid Excessive MLX Memory Usage #25169

Closed
opened 2026-04-19 18:02:56 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14341
Author: @jessegross
Created: 2/21/2026
Status: Merged
Merged: 2/23/2026
Merged by: @jessegross

Base: main ← Head: jessegross/mlx-memory


📝 Commits (2)

  • 35c89dc mlxrunner: Fix memory leaks with pin/sweep lifecycle management
  • 7a0f609 mlxrunner: Simplify KV cache to single-entry prefix matching

📊 Changes

17 files changed (+214 additions, -227 deletions)

📝 x/mlxrunner/cache.go (+41 -70)
📝 x/mlxrunner/cache/cache.go (+13 -2)
📝 x/mlxrunner/mlx/array.go (+60 -65)
📝 x/mlxrunner/mlx/fast.go (+4 -4)
📝 x/mlxrunner/mlx/io.go (+3 -1)
📝 x/mlxrunner/mlx/ops.go (+31 -31)
📝 x/mlxrunner/mlx/ops_extra.go (+15 -23)
📝 x/mlxrunner/mlx/random.go (+1 -1)
📝 x/mlxrunner/mlx/slice.go (+2 -2)
📝 x/mlxrunner/model/base/base.go (+16 -3)
📝 x/mlxrunner/pipeline.go (+23 -7)
📝 x/mlxrunner/runner.go (+4 -4)
📝 x/mlxrunner/server.go (+1 -2)
📝 x/models/gemma3/gemma3.go (+0 -3)
📝 x/models/glm4_moe_lite/glm4_moe_lite.go (+0 -3)
📝 x/models/llama/llama.go (+0 -3)
📝 x/models/qwen3/qwen3.go (+0 -3)

📄 Description

This fixes two issues that caused unbounded memory use with MLX and eventual crashing:

  • Replace error-prone reference counting for array lifecycles with a pin/sweep model: pin the arrays we need (outputs, cache) and sweep everything else. This reduces memory leaks while leveraging MLX's internal reference tracking
  • Simplify the KV cache from a tree structure to single-entry prefix matching, eliminating redundant full-cache copies that caused excessive memory growth during conversations
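
To illustrate the pin/sweep idea from the first bullet, here is a minimal sketch in Go. It is not the code from this PR: `Array`, `Tracker`, `Pin`, `Unpin`, and `Sweep` are hypothetical names chosen for illustration, and freeing is simulated with a boolean flag rather than a call into MLX. The assumption is only that every array created during a step is tracked, and that anything not explicitly pinned is released at the end of the step.

```go
package main

import "fmt"

// Array is a hypothetical stand-in for a handle to native array memory.
type Array struct {
	name   string
	pinned bool
	freed  bool
}

// Tracker records every array created since the last sweep.
type Tracker struct {
	live []*Array
}

// New creates an array and registers it with the tracker.
func (t *Tracker) New(name string) *Array {
	a := &Array{name: name}
	t.live = append(t.live, a)
	return a
}

// Pin marks arrays that must survive the next sweep (for example outputs and cache state).
func Pin(arrs ...*Array) {
	for _, a := range arrs {
		a.pinned = true
	}
}

// Unpin makes arrays eligible for the next sweep again.
func Unpin(arrs ...*Array) {
	for _, a := range arrs {
		a.pinned = false
	}
}

// Sweep frees every tracked array that is not pinned and returns how many were freed.
func (t *Tracker) Sweep() int {
	kept := t.live[:0] // filter in place; writes never overtake reads
	freed := 0
	for _, a := range t.live {
		if a.pinned {
			kept = append(kept, a)
			continue
		}
		a.freed = true // a real implementation would release native memory here
		freed++
	}
	// Drop stale pointers in the tail so the Go GC can collect them.
	for i := len(kept); i < len(t.live); i++ {
		t.live[i] = nil
	}
	t.live = kept
	return freed
}

func main() {
	var tr Tracker
	tr.New("input")
	tr.New("hidden")
	logits := tr.New("logits")
	kv := tr.New("kv")

	Pin(logits, kv)          // keep only what the next step needs
	fmt.Println(tr.Sweep())  // prints 2: input and hidden are freed
	fmt.Println(len(tr.live)) // prints 2: logits and kv remain tracked
}
```

The appeal of this shape is that callers no longer have to balance retain/release pairs on every intermediate value; forgetting to pin something is a visible bug, while forgetting to free something cannot happen.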

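Single-entry prefix matching, as described in the second bullet, can be sketched as follows. This is an illustrative sketch, not the implementation in `x/mlxrunner/cache.go`: `PrefixCache`, `Begin`, `Append`, and `commonPrefix` are hypothetical names, and the cache here stores only token IDs, standing in for the KV state that a real runner would hold for those tokens.

```go
package main

import "fmt"

// commonPrefix returns the number of leading tokens shared by a and b.
func commonPrefix(a, b []int32) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// PrefixCache is a hypothetical single-entry cache. It remembers the tokens
// whose KV state is currently held; there is no tree and no second entry.
type PrefixCache struct {
	tokens []int32
}

// Begin compares the prompt with the cached tokens, truncates the cache to
// the shared prefix, and returns how many tokens were reused along with the
// tokens that still need to be processed.
func (c *PrefixCache) Begin(prompt []int32) (reused int, remaining []int32) {
	reused = commonPrefix(c.tokens, prompt)
	c.tokens = c.tokens[:reused] // discard state past the shared prefix
	return reused, prompt[reused:]
}

// Append records tokens whose KV state has now been computed.
func (c *PrefixCache) Append(toks ...int32) {
	c.tokens = append(c.tokens, toks...)
}

func main() {
	var c PrefixCache

	// First turn: nothing cached, so the whole prompt is processed.
	_, rest := c.Begin([]int32{1, 2, 3, 4})
	c.Append(rest...)

	// Second turn extends the conversation: only the new tokens are processed.
	reused, rest := c.Begin([]int32{1, 2, 3, 4, 5, 6})
	fmt.Println(reused, len(rest)) // prints "4 2"
	c.Append(rest...)

	// A prompt that diverges early reuses only the shared prefix.
	reused, rest = c.Begin([]int32{1, 2, 9})
	fmt.Println(reused, len(rest)) // prints "2 1"
}
```

Because the single entry is truncated and extended in place, a follow-up turn in the same conversation reuses the existing state rather than copying it into a new cache entry.
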
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 18:02:56 -05:00
Reference: github-starred/ollama#25169