[PR #14887] [MERGED] mlxrunner: share KV cache across conversations with common prefixes #20167

Closed
opened 2026-04-16 07:29:03 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14887
Author: @jessegross
Created: 3/16/2026
Status: Merged
Merged: 3/18/2026
Merged by: @jessegross

Base: main ← Head: jessegross/mlx-trie


📝 Commits (2)

  • 18456ba mlxrunner: fix Slice(0, 0) returning full dimension instead of empty
  • 2d9b8c3 mlxrunner: share KV cache across conversations with common prefixes

📊 Changes

12 files changed (+2761 additions, -180 deletions)


📝 x/mlxrunner/cache.go (+484 -101)
📝 x/mlxrunner/cache/cache.go (+243 -39)
➕ x/mlxrunner/cache/cache_test.go (+271 -0)
📝 x/mlxrunner/cache/recurrent.go (+57 -29)
➕ x/mlxrunner/cache/recurrent_test.go (+44 -0)
➕ x/mlxrunner/cache_test.go (+859 -0)
➕ x/mlxrunner/cache_trie.go (+296 -0)
➕ x/mlxrunner/cache_trie_test.go (+455 -0)
📝 x/mlxrunner/mlx/mlx.go (+4 -0)
📝 x/mlxrunner/mlx/slice.go (+24 -8)
📝 x/mlxrunner/pipeline.go (+23 -2)
📝 x/mlxrunner/sample/sample.go (+1 -1)

📄 Description

Enable multiple conversations to reuse cached computations when they share token prefixes (e.g. the same system prompt). A prefix trie tracks shared regions so switching between conversations only recomputes tokens that diverge. Inactive conversation state is paged from active GPU memory to other memory and restored on demand, with LRU eviction to keep memory usage bounded.
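
To illustrate the prefix-matching idea described above, here is a minimal sketch of a token-prefix trie in Go. It is not the code from this PR: the names `prefixTrie`, `insert`, and `longestPrefix` are hypothetical, each node covers a single token, and paging and LRU eviction are left out. The actual implementation in `x/mlxrunner/cache_trie.go` may be structured quite differently.

```go
package main

import "fmt"

// node is one token position in a hypothetical prefix trie.
// A real cache would also attach KV state to nodes; that is omitted here.
type node struct {
	children map[int32]*node
}

func newNode() *node { return &node{children: make(map[int32]*node)} }

// prefixTrie records token sequences that have already been processed.
type prefixTrie struct {
	root *node
}

func newPrefixTrie() *prefixTrie { return &prefixTrie{root: newNode()} }

// insert records that the full token sequence has been seen.
func (t *prefixTrie) insert(tokens []int32) {
	n := t.root
	for _, tok := range tokens {
		next, ok := n.children[tok]
		if !ok {
			next = newNode()
			n.children[tok] = next
		}
		n = next
	}
}

// longestPrefix returns how many leading tokens of the given sequence
// match a previously inserted sequence. Only the tokens after that
// point would need to be recomputed.
func (t *prefixTrie) longestPrefix(tokens []int32) int {
	n := t.root
	for i, tok := range tokens {
		next, ok := n.children[tok]
		if !ok {
			return i
		}
		n = next
	}
	return len(tokens)
}

func main() {
	trie := newPrefixTrie()

	// Two conversations that share the same (made-up) system prompt tokens.
	system := []int32{1, 2, 3, 4}
	convA := append(append([]int32{}, system...), 10, 11)
	convB := append(append([]int32{}, system...), 20, 21, 22)

	trie.insert(convA)

	reused := trie.longestPrefix(convB)
	fmt.Printf("reuse %d tokens, recompute %d\n", reused, len(convB)-reused)
}
```

In this sketch, switching from conversation A to conversation B finds the four shared system-prompt tokens and leaves only the three diverging tokens to recompute. A one-token-per-node trie is used only for readability; a compressed (radix) layout would store runs of tokens per node.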


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 07:29:03 -05:00

Reference: github-starred/ollama#20167