[PR #15914] mlxrunner: add decode checkpoints for exact-restore caches #77651

Open
opened 2026-05-05 10:19:43 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15914
Author: @ParthSareen
Created: 5/1/2026
Status: 🔄 Open

Base: main ← Head: parth-mlx-decode-checkpoints


📝 Commits (1)

  • e478be4 mlxrunner: add decode checkpoints for exact-restore caches

📊 Changes

5 files changed (+269 additions, -13 deletions)


📝 x/mlxrunner/cache.go (+55 -11)
📝 x/mlxrunner/cache/cache.go (+19 -0)
📝 x/mlxrunner/cache/recurrent.go (+4 -0)
📝 x/mlxrunner/cache_test.go (+155 -2)
📝 x/mlxrunner/pipeline.go (+36 -0)

📄 Description

Summary

I was suspicious that MLX cache reuse was falling apart specifically around long generations and follow-up turns. The trie remembered generated history as part of its structure, but recurrent/sliding-window cache state was not always restorable at the matched offset, because we were not checkpointing during generation. That meant we could “match” a prior assistant response yet still replay most of it.

This PR adds decode-time checkpoints for MLX cache stacks that need exact restore points, currently recurrent and rotating/sliding-window caches. A normal KV cache doesn't benefit from checkpointing, since it can rewind its KV tensors to an arbitrary matched offset.

Probe

I tested this with pi in headless mode using:

  • qwen3.6:35b-a3b-coding-nvfp4
  • MLX worktree server with debug logs
  • small prompts, exact repeats, prefix variants
  • long prompt repeats and mid-prompt variants
  • long generations followed by exact-history follow-ups
  • fake mid-generation divergence histories
  • tool-call and web-fetch style flows

The key pre-fix signal:

long_generation_exact_followup:
total=430 matched=96 cached=30 left=400

So the trie matched into generated assistant history (matched=96), but cache restore fell back to the prompt-side checkpoint at cached=30.

Long prompt caching was already healthy:

long_prompt_exact_repeat:
total=49623 matched=49623 cached=49619 left=4

That made this look like a decode checkpointing problem, not a general prefill checkpointing problem.

Findings

Recurrent and rotating/sliding-window caches cannot cheaply restore to arbitrary matched offsets:

  • KVCache can slice/rewind in many cases.
  • RecurrentCache needs cumulative state at the exact target.
  • RotatingKVCache can lose arbitrary earlier positions once the window has rotated.

Before this PR, generated tokens advanced the trie on close(), but decode did not proactively create restore snapshots. That left long assistant outputs structurally matchable but not cheaply restorable.
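
To make the contrast concrete, here is a minimal Go sketch of the two restore models; the types and fields are illustrative, not the actual mlxrunner implementation:

```go
// Illustrative sketch only; not the mlxrunner types.

// A plain KV cache keeps one entry per position, so restoring to an
// earlier matched offset is just truncation.
type kvCache struct {
	keys, values [][]float32 // one row per token position
}

func (c *kvCache) rewind(offset int) {
	c.keys = c.keys[:offset]
	c.values = c.values[:offset]
}

// A recurrent cache folds every token into one cumulative state, so
// there is nothing to truncate: restoring to offset N requires a
// snapshot taken at exactly N.
type recurrentCache struct {
	state       []float32         // cumulative, overwritten each step
	checkpoints map[int][]float32 // offset -> saved state
}

func (c *recurrentCache) restore(offset int) bool {
	saved, ok := c.checkpoints[offset]
	if !ok {
		return false // no exact snapshot; fall back to an earlier checkpoint
	}
	c.state = append([]float32(nil), saved...)
	return true
}
```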

What Changed

This PR adds an internal exact-restore capability (a minimal sketch follows the list):

  • RecurrentCache opts in.
  • RotatingKVCache opts in.
  • plain KVCache stays default/no-op.
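
A minimal sketch of how such an opt-in could look; the interface and function names here are hypothetical, not the PR's actual API:

```go
// Hypothetical opt-in interface; the real name in the PR may differ.
type exactRestorer interface {
	// NeedsExactRestore reports whether this cache can only be
	// restored at positions where a snapshot was explicitly taken.
	NeedsExactRestore() bool
}

// The decode loop enables decode-time checkpoints if any layer opts in.
// `layers` stands in for the per-layer cache stack.
func needsExactRestore(layers []any) bool {
	for _, l := range layers {
		if er, ok := l.(exactRestorer); ok && er.NeedsExactRestore() {
			return true
		}
	}
	return false
}
```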

When any cache layer requires exact restore points, the decode loop now creates preserved checkpoints using this adaptive policy (sketched below):

every 64 generated tokens through 512
every 256 generated tokens from 513 through 2048
every 1024 generated tokens after 2048
final checkpoint if generation produced at least 64 tokens
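
Expressed as a pure function, the interval policy looks roughly like this; the helper names are hypothetical:

```go
// checkpointInterval returns the decode-checkpoint spacing for a given
// number of generated tokens, per the tiers above.
func checkpointInterval(generated int) int {
	switch {
	case generated <= 512:
		return 64
	case generated <= 2048:
		return 256
	default:
		return 1024
	}
}

// shouldCheckpoint reports whether the decode loop should snapshot now,
// given tokens generated so far and the last checkpointed position.
func shouldCheckpoint(generated, lastCheckpoint int) bool {
	return generated-lastCheckpoint >= checkpointInterval(generated)
}

// A final checkpoint is also taken at end of generation when at least
// 64 tokens were produced (not shown here).
```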

Prompt checkpoint behavior is unchanged.

After the patch, the same long-generation follow-up improved to:

long_generation_exact_followup:
total=430 matched=96 cached=94 left=336

The first decode checkpoint landed at 94, so restore no longer fell back to the prompt checkpoint at 30.

Tradeoffs

This intentionally only enables decode checkpoints for exact-restore cache stacks.

Costs:

  • More paged-out snapshot memory for recurrent/sliding-window models.
  • More snapshot work on the decode hot path.
  • More preserved trie nodes, which can increase fragmentation.
  • The interval policy is heuristic and may need tuning.
  • This does not fix rendered-history drift from clients. If a client reserializes hidden context, tool results, thinking text, or assistant text differently, the token trie may diverge before a useful decode checkpoint.

I did not expose an option or env knob in this PR.
The first goal is to fix the measured recurrent/sliding-window failure mode without expanding API/config surface.

Tests

Added unit coverage for:

  • recurrent cache restoring inside generated history via decode checkpoints
  • sliding-window cache restoring inside generated history
  • pure KV cache not creating decode checkpoints in auto mode
  • final decode checkpoint creation
  • no snapshot leaks or double closes
  • checkpoint progress starting from the active restore frontier

Ran:

go test ./x/mlxrunner -run 'Test.*Cache|Test.*Snapshot' -count=1
go test ./x/mlxrunner/cache -count=1
go test ./x/mlxrunner -count=1

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:19:43 -05:00

Reference: github-starred/ollama#77651