[PR #8092] [MERGED] llama: Ensure KV cache is fully defragmented. #43869

Closed
opened 2026-04-24 23:26:14 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/8092
Author: @jessegross
Created: 12/14/2024
Status: Merged
Merged: 12/17/2024
Merged by: @jessegross

Base: main ← Head: jessegross/defrag


📝 Commits (1)

  • 6126c4d llama: Ensure KV cache is fully defragmented.

📊 Changes

3 files changed (+289 additions, -61 deletions)


📝 llama/llama.cpp (+46 -53)
➕ llama/patches/0014-llama-Ensure-KV-cache-is-fully-defragmented.patch (+242 -0)
📝 llama/runner/runner.go (+1 -8)

📄 Description

Sometimes the KV cache requires defragmentation even without triggering the threshold heuristic. In this case, decoding will not be able to find a KV cache slot. This is particularly difficult for the caller to handle if it happens in between ubatches. To avoid this, we should immediately trigger a defrag.
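The on-demand retry flow described above can be illustrated with a minimal sketch. All names here (`kvCache`, `findSlot`, `defrag`, `errNoSlot`) are invented for this illustration and are not the identifiers used in llama.cpp; the point is only the control flow: when slot allocation fails, defragment immediately and retry instead of surfacing the error mid-batch.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSlot is a hypothetical sentinel for "no contiguous slot available".
var errNoSlot = errors.New("no KV cache slot available")

type kvCache struct {
	free       int // total free cells
	contiguous int // largest contiguous run of free cells
}

// findSlot reserves a contiguous run of n cells, or fails with errNoSlot.
func (c *kvCache) findSlot(n int) error {
	if c.contiguous < n {
		return errNoSlot
	}
	c.contiguous -= n
	c.free -= n
	return nil
}

// defrag compacts the cache so all free cells form one contiguous run.
func (c *kvCache) defrag() { c.contiguous = c.free }

// decode attempts to place a ubatch of n tokens, defragmenting on demand
// rather than returning a hard-to-handle failure between ubatches.
func (c *kvCache) decode(n int) error {
	if err := c.findSlot(n); errors.Is(err, errNoSlot) {
		c.defrag() // trigger defrag immediately instead of failing
		return c.findSlot(n)
	} else if err != nil {
		return err
	}
	return nil
}

func main() {
	// Enough total free space, but fragmented: decode succeeds only
	// because the on-demand defrag runs first.
	c := &kvCache{free: 32, contiguous: 8}
	fmt.Println(c.decode(16))
}
```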

In addition, a heavily fragmented cache can require more than max_moves to defragment. Currently, we stop when we hit the limit, but this can leave the cache without adequate space even after defragmentation runs. Instead, we should perform multiple batches of moves until defragmentation is complete.
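The multi-batch idea can be sketched as a loop that keeps issuing capped defrag passes until no moves remain, rather than stopping after a single pass. The function names and the move-counting model below are assumptions made for this sketch, not the actual llama.cpp implementation.

```go
package main

import "fmt"

// defragPass performs one defrag batch, moving at most maxMoves cells,
// and returns the number of moves still outstanding afterwards.
func defragPass(remaining, maxMoves int) int {
	if remaining <= maxMoves {
		return 0
	}
	return remaining - maxMoves
}

// defragAll loops defragPass until the cache is fully defragmented,
// returning how many passes were needed. This is the behavioral change
// the PR describes: do not stop at the max_moves cap while the cache
// is still fragmented.
func defragAll(remaining, maxMoves int) int {
	passes := 0
	for remaining > 0 {
		remaining = defragPass(remaining, maxMoves)
		passes++
	}
	return passes
}

func main() {
	// A cache needing 250 moves with a 100-move cap takes three passes,
	// instead of stopping (still fragmented) after one.
	fmt.Println(defragAll(250, 100))
}
```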

Fixes #7949

I also plan to send this patch upstream to llama.cpp.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-24 23:26:14 -05:00

Reference: github-starred/ollama#43869