[PR #6735] [MERGED] runner.go: Prompt caching #38083

Closed
opened 2026-04-22 22:45:24 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/6735
Author: @jessegross
Created: 9/10/2024
Status: Merged
Merged: 9/11/2024
Merged by: @jessegross

Base: jmorganca/llama ← Head: jessegross/prompt_cache


📝 Commits (1)

  • 0273dfe runner.go: Prompt caching

📊 Changes

4 files changed (+378 additions, -24 deletions)

View changed files

📝 llama/llama.go (+4 -0)
➕ llama/runner/cache.go (+156 -0)
➕ llama/runner/cache_test.go (+192 -0)
📝 llama/runner/runner.go (+26 -24)

📄 Description

Currently, KV cache entries from a sequence are discarded at the end of each processing run. In a typical chat conversation, this results in each message taking longer and longer to process as the entire history of the conversation needs to be replayed.

Prompt caching retains the KV entries for as long as possible so that only the newest message in the conversation needs to be processed, at least until there are too many simultaneous conversations and something needs to be evicted.
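The mechanism described above can be sketched in a few lines of Go. This is not the PR's actual code (the real implementation lives in llama/runner/cache.go); the types and function names here are hypothetical. The sketch assumes a fixed set of cache slots, picks the slot whose stored tokens share the longest prefix with the incoming prompt so those tokens can be reused, and breaks ties by evicting the least-recently-used slot:

```go
package main

import "fmt"

// cacheSlot is a hypothetical stand-in for one KV-cache slot: the token
// sequence currently stored in it, plus a counter used for LRU eviction.
type cacheSlot struct {
	tokens   []int
	lastUsed int
}

// commonPrefix returns the number of leading tokens the two sequences share.
func commonPrefix(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// findSlot picks the slot with the longest shared prefix for the new prompt;
// ties (including no match at all) fall back to the least-recently-used slot,
// which is effectively evicted. It returns the chosen slot index and how many
// tokens of the prompt can be reused rather than reprocessed.
func findSlot(slots []cacheSlot, prompt []int, clock int) (int, int) {
	best, bestLen := 0, -1
	for i := range slots {
		n := commonPrefix(slots[i].tokens, prompt)
		if n > bestLen || (n == bestLen && slots[i].lastUsed < slots[best].lastUsed) {
			best, bestLen = i, n
		}
	}
	slots[best].tokens = prompt
	slots[best].lastUsed = clock
	return best, bestLen
}

func main() {
	slots := make([]cacheSlot, 2)
	// First conversation turn: nothing is cached, so everything is processed.
	slot, reused := findSlot(slots, []int{1, 2, 3}, 1)
	fmt.Println(slot, reused) // 0 0
	// A follow-up turn extends the same history: only the new tokens are work.
	slot, reused = findSlot(slots, []int{1, 2, 3, 4, 5}, 2)
	fmt.Println(slot, reused) // 0 3
}
```

In a chat workload each new request is the old prompt plus one exchange, so the shared prefix covers nearly the whole history and per-message latency stays roughly constant instead of growing with conversation length.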


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/ollama#38083