[PR #15768] runner: expose prompt cache hit count in completion response #46499

Open · opened 2026-04-25 01:55:19 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15768
Author: @anishesg
Created: 2026-04-23
Status: 🔄 Open

Base: `main` ← Head: `fix/ph-issue-15758`


📝 Commits (1)

  • d6b8a1b feat(runner): expose prompt cache hit count in completion response

📊 Changes

2 files changed (+4 additions, -0 deletions)


📝 llm/server.go (+1 -0)
📝 runner/ollamarunner/runner.go (+3 -0)

📄 Description

When the runner processes a completion request, it consults the KV cache to skip re-evaluating prompt tokens that were already processed in a previous turn. The number of tokens actually served from that cache was tracked internally but never returned to callers, making it impossible to observe cache efficiency.
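For context, here is a minimal sketch of the kind of prefix reuse involved. This is an illustration only, not the actual ollama cache logic (which lives in the runner's `loadCache`): the runner compares the new prompt against the cached inputs and only evaluates the tokens past the shared prefix.

```go
package main

import "fmt"

// cachedPrefixLen counts how many leading prompt tokens already match
// the cached inputs; only tokens past this point need re-evaluation.
// Illustrative only -- not the real loadCache implementation.
func cachedPrefixLen(cached, prompt []int32) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n] == prompt[n] {
		n++
	}
	return n
}

func main() {
	cached := []int32{1, 2, 3, 4}
	prompt := []int32{1, 2, 3, 5, 6}
	fmt.Println(cachedPrefixLen(cached, prompt)) // 3 tokens served from cache
}
```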

A new `numCachedInputs` field is added to the `Sequence` struct in `runner/ollamarunner/runner.go`. Immediately after `loadCache` populates `seq.cache.Inputs`, the length of that slice is captured as `seq.numCachedInputs`. That value is then included in the outgoing `CompletionResponse` as `PromptCachedCount` (omitted when zero so existing clients see no change).
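A condensed sketch of the runner-side change follows. Only `Sequence`, `numCachedInputs`, and `seq.cache.Inputs` are named in the PR; the surrounding types are simplified stand-ins for this illustration.

```go
package main

import "fmt"

// input stands in for the runner's per-token input type (simplified).
type input struct{ token int32 }

// inputCache stands in for the runner's cache slot; Inputs holds the
// prompt prefix already present in the KV cache after loadCache runs.
type inputCache struct {
	Inputs []input
}

// Sequence mirrors the shape described in the PR; the real struct in
// runner/ollamarunner/runner.go has many more fields.
type Sequence struct {
	cache           *inputCache
	numCachedInputs int // new: prompt inputs served from the KV cache
}

func main() {
	seq := &Sequence{cache: &inputCache{Inputs: make([]input, 42)}}

	// Captured immediately after loadCache populates seq.cache.Inputs:
	seq.numCachedInputs = len(seq.cache.Inputs)

	fmt.Println("prompt cache hits:", seq.numCachedInputs) // 42
}
```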

On the server side, `llm/server.go` gains the corresponding `PromptCachedCount int` field with the `prompt_cached_count,omitempty` JSON tag so the information surfaces through the HTTP API without breaking backward compatibility.
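A sketch of the server-side struct and the `omitempty` behavior: the `PromptCachedCount` field and its tag are taken from the PR, but the struct is trimmed to one representative field rather than the full `CompletionResponse`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CompletionResponse, trimmed to illustrate the new field.
type CompletionResponse struct {
	Content           string `json:"content"`
	PromptCachedCount int    `json:"prompt_cached_count,omitempty"`
}

func main() {
	// Zero hits: the field is omitted, so existing clients see no change.
	cold, _ := json.Marshal(CompletionResponse{Content: "hi"})
	fmt.Println(string(cold)) // {"content":"hi"}

	// Nonzero hits: the count surfaces in the JSON payload.
	warm, _ := json.Marshal(CompletionResponse{Content: "hi", PromptCachedCount: 9})
	fmt.Println(string(warm)) // {"content":"hi","prompt_cached_count":9}
}
```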

Fixes #15758


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:55:19 -05:00