[PR #13141] [MERGED] kvcache: Use SetRows to store cache data #24632

Closed
opened 2026-04-19 17:41:59 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13141
Author: @jessegross
Created: 11/19/2025
Status: Merged
Merged: 11/19/2025
Merged by: @jessegross

Base: main ← Head: jessegross/set_rows


📝 Commits (2)

  • 07f10e3 ggml: Automatically make tensors contiguous on reshape
  • af5ca95 kvcache: Use SetRows to store cache data

📊 Changes

4 files changed (+168 additions, -235 deletions)


📝 kvcache/causal.go (+41 -174)
📝 kvcache/causal_test.go (+115 -61)
📝 ml/backend.go (+1 -0)
📝 ml/backend/ggml/ggml.go (+11 -0)

📄 Description

We currently copy data into the KV cache in contiguous buffers using ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations, so contiguous buffers are no longer required. The primary direct benefit is that we no longer need to perform defragmentation.
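
To make the difference between the two storage strategies concrete, here is a minimal sketch in plain Go. It does not use the ollama `ml` package or GGML: `copyContiguous`, `setRows`, and the flat `[]float32` cache layout are hypothetical stand-ins chosen for illustration, not the code in this PR. It only shows why a scatter write can fill whichever cache rows happen to be free, while a contiguous copy needs one unbroken run of free rows (and therefore defragmentation once the cache gets fragmented).

```go
package main

import "fmt"

// copyContiguous mimics a ggml_cpy()-style store (illustrative only): all new
// rows land in one contiguous run of cache rows beginning at row start.
func copyContiguous(cache, src []float32, rowSize, start int) {
	copy(cache[start*rowSize:], src)
}

// setRows mimics a ggml_set_rows()-style scatter (illustrative only): row i of
// src is written to cache row idx[i], so destination rows need not be adjacent.
func setRows(cache, src []float32, rowSize int, idx []int) error {
	if len(src) != len(idx)*rowSize {
		return fmt.Errorf("src has %d values, want %d", len(src), len(idx)*rowSize)
	}
	rows := len(cache) / rowSize
	for i, r := range idx {
		if r < 0 || r >= rows {
			return fmt.Errorf("row index %d out of range [0,%d)", r, rows)
		}
		copy(cache[r*rowSize:(r+1)*rowSize], src[i*rowSize:(i+1)*rowSize])
	}
	return nil
}

func main() {
	const rowSize = 2
	cache := make([]float32, 6*rowSize) // six cache rows; assume only rows 1 and 4 are free
	src := []float32{1, 1, 2, 2}        // two new rows to store

	// The free rows are not adjacent, so a contiguous copy would first need
	// the cache to be defragmented. A scatter write can use them directly.
	if err := setRows(cache, src, rowSize, []int{1, 4}); err != nil {
		panic(err)
	}
	fmt.Println(cache)
}
```

In this sketch the index slice plays the role of the row-index tensor passed to a set-rows operation: the caller decides where each row goes, and the store itself never needs the destination to be contiguous.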

However, GGML recently removed an optimization for ggml_cpy(), and we picked up that change in 544b673 "ggml update to b6840 (#12791)". This caused a roughly 40% drop in token generation performance on CUDA because CUDA graphs were no longer being used. By switching to ggml_set_rows(), the original optimization is no longer necessary and CUDA performance is restored.

Fixes #13112


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 17:41:59 -05:00
Reference: github-starred/ollama#24632