[PR #9591] [MERGED] ml: Add support for quantized KV cache #13010

Closed
opened 2026-04-13 00:15:20 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9591
Author: @jessegross
Created: 3/8/2025
Status: Merged
Merged: 3/8/2025
Merged by: @jessegross

Base: main ← Head: jessegross/kvquant


📝 Commits (3)

  • 03fe41b ggml-backend: Ensure allocation meet backend requirements
  • 375a875 kvcache: Set context for shift offsets
  • bf72aa3 ml: Add support for quantized KV cache

📊 Changes

4 files changed (+20 additions, -5 deletions)


📝 kvcache/causal.go (+1 -1)
📝 ml/backend.go (+3 -1)
📝 ml/backend/ggml/ggml.go (+14 -1)
📝 runner/ollamarunner/cache.go (+2 -2)

📄 Description

As with the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
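
The gating described above means a quantized cache type only takes effect when flash attention is on. Below is a minimal, hypothetical Go sketch of that fallback logic; it is not the code from this PR, and the function and variable names (`kvCacheTypeFromString`, `flashAttention`) are illustrative assumptions.

```go
// Hypothetical sketch of the gating the PR description implies: a quantized
// KV cache type is only honored when flash attention is enabled; otherwise
// the cache falls back to the unquantized f16 default.
package main

import "fmt"

// kvCacheTypeFromString maps a requested cache type to the type actually
// used, given whether flash attention is enabled. Names are illustrative,
// not taken from the ollama codebase.
func kvCacheTypeFromString(requested string, flashAttention bool) string {
	switch requested {
	case "q8_0", "q4_0":
		if !flashAttention {
			// Quantized K/V tensors need the flash attention path,
			// so fall back to the unquantized default.
			return "f16"
		}
		return requested
	default:
		return "f16"
	}
}

func main() {
	fmt.Println(kvCacheTypeFromString("q8_0", false)) // f16
	fmt.Println(kvCacheTypeFromString("q8_0", true))  // q8_0
}
```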


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:15:20 -05:00

Reference: github-starred/ollama#13010