[PR #9987] [MERGED] Improve memory estimates for sliding window attention #18384

Closed
opened 2026-04-16 06:33:32 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9987
Author: @jessegross
Created: 3/25/2025
Status: Merged
Merged: 3/26/2025
Merged by: @jessegross

Base: main ← Head: jessegross/swa_estimation


📝 Commits (3)

  • 965e61e kvcache: Sliding window cache only needs a single batch total
  • 76628c3 llm: Fix debug logging for memory estimates
  • 8216285 ggml: Support heterogeneous KV cache layer sizes in memory estimation

📊 Changes

6 files changed (+52 additions, -31 deletions)


📝 fs/ggml/ggml.go (+27 -12)
📝 kvcache/causal.go (+2 -2)
📝 llm/memory.go (+16 -10)
📝 llm/memory_test.go (+2 -2)
📝 llm/server.go (+1 -1)
📝 server/sched.go (+4 -4)

📄 Description

Recent optimizations for sliding window attention significantly reduced memory usage for models that use it. However, our memory estimates didn't reflect that, causing us to spill over to system memory unnecessarily in cases where GPU memory is constrained.
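The commits describe the idea behind the fix: a sliding-window layer only needs to cache the window plus a single batch of tokens rather than the full context, and the estimator must allow KV cache layer sizes to differ per layer. The sketch below illustrates that accounting in Go; the `Layer` type, field names, and `kvCacheEstimate` function are hypothetical for illustration and are not Ollama's actual API.

```go
package main

import "fmt"

// Layer describes one attention layer's KV cache parameters.
// Hypothetical type for illustration; not Ollama's actual structs.
type Layer struct {
	SlidingWindow uint64 // window size in tokens; 0 means full attention
	BytesPerToken uint64 // K+V bytes cached per token in this layer
}

// kvCacheEstimate sums per-layer KV cache sizes, allowing layers to be
// heterogeneous. A sliding-window layer only needs to hold its window
// plus a single batch of tokens, not the full context.
func kvCacheEstimate(layers []Layer, numCtx, batchSize uint64) uint64 {
	var total uint64
	for _, l := range layers {
		tokens := numCtx
		if l.SlidingWindow > 0 {
			needed := l.SlidingWindow + batchSize
			if needed < tokens {
				tokens = needed
			}
		}
		total += tokens * l.BytesPerToken
	}
	return total
}

func main() {
	layers := []Layer{
		{SlidingWindow: 0, BytesPerToken: 1024},    // full-attention layer
		{SlidingWindow: 4096, BytesPerToken: 1024}, // sliding-window layer
	}
	// With a 32768-token context and 512-token batches, the sliding-window
	// layer caches only 4096+512 tokens instead of 32768.
	fmt.Println(kvCacheEstimate(layers, 32768, 512))
}
```

Without the sliding-window term, both layers would be charged for the full 32768-token context, which is the overestimate the PR removes.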


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/ollama#18384