[PR #9892] [MERGED] Optimize sliding window attention #23624

Closed
opened 2026-04-19 17:06:59 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9892
Author: @jessegross
Created: 3/19/2025
Status: Merged
Merged: 3/21/2025
Merged by: @jessegross

Base: main ← Head: jessegross/swa


📝 Commits (2)

  • a55ba0b kvcache: Pass granular cache size into implementations
  • b975d8f kvcache: Optimize sliding window attention

📊 Changes

7 files changed (+106 additions, -33 deletions)

View changed files

📝 kvcache/cache.go (+7 -2)
📝 kvcache/causal.go (+68 -16)
📝 kvcache/causal_test.go (+17 -7)
📝 kvcache/encoder.go (+5 -1)
📝 kvcache/wrapper.go (+2 -2)
📝 runner/ollamarunner/cache.go (+6 -4)
📝 runner/ollamarunner/runner.go (+1 -1)

📄 Description

Currently sliding window attention allocates and uses the full context size and just masks out any tokens that are outside of the window. However, we really only need (roughly) the sliding window size.
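As a rough illustration of the sizing idea (a hedged sketch only; the function and parameter names below are hypothetical and not the actual `kvcache` API): a sliding-window layer only needs to retain roughly window-size-plus-batch entries, while a non-sliding (global attention) layer still needs the full context.

```go
// Minimal sketch, not the PR's implementation: per-layer cache sizing when a
// layer uses sliding window attention. Names here are illustrative.
package main

import "fmt"

// cacheSize returns how many cache slots a layer needs. A sliding-window layer
// only has to retain roughly windowSize + batchSize entries; a non-sliding
// (global attention) layer still needs the full context.
func cacheSize(numCtx, batchSize, windowSize int32) int32 {
	if windowSize <= 0 || windowSize >= numCtx {
		return numCtx // global attention: keep everything
	}
	size := windowSize + batchSize
	if size > numCtx {
		size = numCtx
	}
	return size
}

func main() {
	// E.g. with a 32k context, a 512-token batch, and a 1024-token window,
	// a sliding layer keeps ~1.5k entries instead of 32k.
	fmt.Println(cacheSize(32768, 512, 1024)) // 1536
	fmt.Println(cacheSize(32768, 512, 0))    // 32768 (non-sliding layer)
}
```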

At large context sizes this improves two things:

  • Memory allocated - since the full context size was previously allocated up front, memory requirements drop substantially. On Gemma3:4b with a 32k context window, total memory usage (including weights and non-sliding layers) drops from ~20GB to ~8GB.
  • Computation - ranges that are completely outside of the sliding window are now removed from the tensors that are returned from the cache rather than simply being masked out (see the sketch after this list). This results in more efficient processing, scaling with the size of the context that has actually been used.
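A hedged sketch of the second point, trimming rather than masking: instead of building a mask over the full cached range, the cache can compute which positions fall inside the window and return only that range. The function below is illustrative and does not reflect the `kvcache` package's actual interface.

```go
// Hedged sketch: compute the slice of cached positions a query can attend to
// under a causal sliding window, instead of masking the full cached range.
package main

import "fmt"

// windowRange returns the [start, end) range of cached positions that a query
// at position pos can attend to with a sliding window of size window.
func windowRange(pos, window, cachedLen int32) (start, end int32) {
	start = pos - window + 1
	if start < 0 {
		start = 0
	}
	end = pos + 1
	if end > cachedLen {
		end = cachedLen
	}
	return start, end
}

func main() {
	// A query at position 5000 with a 1024-token window only touches ~1024
	// cached entries, not all 5001.
	start, end := windowRange(5000, 1024, 5001)
	fmt.Printf("attend over cache[%d:%d] (%d entries)\n", start, end, end-start)
}
```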

Notably, this does not update the scheduler for any model to be aware of the smaller memory requirements. This is difficult for Gemma3 because the layers are heterogeneous between sliding and non-sliding attention. As a result, while actual memory consumption will be reduced, the scheduler will over-estimate the requirements of the model. This means that splitting between GPUs, or between GPUs and CPUs, will still be suboptimal.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 17:06:59 -05:00

Reference: github-starred/ollama#23624