[PR #9751] Add --no-kv-offload parameter #18324

Open
opened 2026-04-16 06:31:54 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9751
Author: @apt-install-coffee
Created: 3/14/2025
Status: 🔄 Open

Base: main ← Head: add-no-kv-offload-parameter


📝 Commits (1)

  • 2c286f8 Add --no-kv-offload parameter

📊 Changes

7 files changed (+19 additions, -3 deletions)


📝 cmd/cmd.go (+1 -0)
📝 envconfig/config.go (+3 -0)
📝 llama/llama.go (+2 -1)
📝 llm/server.go (+4 -0)
📝 ml/backend.go (+3 -0)
📝 runner/llamarunner/runner.go (+4 -2)
📝 runner/ollamarunner/runner.go (+2 -0)

📄 Description

Allows pushing the context window (the KV cache) into system memory, letting VRAM-constrained users offload more model layers, especially for models with larger context windows. Currently, starting ollama with OLLAMA_NO_KV_OFFLOAD=1 and OLLAMA_FLASH_ATTENTION=1 enables this. I believe flash attention is a prerequisite, but my understanding of the relationship between flash attention and the KV cache is minimal.

I am unfamiliar with this codebase; any pointers on additional changes that are still missing would be very welcome.
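
For illustration, here is a minimal sketch of how the environment variable could be registered in envconfig/config.go, assuming it follows the same boolean-helper pattern as OLLAMA_FLASH_ATTENTION. The Bool helper shown is a simplified stand-in for the real one, and the variable name NoKVOffload is a guess, not taken from the diff:

```go
package envconfig

import (
	"os"
	"strconv"
)

// Bool returns a getter reporting whether the named environment variable
// is set to a truthy value. Simplified stand-in for the existing helper
// in envconfig/config.go.
func Bool(k string) func() bool {
	return func() bool {
		if s := os.Getenv(k); s != "" {
			if b, err := strconv.ParseBool(s); err == nil {
				return b
			}
			return true // treat any non-empty, unparseable value as set
		}
		return false
	}
}

var (
	// Existing: enables flash attention.
	FlashAttention = Bool("OLLAMA_FLASH_ATTENTION")
	// New (hypothetical name): keep the KV cache in system memory
	// instead of offloading it to VRAM.
	NoKVOffload = Bool("OLLAMA_NO_KV_OFFLOAD")
)
```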

TODO:

  • When OLLAMA_NEW_ENGINE=0, this seems to work and performs really well; it seems to fail when OLLAMA_NEW_ENGINE=1, though, and I believe I am missing some areas in runner/ollamarunner that need to be modified.
  • Anything extra that may be needed to allow this parameter to be changed when a model is loaded, rather than only when ollama starts (a rough sketch of the startup-time plumbing follows this list).
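
For reference, a rough sketch of what the startup-time plumbing in llm/server.go might look like: if the environment variable is set, the flag is appended to the runner subprocess's arguments. The helper name and surrounding details are illustrative, not the actual upstream code:

```go
package llm

// appendKVOffloadArg is illustrative only: it shows how llm/server.go
// might forward the setting when building the runner's argument list.
// noKVOffload would come from envconfig.NoKVOffload() in the real code.
func appendKVOffloadArg(params []string, noKVOffload bool) []string {
	if noKVOffload {
		// --no-kv-offload matches the flag named in the PR title; the
		// runner would translate it into llama.cpp's KV-offload setting.
		params = append(params, "--no-kv-offload")
	}
	return params
}
```

Supporting per-model configuration, as the second TODO item asks, would presumably mean threading the same option through the model-load request path instead of reading it once at startup.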

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 06:31:54 -05:00